Overloaded server

For smartctl with MegaRAID controllers, you can have a look here:
thomas-krenn.com/en/wiki/Sm … Controller
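If you just want a quick health summary of each physical disk behind the controller, a loop along these lines should do (the megaraid,1 through megaraid,5 numbering is an assumption; adjust it to your controller):

# quick SMART health check of each physical disk behind the MegaRAID controller
for i in 1 2 3 4 5; do
    echo "=== megaraid,$i ==="
    smartctl -H -d megaraid,$i /dev/sda
done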

[quote=“Minus”]
It’s working, but for how long?! We need to keep investigating.

edit: I haven’t done any updates on this server recently.

Thanks[/quote]

Have you walked the whole directory tree? Run:

cd /
ls -lR

and see whether it hangs anywhere.
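If you want to see how long it takes, you can time it, e.g.:

# time a full recursive listing, discarding the output
time ls -lR / > /dev/null 2>&1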

Was one of the clients updated?

The ls -lR went through without any problem on the first try, no waiting at all, and the same for a plain ls -l on the root, whereas it used to take ages before.

There have been client updates, but nothing that could have touched the NFS mount.

Here are the smartctl tests:

smartctl -a -d megaraid,1  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_01] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF521738
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:20:18 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (5400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   185   185   021    Pre-fail  Always       -       3733
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       63
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68318
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       63
190 Airflow_Temperature_Cel 0x0022   078   034   045    Old_age   Always   In_the_past 22
194 Temperature_Celsius     0x0022   125   081   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       9
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 126 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 126 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:02:00.891  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:58.894  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT

Error 125 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:58.894  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT

Error 124 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT

Error 123 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4e 00 00   5d+18:01:50.300  READ VERIFY SECTOR(S) EXT

Error 122 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4e 00 00   5d+18:01:50.300  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4d 00 00   5d+18:01:50.024  READ VERIFY SECTOR(S) EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a -d megaraid,2  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_02] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF319433
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:21:13 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68327
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   075   048   045    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   122   095   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a -d megaraid,3  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_03] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF543667
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:22:12 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68328
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   077   054   045    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   124   101   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a -d megaraid,4  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_04] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF319434
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:22:56 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   185   185   021    Pre-fail  Always       -       3750
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68320
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   077   050   045    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   124   097   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a -d megaraid,5  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_05] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF521499
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:23:30 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   186   186   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68320
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   076   052   045    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   123   099   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

From this I gather that disk 1 really does have a problem, is that right? (Strange, because it is not reported as failed on the server itself, going by the colour of the LED for that disk.)

Disk 1 had a problem (overheating, by the look of it), but that was 6000 hours ago, almost a year. Your disks look fairly old (7/8 years) and are all the same age (careful, they may well fail at the same time). But apparently no serious recent errors.
The problem may come from there, but nothing (in my view) allows us to say so for certain.
In the June 4 syslogs, is there an error of the kind r/w error, lseek error, IO error, etc.? (See /var/log/syslog.3.gz, /var/log/syslog.4.gz and /var/log/syslog.5.gz.)
You can also look for NFS errors in /var/log/daemon.log or /var/log/daemon.log.1.
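For instance, something along these lines should pull out the relevant lines (adjust the pattern to taste):

# search the rotated, compressed syslogs for I/O-related errors
zgrep -iE 'i/o error|lseek|r/w error' /var/log/syslog.[345].gz
# and look for NFS-related messages in the daemon logs
grep -i nfs /var/log/daemon.log /var/log/daemon.log.1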

It’s still odd all the same. On the client side, no problems reported on June 3 or 4?

Yes, I did see that line:
190 Airflow_Temperature_Cel 0x0022 078 034 045 Old_age Always In_the_past 22

but where did you see the 6000 hours figure?

They don’t just look old, they are old; that’s actually why I make regular backups and say a prayer every time I reboot the server :stuck_out_tongue:

[quote=“fran.b”]In the June 4 syslogs, is there an error of the kind r/w error, lseek error, IO error, etc.? (See /var/log/syslog.3.gz, /var/log/syslog.4.gz and /var/log/syslog.5.gz.)
You can also look for NFS errors in /var/log/daemon.log or /var/log/daemon.log.1.
It’s still odd all the same. On the client side, no problems reported on June 3 or 4?[/quote]

So:

On the client side, no errors other than the latency. In fact, this weekend the server was so overloaded that the VPN (hosted on that same server) refused all connections. The reboot put everything back in working order.

The only “error” messages I found in the daemon and syslog files (I went through all of them) are:

daemon.log.1
Line 4395: Jun  8 09:11:03 server1 nrpe[2389]: Continuing with errors...

syslog.1
Line 15540: Jun  8 09:10:59 server1 kernel: [    0.374033] ACPI Error (psargs-0359): [CDW1] Namespace lookup failure, AE_NOT_FOUND
Line 15541: Jun  8 09:10:59 server1 kernel: [    0.374042] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_._OSC] (Node ffff88022fc29680), AE_NOT_FOUND
Line 16043: Jun  8 09:10:59 server1 kernel: [    4.034044] PM: Error -22 checking image file
Line 16120: Jun  8 09:10:59 server1 kernel: [    8.329051] Error: Driver 'pcspkr' is already registered, aborting...
Line 16222: Jun  8 09:11:03 server1 nrpe[2389]: Continuing with errors...

As for NFS messages, here is what I found:

syslog.1 (7 hits)
Line 16155: Jun  8 09:10:59 server1 kernel: [   95.457547] RPC: Registered tcp NFSv4.1 backchannel transport module.
Line 16159: Jun  8 09:10:59 server1 kernel: [   95.686222] FS-Cache: Netfs 'nfs' registered for caching
Line 16160: Jun  8 09:10:59 server1 kernel: [   95.753504] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Line 16163: Jun  8 09:10:59 server1 kernel: [   96.688442] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Line 16163: Jun  8 09:10:59 server1 kernel: [   96.688442] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Line 16163: Jun  8 09:10:59 server1 kernel: [   96.688442] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Line 16164: Jun  8 09:10:59 server1 kernel: [   96.698796] NFSD: starting 90-second grace period

syslog.4 (1 hit)
Line 126: May 11 11:15:48 server1 kernel: [1199388.146167] nfsd: peername failed (err 107)!

In any case, thanks a lot for your help.

[quote=“Minus”]Yes, I did see that line:
190 Airflow_Temperature_Cel 0x0022 078 034 045 Old_age Always In_the_past 22

but where did you see the 6000 hours figure?[/quote]
Here:

[quote]Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 126 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.
[/quote]

The errors here are the serious ones.

Something that surprises me is that your machine seems to have been rebooted on June 8 around 9:10

(line Jun 8 09:10:59 server1 kernel: [ 96.698796]...). Can you confirm?
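You can double-check the boot time directly, for instance with:

# last recorded reboots according to wtmp
last reboot | head -5
# or just the current boot time
who -b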

Yes, I’m the one who rebooted it on Monday morning.

The VPN VM hosted on this server was stuck and the load average was around 45.

Since it really was urgent to regain control, I went at it a bit brutally, but I didn’t have much choice.

Well, there is really nothing that points to a problem. All that’s left is to wait for it to happen again:

  • A disk acting up would not have caused problems of this kind (that’s what RAID is for).
  • An NFS problem is still odd on a server, but from experience and from reading various docs I know NFS to be rather fragile and capable of causing freezes like this, hence my strong suspicion of it.

But at the moment nothing supports any particular hypothesis, apart from the error messages you got, which are NFS-related. But is that the cause or the effect?

If it happens again, I suggest running on the server, in this order, an «rpcinfo -p» to see which services are running, and then a restart of the NFS server.
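Something along these lines, assuming the standard Debian nfs-kernel-server init script:

# list the RPC services currently registered (portmapper, mountd, nfs, nlockmgr...)
rpcinfo -p
# then restart the NFS server (script name assumed from a standard Debian install)
/etc/init.d/nfs-kernel-server restart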

There were 22 frozen updatedb processes, which would mean the problem started 22 days before June 5, i.e. Thursday, May 14. But I saw there was a frozen incremental backup dating from April 22! That is not consistent, unless we assume the problem comes and goes, so some updatedb runs would have worked and others not. In that case you may well find traces of the incident in the logs. It is probably worth going back quite far in the logs looking for NFS or portmap problems, or IO errors.

I also suggest keeping an eye on your machine, checking whether updatedb or other processes are frozen. You will then be able to see what is going on; apparently the server takes a while before going completely off the rails…
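A quick way to spot them is to look for processes stuck in uninterruptible sleep (state D), for instance:

# list processes blocked in uninterruptible sleep (usually waiting on disk or NFS I/O)
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'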

Check whether you still have older logs.

Thanks a lot for your help.

I will indeed keep a closer eye on this server than usual.

As for the logs, I went back to the oldest one kept in /var/log/.

Right, let’s leave this thread open then, waiting for the next bug.

Thanks again :023

A quick follow-up after two weeks of activity.

The server is doing well (admittedly I shut down a few VMs of rather secondary usefulness); little or no latency, and a load average below 0.5 most of the time.

Well, to be continued…

Hello

Well, I’m coming back to you with a problem of the same kind, which is why I’m adding to this thread.

The server this thread was about is dead, so I replaced it; and phew, no more virtualization mixing KVM and OpenVZ: the base is Windows Server and all the Debian VMs are going to be virtualized under Hyper-V.

On the other hand, the other server has started acting up. It keeps spewing out miles of this:

576965 ?        Ss     0:00   /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc
 576967 ?        S      0:00     run-parts --report /etc/cron.daily
 577026 ?        S      0:00       /bin/sh /etc/cron.daily/locate
 577031 ?        SN     0:00         /bin/sh /usr/bin/updatedb.findutils
 577039 ?        SN     0:00           /bin/sh /usr/bin/updatedb.findutils
 577064 ?        SN     0:00             su nobody -s /bin/sh -c /usr/bin/find / -ignore_readdir_race   
 577076 ?        SNs    0:00               sh -c /usr/bin/find / -ignore_readdir_race      \( -fstype NF
 577077 ?        DN     0:05                 /usr/bin/find / -ignore_readdir_race ( -fstype NFS -o -fsty

and so it overloads the server.

How do I get rid of this thing? I simply tried killing it, but it keeps coming back.

Thanks in advance

Remove /etc/cron.daily/locate,

but it would be worth looking at that file; normally there is a lock file, /var/lib/locate/daily.lock, that prevents multiple instances from being started. Does the daemon have write permission on that directory?
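To check, something like this (using the path mentioned above):

# who owns the lock directory, and can the locate job write there?
ls -ld /var/lib/locate
ls -l /var/lib/locate/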

Hello

I removed it; well, I copied it into root’s home directory and then deleted it from /etc/cron.daily.

Here is what that file contains:

#! /bin/sh

set -e

# cron script to update the `locatedb' database.
#
# Written by Ian A. Murdock <imurdock@debian.org> and 
#            Kevin Dalley <kevin@aimnet.com>

# Please consult updatedb(1) and /usr/share/doc/locate/README.Debian

[ -e /usr/bin/updatedb.findutils ] || exit 0

if [ "$(id -u)" != "0" ]; then
        echo "You must be root."
        exit 1
fi
 
# Global options for invocations of find(1)
FINDOPTIONS='-ignore_readdir_race'
# filesystems which are pruned from updatedb database
PRUNEFS="NFS nfs nfs4 afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre_lite tmpfs usbfs udf ocfs2"
# paths which are pruned from updatedb database
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /amd /alex /var/spool /sfs /media /var/lib/schroot/mount"
# netpaths which are added
NETPATHS=""
# run find as this user
LOCALUSER="nobody"
# cron.daily/find: run at this priority -- higher number means lower priority
# (this is relative to the default which cron sets, which is usually +5)
NICE=10

# I/O priority
# 1 for real time, 2 for best-effort, 3 for idle ("3" only allowed for root)
IONICE_CLASS=3
# 0-7 (only valid for IONICE_CLASS 1 and 2), 0=highest, 7=lowest 
IONICE_PRIORITY=7

# allow keeping local customizations in a separate file
if [ -r /etc/updatedb.findutils.cron.local ] ; then
        . /etc/updatedb.findutils.cron.local
fi
export FINDOPTIONS PRUNEFS PRUNEPATHS NETPATHS LOCALUSER

# Set the task to run with desired I/O priority if possible
# Linux supports io scheduling priorities and classes since
# 2.6.13 with the CFQ io scheduler
if [ -x /usr/bin/ionice ] && [ "${UPDATDB_NO_IONICE}" = "" ]; then
        # don't run ionice if kernel version < 2.6.13
        KVER=$(uname -r)
        case "$KVER" in
                2.[012345]*) ;;
                2.6.[0-9]) ;;
                2.6.[0-9].*) ;;
                2.6.1[012]*) ;;
                *)
                # Avoid providing "-n" when IONICE_CLASS isn't 1 or 2
                case "$IONICE_CLASS" in
                        1|2) priority="-n ${IONICE_PRIORITY:-7}" ;;
                        *) priority="" ;;
                esac
                ionice -c $IONICE_CLASS $priority -p $$ > /dev/null 2>&1 || true
                ;;
        esac
fi

if getent passwd $LOCALUSER > /dev/null ; then
  cd / && nice -n ${NICE:-10} updatedb.findutils 2>/dev/null
else
  echo "User $LOCALUSER does not exist."
  exit 1
fi

Unfortunately, even with that file gone, the pollution continues and the same processes keep showing up in the process list.

Hi
I’ll let fran.b handle this one :slightly_smiling:
but here is a small suggestion that may help. It is more of a workaround, since I don’t know what your cron job does.

You can change two parameters, at least with htop, to get yourself some breathing room.

Rather than killing it, change its “nice” value (not sure of the name; in htop it was the ‘a’ key, I think, but I’m not certain). You can also force it to run on a single core only.
To get control of the system back more easily, you can also stop the process without actually killing it; that will show whether the load drops or not.
In your cron job, change the “nice” value and the CPU affinity. At worst you can use “strace”, but good luck with the reading afterwards :s
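For reference, a rough sketch of the same ideas on the command line (using PID 577077 from the listing above purely as an example):

# lower the CPU and I/O priority of an already-running process
renice -n 19 -p 577077
ionice -c3 -p 577077
# pin it to a single core (core 0)
taskset -cp 0 577077
# or pause it without killing it, to see whether the load drops
kill -STOP 577077
# and resume it later
kill -CONT 577077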

Given that I’m not the one who set this server up, and there was no knowledge handover, I’m gradually discovering the inner workings of what was put in place.

But still, 199 processes is starting to be a lot.

I don’t quite understand what you mean in your post :017

Those processes were started by cron.daily. You need to kill the existing ones; with the script gone, no new ones should appear from now on. Run a «grep -r updatedb /etc/cro*» to check.
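A sketch of how to clean up the leftovers; note that the find processes shown in state D are stuck in uninterruptible sleep and will only die once the blocking I/O returns:

# kill the remaining updatedb/find chains by matching their command line
pkill -f updatedb.findutils
pkill -f 'find / -ignore_readdir_race'
# check that nothing is left
pgrep -fl updatedb.findutils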

Hello

I just ran the “grep -r updatedb /etc/cro*” and indeed there is no output any more.

I killed all the processes that were still running and the problem does not seem to be coming back.

It feels good to go from a load average of 35.00 down to 0.07.

Thanks for your help

Err, I’m seeing 62.215 … so a long way from 6.000

[quote=“Minus”]Err, I’m seeing 62.215 … so a long way from 6.000[/quote]
The disk’s total running time is

[quote] 7 Seek_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 007 007 000 Old_age Always - 68318
10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0[/quote]68318 hours; the problem occurred at hour 62215 of operation, that is roughly 6000 hours earlier (68318 - 62215 = 6103 hours). (Where do you read 62,215?)

OK, now I understand better.
I wrote a period, not a comma, just to make the number easier to read.