SMART error (OfflineUncorrectableSector) detected on host

Hello, I keep receiving this email:

This message was generated by the smartd daemon running on:

   host name:  debian
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Offline uncorrectable sectors

Device info:
WDC WD5000BPVT-16HXZT3, S/N:WD-WXJ1A91K7527, WWN:5-0014ee-601c704f3, FW:03.01A03, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed May 11 12:44:13 2016 CEST
Another message will be sent in 24 hours if the problem persists.

smartctl -s on -a /dev/sda

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Scorpio Blue Serial ATA (AF)
Device Model:     WDC WD5000BPVT-16HXZT3
Serial Number:    WD-WXJ1A91K7527
LU WWN Device Id: 5 0014ee 601c704f3
Firmware Version: 03.01A03
User Capacity:    500 107 862 016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat May 28 08:52:20 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(12180) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 121) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   178   174   021    Pre-fail  Always       -       2083
  4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15976
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       19937
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3440
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       418
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2010413
194 Temperature_Celsius     0x0022   120   091   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     19931         -
# 2  Short offline       Completed without error       00%     12681         -
# 3  Short offline       Completed without error       00%      6903         -
# 4  Extended offline    Aborted by host               50%      6903         -
# 5  Short offline       Aborted by host               10%      6902         -
# 6  Short offline       Aborted by host               80%      6902         -
# 7  Short offline       Completed without error       00%      6693         -
# 8  Short offline       Completed without error       00%      6548         -
# 9  Short offline       Completed without error       00%      6245         -
#10  Short offline       Completed without error       00%      6245         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

cat /var/log/syslog | grep Prefailure
Apr 18 08:31:29 debian smartd[718]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 179 to 180
Apr 19 10:33:42 debian smartd[718]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 180 to 179
Apr 23 18:31:06 debian smartd[718]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 179 to 178
Apr 25 08:56:19 debian smartd[687]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 177
Apr 25 15:27:24 debian smartd[748]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 175
Apr 27 10:06:33 debian smartd[703]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 177
Apr 27 14:36:11 debian smartd[703]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 178
Apr 27 15:06:36 debian smartd[703]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 181
Apr 28 06:18:03 debian smartd[759]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 100 to 99
Apr 28 06:18:03 debian smartd[759]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 181 to 180
Apr 28 08:11:14 debian smartd[763]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 111
Apr 28 08:23:06 debian smartd[786]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 180 to 179
Apr 30 14:12:30 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 179 to 177
May 4 19:39:29 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 116
May 4 19:39:31 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 176
May 5 08:08:45 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 177
May 6 16:01:07 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117
May 6 16:01:07 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 178
May 7 08:59:04 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 179
May 7 13:19:07 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 179 to 178
May 8 10:09:33 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 105
May 8 10:39:33 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 114
May 10 13:03:18 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 116
May 10 13:03:19 debian smartd[761]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 177
May 10 19:39:34 debian smartd[761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117
May 11 18:42:06 debian smartd[696]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 176
May 11 19:38:45 debian smartd[704]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 175
May 12 08:20:43 debian smartd[704]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 176
May 12 12:02:16 debian smartd[704]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 177
May 12 18:29:02 debian smartd[725]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 176
May 12 19:20:17 debian smartd[697]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 175
May 13 08:22:06 debian smartd[697]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 176
May 14 08:07:01 debian smartd[699]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 177
May 14 14:46:36 debian smartd[702]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 176
May 14 15:16:34 debian smartd[702]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 175
May 15 07:45:21 debian smartd[694]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 176
May 17 08:33:12 debian smartd[694]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 177
May 17 15:49:19 debian smartd[694]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 178
May 17 19:15:28 debian smartd[694]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 177
May 19 20:48:57 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 175
May 20 07:47:02 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 175 to 176
May 20 12:39:12 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 176 to 177
May 21 10:05:42 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 177 to 178
May 26 07:59:16 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 178 to 179
May 26 10:43:43 debian smartd[692]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 179 to 178

The disk is healthy; the only error is on Raw_Read_Error_Rate.

“1 Offline uncorrectable sectors” is not a sign of a healthy disk. There is currently one unreadable sector, detected by the drive's background (offline) test. The good news is that the system has not needed to access that sector, so no visible error has occurred.

Hi,
To get the statistics:

smartctl -A

Example:

# smartctl -A /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.5.0-2-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0025   100   100   040    Pre-fail  Offline      -       2460
  3 Spin_Up_Time            0x0023   170   100   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       8804
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       2748
 10 Spin_Retry_Count        0x0033   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2195
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       4294901762
188 Command_Timeout         0x0032   087   059   000    Old_age   Always       -       410
190 Airflow_Temperature_Cel 0x0022   057   040   045    Old_age   Always   In_the_past 43 (Min/Max 29/45)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       51
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6029404
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       26868
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x002a   100   100   000    Old_age   Always       -       0

They were already in the first post, but here they are again. Attribute 198 shows that only one sector is damaged, so I don't suspect a failure of the mechanical subsystem but rather a surface defect on that sector, provided it stays at a single sector, of course. But how do I tell the system to ignore this sector (and not use it, of course) so that I stop receiving this warning email?

smartctl -A /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   178   174   021    Pre-fail  Always       -       2100
  4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15977
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       19945
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3441
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       418
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2012658
194 Temperature_Celsius     0x0022   120   091   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
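
To keep an eye on a single attribute such as 198 without reading the whole table, the raw value can be extracted from the smartctl -A output. This is only a minimal sketch: the function name smart_raw_value is made up for illustration, and it assumes the ten-column attribute table shown above.

```shell
# Print the RAW_VALUE column of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute ID.
# Assumption: the ID is in column 1 and the raw value in column 10,
# as in the attribute tables quoted in this thread.
smart_raw_value() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Possible usage (needs root and a real disk):
#   smartctl -A /dev/sda | smart_raw_value 198
```

If the attribute is not present in the table, the function prints nothing.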

To avoid using it, run fsck with the -c option on the partition that contains it; this performs a surface scan to detect the sector and mark it as bad. The problem: we don't know which partition it is in, and this only works on a file system (ext4, btrfs…), not on swap or other partition types.

You can run
badblocks -sv /dev/sda
(slow) to scan the whole disk and locate the bad sector (along with any others that have not been detected yet).
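
Once a bad block number is known, working out which partition holds it is simple arithmetic. The sketch below assumes badblocks was run with its default block size of 1024 bytes and that the disk uses 512-byte logical sectors (as the smartctl output above reports for /dev/sda). The function names are hypothetical, and the partition list is passed in as plain text so nothing touches the disk.

```shell
# Convert a badblocks block number to a 512-byte sector number (LBA).
# $1 = block number, $2 = badblocks block size in bytes (default 1024).
block_to_sector() {
    echo $(( $1 * ${2:-1024} / 512 ))
}

# Find which partition contains a sector.
# $1 = sector (LBA); stdin = lines of "name start_sector sector_count".
# Prints the partition name, or "none" if the sector is outside all of them.
partition_of_sector() {
    awk -v lba="$1" '
        lba >= $2 && lba < $2 + $3 { print $1; found = 1; exit }
        END { if (!found) print "none" }
    '
}

# Possible usage on Linux, where sysfs exposes start and size in 512-byte units:
#   for p in /sys/block/sda/sda[0-9]*; do
#       echo "${p##*/} $(cat "$p/start") $(cat "$p/size")"
#   done | partition_of_sector "$(block_to_sector 284127)"
```

A result of "none" would mean the sector lies in unpartitioned space, in which case no file system check can mark it.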

As for the email, I'm afraid smartd will keep sending it until the end of time, as long as the SMART attribute has not gone back to 0.

You can edit
/etc/smartd.conf

where it says
-m ADD Send warning email to ADD for -H, -l error, -l selftest, and -f

-M TYPE Modify email warning behavior (see man page)

You should have the -m option by default:
DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
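
As a possible middle ground between receiving the mail forever and switching monitoring off, smartd.conf has a -U directive for the offline-uncorrectable attribute: with a trailing "+", a warning is sent only when the count increases between two checks. The fragment below is only a sketch to check against man smartd.conf on your version; the device name and the mail recipient are taken from the lines quoted above.

```
# /etc/smartd.conf (sketch, not a tested configuration)
# -a       : default set of checks
# -U 198+  : warn about attribute 198 only when its raw value increases
# -m root  : send warnings to root
# Assumption: this line replaces the DEVICESCAN line (or is placed before it),
# because smartd ignores everything that follows DEVICESCAN.
/dev/sda -a -U 198+ -m root
```

smartd has to be restarted after the change for it to take effect.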

Thanks guys, I'll look into all that.

Yes, but mkfs -c /partition ext4 deletes the data, so I can't run that command.
Otherwise, to stop receiving the email and the syslog messages, I disabled the automatic start of SMART monitoring; I'll remember to check periodically.

Sorry, I meant fsck. I've corrected my previous message.

You can install Webmin, a graphical administration interface:

http://www.webmin.com/deb.html

which you use through the browser:
https://localhost:10000/

Hello,

For the email, it's not very complicated if you take the simplest route.

nano /etc/smartd.conf

At the end, add the disks the service should check:

/dev/sda -a
/dev/sdc -a

That way, if /dev/sdb is the problematic one, no more emails.
Reference: ArchLinux wiki, SMART

Here is a “smartctl -a” from one of my disks:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   195   193   051    Pre-fail  Always       -       11808
  3 Spin_Up_Time            0x0027   165   161   021    Pre-fail  Always       -       6733
  4 Start_Stop_Count        0x0032   075   075   000    Old_age   Always       -       25565
  5 Reallocated_Sector_Ct   0x0033   191   191   140    Pre-fail  Always       -       193
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18774
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   091   091   000    Old_age   Always       -       9742
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       384
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1377832
194 Temperature_Celsius     0x0022   117   104   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   110   110   000    Old_age   Always       -       90
197 Current_Pending_Sector  0x0032   198   001   000    Old_age   Always       -       872
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       335
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   195   193   000    Old_age   Offline      -       1485

The disk is still in use (it holds no important data).
After locating the areas with problematic sectors, I created two partitions:

1 TB | small problematic unallocated space | 500 GB | problematic remainder

Judging by a recent “smartctl -t short /dev/sdb” followed (2 or 3 minutes later) by “smartctl -a /dev/sdb”, I'll have to step in again because of a bad sector on the 500 GB partition.

# 1  Short offline       Completed: read failure       90%     18774         2520815424

Also, better than “smartctl -a” is “smartctl -x”: the report is a bit more complete.

If the sector has been identified, for example sector 568254 on disk /dev/sdb, the following commands are fairly effective at trying to force reallocation of the bad sector.
Warning: --write-sector overwrites the data in that sector…

To try to force a read of the sector, in the hope of an automatic reallocation by the drive (worth trying several times if it fails):

hdparm --read-sector 568254 --yes-i-know-what-i-am-doing /dev/sdb

To force automatic reallocation by the drive when the problem keeps coming back on the same sector, chaining these three commands seems to work well. The data in the sector is then lost:

hdparm --read-sector 568254 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --write-sector 568254 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --read-sector 568254 --yes-i-know-what-i-am-doing /dev/sdb
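
Since a typo in the sector number or the device name would destroy data somewhere else, it may help to generate the three commands first and read them over before running anything. A minimal sketch follows; the function name print_realloc_commands is invented here, and it only prints the command lines without calling hdparm.

```shell
# Print (do not run) the read / write / read sequence for one sector.
# $1 = sector number (LBA), $2 = device, e.g. /dev/sdb.
print_realloc_commands() {
    # Refuse anything that is not a plain non-negative integer.
    case "$1" in
        ''|*[!0-9]*) echo "sector must be a non-negative integer" >&2; return 1 ;;
    esac
    for op in read write read; do
        echo "hdparm --${op}-sector $1 --yes-i-know-what-i-am-doing $2"
    done
}

# Possible usage:
#   print_realloc_commands 568254 /dev/sdb
```

The printed lines can then be copied and run as root once the sector and device have been double-checked.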

Thanks for everything, guys. In the end it doesn't matter if I get the email until the end of time; I've re-enabled it… and in any case I'm in the habit of backing up to an external disk… in case the disk turns out to be less healthy…

I came back to this topic, and it's solved: I mounted every partition one by one and wrote to all the unused sectors,
running the rm right afterwards…

dd if=/dev/zero of=/mnt/zero; rm -f /mnt/zero

-->

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0