Overloaded server

Hello,

I'm turning to you because I'm at a loss.

I have an overloaded server (squeeze), but I can't find the source of the problem.

Here is the output of top:

top - 10:20:09 up 37 days, 20:14,  1 user,  load average: 35.10, 35.29, 35.49
Tasks: 389 total,   1 running, 388 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.2%us,  0.0%sy,  0.0%ni, 49.4%id, 50.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.2%us,  0.2%sy,  0.0%ni,  0.0%id, 99.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.6%sy,  0.0%ni, 19.3%id, 80.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.5%us,  1.0%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8178584k total,  6716256k used,  1462328k free,  1351280k buffers
Swap: 12292384k total,   130168k used, 12162216k free,  2146932k cached

How can the load average be at 35 when the CPUs are doing nothing?!

I also checked with nload, and there is no traffic above 1 Mbit/s at the moment.

One last piece of information:

dmesg|tail
[2573341.452326] ct0 statd: server rpc.statd not responding, timed out
[2573341.452345] lockd: cannot monitor hastur
[2573401.452533] ct0 statd: server rpc.statd not responding, timed out
[2573401.452552] lockd: cannot monitor hastur
[2573461.452526] ct0 statd: server rpc.statd not responding, timed out
[2573461.452544] lockd: cannot monitor hastur
[2573521.452525] ct0 statd: server rpc.statd not responding, timed out
[2573521.452545] lockd: cannot monitor hastur
[2664325.434926] Registering the dns_resolver key type
[3234280.118791] scanning ...

What is this?

ct0 statd: server rpc.statd not responding, timed out
lockd: cannot monitor hastur

Thanks for your help.

Can you check what htop shows (install it if you haven't already), as well as ps faux?

You have a lot of I/O wait (wa); the hard disks may be too slow for your needs.
To find out which process is doing the I/O, you can use the iotop command from the package of the same name.

This may mean that you have tasks waiting for an I/O operation that never completes. iotop, perhaps?
Look at the waiting processes with htop.
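
To put a number on the I/O wait mentioned above, here is a minimal Python sketch that pulls the %wa column out of top's per-CPU lines. The name iowait_per_cpu is made up for illustration, and the regular expression assumes the older procps layout pasted in this thread (Cpu0 : ... 50.4%wa); newer versions of top print these fields differently, so treat it as a starting point only.

```python
import re

# Matches lines such as "Cpu1  :  0.2%us, ... 99.5%wa, ..." (older procps layout).
CPU_LINE = re.compile(r"^(Cpu\d+)\s*:.*?([\d.]+)%wa", re.MULTILINE)


def iowait_per_cpu(top_output: str) -> dict:
    """Return {cpu_name: iowait_percent} for every per-CPU line found."""
    return {name: float(wa) for name, wa in CPU_LINE.findall(top_output)}


# The four per-CPU lines from the first top output in this thread.
SAMPLE = """\
Cpu0  :  0.2%us,  0.0%sy,  0.0%ni, 49.4%id, 50.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.2%us,  0.2%sy,  0.0%ni,  0.0%id, 99.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.6%sy,  0.0%ni, 19.3%id, 80.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.5%us,  1.0%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
"""

if __name__ == "__main__":
    for cpu, wa in iowait_per_cpu(SAMPLE).items():
        print(cpu, wa)
```

On that first top output, three of the four CPUs spend half of their time or more waiting on I/O while doing almost no user or system work.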

Hello, and thanks for your help.

This morning it's getting worse and worse:

top - 09:12:47 up 38 days, 19:06,  1 user,  load average: 40.04, 40.92, 40.77
Tasks: 399 total,   1 running, 398 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.5%us,  1.4%sy,  0.1%ni, 76.4%id, 20.5%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu1  :  2.4%us,  3.9%sy,  0.1%ni, 67.9%id, 25.7%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu2  :  1.6%us,  2.1%sy,  0.1%ni, 79.3%id, 16.7%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu3  :  1.8%us,  1.9%sy,  0.1%ni, 85.9%id, 10.2%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8178584k total,  6518444k used,  1660140k free,  1275156k buffers
Swap: 12292384k total,    73400k used, 12218984k free,  2291312k cached

Here are the results you asked for.

htop

  1  [#                                  0.6%]     Tasks: 331 total, 1 running
  2  [#**                                3.2%]     Load average: 40.03 40.72 40.70 
  3  [#*                                 1.3%]     Uptime: 38 days, 19:08:08
  4  [*                                  1.3%]
  Mem[|||||||||||||||#######******2883/7986MB]
  Swp[|                            71/12004MB]

Below this I have an endlessly long list of running processes, none of which uses any CPU or memory.

For iotop:

#iotop
Total DISK READ: 1013.06 B/s | Total DISK WRITE: 26.71 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                             
31894 be/4 libvirt-     20.00 K    248.00 K  0.00 %  2.51 % kvm -S -M pc-0.12 -~,bus=pci.0,addr=0x4
  306 be/4 root          0.00 B     96.00 K  0.00 %  1.56 % [kjournald]
31907 be/4 libvirt-      4.00 K    836.00 K  0.00 %  1.17 % kvm -S -M pc-0.12 -~,bus=pci.0,addr=0x4
30752 be/4 root          0.00 B     36.00 K  0.00 %  1.13 % [kjournald]
31855 be/4 libvirt-     20.00 K    196.00 K  0.00 %  0.03 % kvm -S -M pc-0.12 -~,bus=pci.0,addr=0x4
31359 be/4 dnsmasq       0.00 B     44.00 K  0.00 %  0.02 % dnsmasq -x /var/run~.dpkg-old,.dpkg-new
31922 be/4 libvirt-     24.00 K    320.00 K  0.00 %  0.00 % kvm -S -M pc-0.12 -~,bus=pci.0,addr=0x4
31281 be/4 root          0.00 B     16.00 K  0.00 %  0.00 % nmbd -D
31301 be/4 root          0.00 B      4.00 K  0.00 %  0.00 % apcupsd

The iotop output is only a snapshot, but in any case I never see more than 10 processes, and they really don't wait long (not even a second).

Hmm, what does "ps ax -H" show? And does the machine respond very slowly at the shell? If it seems to respond normally, these must be blocked processes, but we would need to work out why.

That seems likely to me: you have "Tasks: 399 total, 1 running, 398 sleeping, 0 stopped, 0 zombie". Unless this is a single-processor machine, you should have several tasks running.

Your ps output should show a whole battery of identical processes in a waiting state.
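
Linux counts tasks in uninterruptible sleep (state D) in the load average along with runnable ones, so counting D-state entries in a ps listing is a quick way to spot such a battery of identical processes. A minimal Python sketch, assuming the five-column "ps ax" layout (PID TTY STAT TIME COMMAND); the function name blocked_commands is hypothetical and the sample lines are copied from the listing posted in this thread:

```python
from collections import Counter


def blocked_commands(ps_output: str) -> Counter:
    """Count processes whose STAT starts with 'D', grouped by command name."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header line or malformed line
        stat, command = fields[2], fields[4]
        if stat.startswith("D"):
            counts[command.split()[0]] += 1
    return counts


# A few lines taken from the "ps ax -H" output in this thread.
SAMPLE_PS = """\
  PID TTY      STAT   TIME COMMAND
31047 ?        D      0:04   [rpciod/0]
24656 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
28321 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
27028 ?        D      0:00           rm -f /backup/savecluster.22-04-15.tar.gz
31891 pts/11   R+     0:00         ps ax -H
"""

if __name__ == "__main__":
    for command, count in blocked_commands(SAMPLE_PS).most_common():
        print(count, command)
```

On a live system the same function could be fed the output of ps ax, for example via subprocess.run(["ps", "ax"], capture_output=True, text=True).stdout.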

The machine doesn't necessarily respond very slowly at the shell, except from time to time, and then on commands as simple as ls -l, where even after 10 minutes I get nothing back.

As for the identical processes, it's not a battery but an army.

Here is the output of ps ax -H:

 ps ax -H
  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
    3 ?        S      0:02   [migration/0]
    4 ?        S      2:13   [ksoftirqd/0]
    5 ?        S      0:00   [watchdog/0]
    6 ?        S      0:10   [migration/1]
    7 ?        S      1:32   [ksoftirqd/1]
    8 ?        S      0:00   [watchdog/1]
    9 ?        S      0:02   [migration/2]
   10 ?        S     10:52   [ksoftirqd/2]
   11 ?        S      0:00   [watchdog/2]
   12 ?        S      0:10   [migration/3]
   13 ?        S      2:47   [ksoftirqd/3]
   14 ?        S      0:00   [watchdog/3]
   15 ?        S      0:13   [events/0]
   16 ?        S      0:18   [events/1]
   17 ?        S      0:36   [events/2]
   18 ?        S      0:15   [events/3]
   19 ?        S      0:00   [khelper]
   20 ?        S      0:00   [netns]
   21 ?        S      0:00   [async/mgr]
   22 ?        S      0:00   [pm]
   23 ?        S      0:05   [sync_supers]
   24 ?        S      0:03   [bdi-default]
   25 ?        S      0:00   [kintegrityd/0]
   26 ?        S      0:00   [kintegrityd/1]
   27 ?        S      0:00   [kintegrityd/2]
   28 ?        S      0:00   [kintegrityd/3]
   29 ?        S      0:31   [kblockd/0]
   30 ?        S      0:12   [kblockd/1]
   31 ?        S      0:32   [kblockd/2]
   32 ?        S      0:12   [kblockd/3]
   33 ?        S      0:00   [kacpid]
   34 ?        S      0:00   [kacpi_notify]
   35 ?        S      0:00   [kacpi_hotplug]
   36 ?        S      0:00   [kseriod]
   41 ?        S      0:00   [kondemand/0]
   42 ?        S      0:00   [kondemand/1]
   43 ?        S      0:00   [kondemand/2]
   44 ?        S      0:00   [kondemand/3]
   45 ?        S      0:00   [ubstatd]
   46 ?        S      0:02   [khungtaskd]
   47 ?        S     12:26   [kswapd0]
   48 ?        S      0:00   [aio/0]
   49 ?        S      0:00   [aio/1]
   50 ?        S      0:00   [aio/2]
   51 ?        S      0:00   [aio/3]
   52 ?        S      0:00   [crypto/0]
   53 ?        S      0:00   [crypto/1]
   54 ?        S      0:00   [crypto/2]
   55 ?        S      0:00   [crypto/3]
  216 ?        S      0:00   [ksuspend_usbd]
  217 ?        S      0:00   [khubd]
  218 ?        S      0:00   [ata/0]
  219 ?        S      0:00   [ata/1]
  220 ?        S      0:00   [ata/2]
  221 ?        S      0:00   [ata/3]
  222 ?        S      0:00   [ata_aux]
  223 ?        S      0:00   [scsi_eh_0]
  224 ?        S      0:00   [scsi_eh_1]
  227 ?        S      0:00   [scsi_eh_2]
  265 ?        S      0:00   [usbhid_resumer]
  296 ?        S      0:00   [kstriped]
  306 ?        S      1:40   [kjournald]
  561 ?        S      0:00   [kpsmoused]
  562 ?        S      0:34   [edac-poller]
 1146 ?        SN   124:59   [kipmi0]
30709 ?        S      2:42   [flush-8:0]
30752 ?        S      1:51   [kjournald]
30753 ?        S      2:14   [kjournald]
31047 ?        D      0:04   [rpciod/0]
31048 ?        S      0:04   [rpciod/1]
31049 ?        D      0:04   [rpciod/2]
31050 ?        D      0:11   [rpciod/3]
31052 ?        S<     0:00   [kslowd000]
31053 ?        S<     0:00   [kslowd001]
31054 ?        S      0:32   [nfsiod]
31217 ?        S      0:13   [lockd]
31220 ?        S      0:00   [kvm-irqfd-clean]
31221 ?        S      0:00   [nfsd4]
31222 ?        S      7:10   [nfsd]
31223 ?        S      6:34   [nfsd]
31224 ?        S      6:42   [nfsd]
31225 ?        S      7:18   [nfsd]
31226 ?        S      6:34   [nfsd]
31227 ?        S      6:33   [nfsd]
31228 ?        S      7:07   [nfsd]
31229 ?        S      7:07   [nfsd]
31771 ?        S      0:00   [vzmond]
11178 ?        S      0:16   [cifsd]
    1 ?        Ss     0:22 init [2]  
  385 ?        S<s    0:00   udevd --daemon
22666 ?        S<     0:00     udevd --daemon
22667 ?        S<     0:00     udevd --daemon
31030 ?        Ss     0:00   /sbin/portmap
31043 ?        Ss     0:00   /sbin/rpc.statd
31060 ?        Ss     0:00   /usr/sbin/rpc.idmapd
31219 ?        Sl     0:06   /usr/sbin/rsyslogd -c4
31236 ?        Ss     0:02   /usr/sbin/rpc.mountd --manage-gids
31253 ?        Sl   193:30   /usr/lib/gridengine/sge_execd
31276 ?        Ss     0:00   /usr/sbin/vzeventd
31281 ?        Ss     1:06   /usr/sbin/nmbd -D
31284 ?        Ss     0:00   /usr/sbin/smbd -D
31925 ?        S      0:00     /usr/sbin/smbd -D
31290 ?        Ss     0:00   /usr/sbin/acpid
31301 ?        Ssl    5:23   /sbin/apcupsd
31355 ?        S      0:54   /usr/sbin/arpwatch -u arpwatch -N -p
31361 ?        Ss     1:11   /usr/sbin/gpm -m /dev/input/mice -t exps2
31379 ?        Sl     0:01   /usr/sbin/libvirtd -d
31405 ?        Ss     0:00   /usr/bin/dbus-daemon --system
31411 ?        Ss     0:00   /usr/sbin/sshd
31842 ?        Ss     0:00     sshd: root@pts/11
31873 pts/11   Ss     0:00       -bash
31891 pts/11   R+     0:00         ps ax -H
31423 ?        Ss     0:07   /usr/sbin/cron
31434 ?        S      0:00     /USR/SBIN/CRON
31514 ?        Ss     0:00       /bin/sh -c /root/control-temp
31518 ?        S      0:05         /bin/bash /root/control-temp
30847 ?        S      0:00           sleep 600
19971 ?        S      0:00     /USR/SBIN/CRON
19974 ?        Ss     0:00       /bin/sh -c /root/Sauvegarde_Incrementielle_par_
19975 ?        S      0:00         /bin/bash /root/Sauvegarde_Incrementielle_par
27028 ?        D      0:00           rm -f /backup/savecluster.22-04-15.tar.gz
23851 ?        S      0:00     /USR/SBIN/CRON
23856 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
23858 ?        S      0:00         run-parts --report /etc/cron.daily
24198 ?        S      0:00           /bin/sh /etc/cron.daily/locate
24415 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
24462 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
24654 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
24655 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
24656 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
24463 ?        SN     0:00               /usr/bin/sort -z -f
24466 ?        SN     0:00               /usr/lib/locate/frcode -0
27832 ?        S      0:00     /USR/SBIN/CRON
27837 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
27839 ?        S      0:00         run-parts --report /etc/cron.daily
28166 ?        S      0:00           /bin/sh /etc/cron.daily/locate
28230 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
28238 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
28319 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
28320 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
28321 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
28239 ?        SN     0:00               /usr/bin/sort -z -f
28240 ?        SN     0:00               /usr/lib/locate/frcode -0
31695 ?        S      0:00     /USR/SBIN/CRON
31699 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
31703 ?        S      0:00         run-parts --report /etc/cron.daily
31985 ?        S      0:00           /bin/sh /etc/cron.daily/locate
31992 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
32007 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
32099 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
32100 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
32101 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
32008 ?        SN     0:00               /usr/bin/sort -z -f
32009 ?        SN     0:00               /usr/lib/locate/frcode -0
 3451 ?        S      0:00     /USR/SBIN/CRON
 3456 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
 3459 ?        S      0:00         run-parts --report /etc/cron.daily
 3632 ?        S      0:00           /bin/sh /etc/cron.daily/locate
 3720 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
 3779 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
 3787 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
 3788 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
 3789 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
 3780 ?        SN     0:00               /usr/bin/sort -z -f
 3781 ?        SN     0:00               /usr/lib/locate/frcode -0
 7834 ?        S      0:00     /USR/SBIN/CRON
 7839 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
 7841 ?        S      0:00         run-parts --report /etc/cron.daily
 8097 ?        S      0:00           /bin/sh /etc/cron.daily/locate
 8282 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
 8357 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
 8422 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
 8423 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
 8424 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
 8358 ?        SN     0:00               /usr/bin/sort -z -f
 8359 ?        SN     0:00               /usr/lib/locate/frcode -0
12281 ?        S      0:00     /USR/SBIN/CRON
12285 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
12287 ?        S      0:00         run-parts --report /etc/cron.daily
12423 ?        S      0:00           /bin/sh /etc/cron.daily/locate
12500 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
12761 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
12769 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
12770 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
12771 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
12762 ?        SN     0:00               /usr/bin/sort -z -f
12763 ?        SN     0:00               /usr/lib/locate/frcode -0
16660 ?        S      0:00     /USR/SBIN/CRON
16665 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
16668 ?        S      0:00         run-parts --report /etc/cron.daily
16956 ?        S      0:00           /bin/sh /etc/cron.daily/locate
17059 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
17067 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
17143 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
17144 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
17145 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
17068 ?        SN     0:00               /usr/bin/sort -z -f
17069 ?        SN     0:00               /usr/lib/locate/frcode -0
18041 ?        S      0:00     /USR/SBIN/CRON
18044 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
18047 ?        S      0:00         run-parts --report /etc/cron.daily
18272 ?        S      0:00           /bin/sh /etc/cron.daily/locate
18326 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
18384 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
18456 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
18457 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
18458 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
18385 ?        SN     0:00               /usr/bin/sort -z -f
18386 ?        SN     0:00               /usr/lib/locate/frcode -0
19936 ?        S      0:00     /USR/SBIN/CRON
19942 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
19943 ?        S      0:00         run-parts --report /etc/cron.daily
20211 ?        S      0:00           /bin/sh /etc/cron.daily/locate
20251 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
20326 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
20397 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
20398 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
20399 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
20327 ?        SN     0:00               /usr/bin/sort -z -f
20328 ?        SN     0:00               /usr/lib/locate/frcode -0
24798 ?        S      0:00     /USR/SBIN/CRON
24804 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
24805 ?        S      0:00         run-parts --report /etc/cron.daily
25196 ?        S      0:00           /bin/sh /etc/cron.daily/locate
25230 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
25273 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
25284 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
25285 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
25286 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
25274 ?        SN     0:00               /usr/bin/sort -z -f
25275 ?        SN     0:00               /usr/lib/locate/frcode -0
29559 ?        S      0:00     /USR/SBIN/CRON
29564 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
29567 ?        S      0:00         run-parts --report /etc/cron.daily
29821 ?        S      0:00           /bin/sh /etc/cron.daily/locate
29842 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
29956 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
30034 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
30035 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
30036 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
29957 ?        SN     0:00               /usr/bin/sort -z -f
29958 ?        SN     0:00               /usr/lib/locate/frcode -0
 2138 ?        S      0:00     /USR/SBIN/CRON
 2142 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
 2145 ?        S      0:00         run-parts --report /etc/cron.daily
 2601 ?        S      0:00           /bin/sh /etc/cron.daily/locate
 2606 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
 2750 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
 2759 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
 2760 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
 2761 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
 2751 ?        SN     0:00               /usr/bin/sort -z -f
 2752 ?        SN     0:00               /usr/lib/locate/frcode -0
 6833 ?        S      0:00     /USR/SBIN/CRON
 6838 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
 6841 ?        S      0:00         run-parts --report /etc/cron.daily
 7018 ?        S      0:00           /bin/sh /etc/cron.daily/locate
 7082 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
 7184 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
 7334 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
 7335 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
 7336 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
 7185 ?        SN     0:00               /usr/bin/sort -z -f
 7186 ?        SN     0:00               /usr/lib/locate/frcode -0
11707 ?        S      0:00     /USR/SBIN/CRON
11712 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
11714 ?        S      0:00         run-parts --report /etc/cron.daily
12056 ?        S      0:00           /bin/sh /etc/cron.daily/locate
12167 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
12178 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
12377 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
12378 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
12379 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
12179 ?        SN     0:00               /usr/bin/sort -z -f
12180 ?        SN     0:00               /usr/lib/locate/frcode -0
15005 ?        S      0:00     /USR/SBIN/CRON
15009 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
15012 ?        S      0:00         run-parts --report /etc/cron.daily
15217 ?        S      0:00           /bin/sh /etc/cron.daily/locate
15383 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
15391 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
15399 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
15400 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
15401 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
15392 ?        SN     0:00               /usr/bin/sort -z -f
15393 ?        SN     0:00               /usr/lib/locate/frcode -0
20029 ?        S      0:00     /USR/SBIN/CRON
20032 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
20037 ?        S      0:00         run-parts --report /etc/cron.daily
20494 ?        S      0:00           /bin/sh /etc/cron.daily/locate
20499 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
20510 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
20519 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
20520 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
20521 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
20511 ?        SN     0:00               /usr/bin/sort -z -f
20512 ?        SN     0:00               /usr/lib/locate/frcode -0
25173 ?        S      0:00     /USR/SBIN/CRON
25178 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
25180 ?        S      0:00         run-parts --report /etc/cron.daily
25417 ?        S      0:00           /bin/sh /etc/cron.daily/locate
25431 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
25665 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
25814 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
25815 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
25816 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
25666 ?        SN     0:00               /usr/bin/sort -z -f
25667 ?        SN     0:00               /usr/lib/locate/frcode -0
26161 ?        S      0:00     /USR/SBIN/CRON
26164 ?        Ss     0:00       /bin/sh -c /root/Sauvegarde_Incrementielle_par_
26165 ?        S      0:00         /bin/bash /root/Sauvegarde_Incrementielle_par
26167 ?        S      0:00           tar cvfz /backup/savecluster.01-06-15.tar.g
26168 ?        D      0:00             tar cvfz /backup/savecluster.01-06-15.tar
30519 ?        S      0:00     /USR/SBIN/CRON
30523 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
30526 ?        S      0:00         run-parts --report /etc/cron.daily
30712 ?        S      0:00           /bin/sh /etc/cron.daily/locate
30836 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
31127 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
31208 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
31209 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
31210 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
31128 ?        SN     0:00               /usr/bin/sort -z -f
31129 ?        SN     0:00               /usr/lib/locate/frcode -0
14251 ?        S      0:00     /USR/SBIN/CRON
14257 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
14260 ?        S      0:00         run-parts --report /etc/cron.daily
14477 ?        S      0:00           /bin/sh /etc/cron.daily/locate
14550 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
14710 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
14718 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
14719 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
14720 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
14711 ?        SN     0:00               /usr/bin/sort -z -f
14712 ?        SN     0:00               /usr/lib/locate/frcode -0
13763 ?        S      0:00     /USR/SBIN/CRON
13767 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
13771 ?        S      0:00         run-parts --report /etc/cron.daily
13994 ?        S      0:00           /bin/sh /etc/cron.daily/locate
14018 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
14155 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
14238 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
14239 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
14240 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
14156 ?        SN     0:00               /usr/bin/sort -z -f
14157 ?        SN     0:00               /usr/lib/locate/frcode -0
  734 ?        S      0:00     /USR/SBIN/CRON
  737 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
  740 ?        S      0:00         run-parts --report /etc/cron.daily
 1018 ?        S      0:00           /bin/sh /etc/cron.daily/locate
 1025 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
 1172 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
 1313 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
 1314 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
 1315 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
 1173 ?        SN     0:00               /usr/bin/sort -z -f
 1174 ?        SN     0:00               /usr/lib/locate/frcode -0
30198 ?        S      0:00     /USR/SBIN/CRON
30203 ?        Ss     0:00       /bin/sh -c test -x /usr/sbin/anacron || ( cd / 
30204 ?        S      0:00         run-parts --report /etc/cron.daily
30499 ?        S      0:00           /bin/sh /etc/cron.daily/locate
30562 ?        SN     0:00             /bin/sh /usr/bin/updatedb.findutils
30630 ?        SN     0:00               /bin/sh /usr/bin/updatedb.findutils
30676 ?        SN     0:00                 su nobody -s /bin/sh -c /usr/bin/find
30677 ?        SN     0:00                   sh -c /usr/bin/find / -ignore_readd
30678 ?        DN     0:00                     /usr/bin/find / -ignore_readdir_r
30631 ?        SN     0:00               /usr/bin/sort -z -f
30632 ?        SN     0:00               /usr/lib/locate/frcode -0
31712 ?        Ss     0:00   /usr/sbin/exim4 -bd -q30m
31737 ?        Ss     2:07   /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
31827 ?        Ds     0:18   [init]
31852 ?        Sl   207:39   /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 300 -smp
31893 ?        Sl   444:25   /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 804 -smp
31906 ?        Sl   6018:59   /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 128 -smp
31919 ?        Sl   2497:51   /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 128 -smp
32229 ?        Ss     1:53   /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:110
 1451 ?        Ss     0:20   init [2]      
 2085 ?        Ss     0:20   init [2]      
 3477 ?        Sl   531:18   ruby /opt/ovz-web-panel//utils/watchdog/watchdog.rb
 6983 ?        S      1:01   ruby /opt/ovz-web-panel//script/server webrick -e p
 6987 ?        Sl   188:41   ruby /opt/ovz-web-panel//utils/hw-daemon/hw-daemon.
 7728 ?        Ss     2:37   /usr/sbin/munin-node
 7781 tty1     Ss+    0:00   /sbin/getty 38400 tty1
 7782 tty2     Ss+    0:00   /sbin/getty 38400 tty2
 7783 tty3     Ss+    0:00   /sbin/getty 38400 tty3
 7784 tty4     Ss+    0:00   /sbin/getty 38400 tty4
 7785 tty5     Ss+    0:00   /sbin/getty 38400 tty5
 7786 tty6     Ss+    0:00   /sbin/getty 38400 tty6
19978 ?        D      6:27   [gzip]
 8577 ?        D      0:00   ls --color=auto -h -Ch --color=auto -h -Cahl
 9785 ?        D      0:00   -bash
19013 ?        Ds     0:00   shutdown -h 0 w
15133 ?        D      0:00   ls --color=auto -h -Ch --color=auto -h -Cahl
28509 ?        S<L    0:12   /usr/bin/atop -a -w /var/log/atop.log 600
11184 ?        Ds     0:00   -bash
11309 ?        D      0:00   ls --color=auto -h -Ch --color=auto -h -Cahl
12538 ?        D      0:00   ls --color=auto -h -Ch --color=auto -h -Cahl
20734 ?        S      0:00   apt-get install virt-manager
22597 pts/10   Ss+    0:00     /usr/bin/dpkg --status-fd 14 --configure libdrm2
22628 pts/10   S+     0:00       /bin/sh -e /var/lib/dpkg/info/fuse-utils.postin
22643 pts/10   S+     0:00         /bin/sh /usr/sbin/invoke-rc.d fuse start
22659 pts/10   S+     0:00           /bin/sh /etc/init.d/fuse start
22663 pts/10   D+     0:00             modprobe fuse
25727 ?        S      0:00   /usr/sbin/dnsmasq -x /var/run/dnsmasq/dnsmasq.pid -

I tried stopping processes, and even killing some of them, but they are still there and the shell gives no feedback.

I have scheduled a full reboot of the server for Monday morning, but I would still like to know why I ended up in this situation and how to keep it from happening again.

One of your NFS volumes is down. The updatedb runs and the backups are frozen waiting for a response. At least kill the updatedb processes with kill -9. Check the mount and the network cabling.

Ok :open_mouth:

Well, I'll look at this with a clear head on Monday morning; it's still not good news :confused:

For my own education and Linux knowledge: where did you see, or how did you deduce, that there is a problem with an NFS volume?

You have nfsd daemons running, so you are exporting NFS shares. The blocked processes are updatedb runs and backup scripts, and some ls commands take ten minutes to complete. So there is very clearly a filesystem access problem. NFS and other network shares cause this kind of blocking behaviour, and in my experience it is usually NFS.

You can try to see whether there are directories you cannot enter; that will give you more details.
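
Probing directories by hand risks hanging your own shell, as the stuck ls processes show. Here is a minimal Python sketch that lists network mounts and probes each one from a child process with a timeout. The names network_mounts and responds, the 5-second timeout, and the sample mount table are all assumptions made for illustration, not something taken from this server.

```python
import subprocess

NETWORK_FS = {"nfs", "nfs4", "cifs"}


def network_mounts(mounts_text: str) -> list:
    """Return the mount points whose type is a network filesystem.

    Expects text in /proc/mounts format: device, mount point, type, options...
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in NETWORK_FS:
            points.append(fields[1])
    return points


def responds(path: str, timeout: float = 5.0) -> bool:
    """Run `stat` on path in a child process; False if it hangs or fails."""
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # The child may stay stuck in state D; we deliberately do not wait for it.
        proc.kill()
        return False


# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,errors=remount-ro 0 0
proc /proc proc rw 0 0
server:/export /mnt/export nfs rw,hard,intr 0 0
"""

if __name__ == "__main__":
    with open("/proc/mounts") as handle:  # Linux-specific
        for mount_point in network_mounts(handle.read()):
            state = "ok" if responds(mount_point) else "NOT RESPONDING"
            print(f"{mount_point}: {state}")
```

The probe uses Popen with wait(timeout=...) rather than subprocess.run(..., timeout=...), because run waits for the child to exit after killing it, and a child stuck in uninterruptible sleep would block the script in turn.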

Hello

Yes, indeed, I have a large directory shared over NFS, and it is a fairly important one (fortunately I have a fairly recent backup; that is what the incremental backup script does).

Can a problem like this be fixed easily, or am I in for salvaging what can be salvaged, wiping the volume, recreating it and restoring the rest from the backup?

It all depends, in particular, on where the directory physically lives.

  • Use ls to identify which directory is inaccessible.

  • If the directory lives on another machine and is mounted over NFS, look at the physical link or at the remote NFS server. Try unmounting and remounting the share, and check that the server responds and that the firewall rules have not been changed.

  • If the directory is on this machine and exported over NFS, that is more puzzling. In that case I would stop the NFS server to see whether it changes anything. Again, check the physical links (in my experience NFS is very sensitive to them).

  • If the directory is on this machine and stopping NFS changes nothing, then you can panic a little: have a coffee, work out where you put the latest backup, look at what kind of errors show up in syslog, copy the files from the affected volume to another disk, then replace the disk and restore your data.
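
The first step of that list can be sketched as follows. This is only an illustration: probe_dir is a made-up helper, and the 5-second default is an arbitrary assumption. It runs ls in the background, so the calling shell stays usable even when the listing never returns.

```shell
# Try to list a directory in the background and report whether the
# listing finished within a few seconds. A listing that never returns
# points at the filesystem (or NFS export) holding that directory.
# Usage: probe_dir DIR [SECONDS]
probe_dir() {
    dir=$1
    wait_s=${2:-5}
    flag=$(mktemp) || return 2
    # The subshell writes the exit status of ls to $flag once ls returns.
    ( ls "$dir"; echo "$?" > "$flag" ) >/dev/null 2>&1 &
    i=0
    while [ "$i" -lt "$wait_s" ]; do
        sleep 1
        i=$((i + 1))
        if [ -s "$flag" ]; then
            status=$(cat "$flag")
            rm -f "$flag"
            if [ "$status" -eq 0 ]; then
                echo "ok $dir"
            else
                echo "error $dir"
            fi
            return 0
        fi
    done
    # Still blocked: the background ls is left alone (a task in
    # uninterruptible sleep generally ignores signals until the I/O
    # returns), and the temporary flag file is left behind.
    echo "stuck $dir"
    return 1
}

# Example, with the mount points as placeholders:
#   for d in / /home /data; do probe_dir "$d"; done
```

A directory reported as stuck is the one to investigate first; error only means that ls returned with a failure, for instance because the path does not exist.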

Thanks for the information.

For a start, I can tell you that it is on the server itself (I am not the one who built and set up this server).

There is a single array, split into several partitions: one data partition (the NFS share in question), and the rest is / and /home with KVM and OpenVZ VMs.

You mention the physical link, and now I am starting to worry, because the server has been in production non-stop since 2007… It may also be a sign of component fatigue.

In any case, I plan to move this data to another machine, which will also have a RAID 5 array (I receive the disks this week). So whatever happens, this NFS share is leaving the server.

What state is the RAID in (/proc/mdstat), and what does smartctl say? Is it just one branch of the directory tree, or a whole partition?

If I remember correctly, it is a RAID 0 of 160 GB disks and a whole partition; however, I did not think to check whether it is hardware or software RAID.

I will confirm that on Monday morning.

Thanks

I have just reread the thread. If we look at

ct0 statd: server rpc.statd not responding, timed out
lockd: cannot monitor hastur

which I had overlooked, it clearly points to NFS. Have you recently updated your server (or one of the NFS clients)? I ran into this kind of problem after an update. I remember doing the following:

  • Some cleanup: I removed the contents of /var/lib/nfs/sm and /var/lib/nfs/sm.bak after stopping rpcbind and nfslock (and restarted everything after the cleanup).

  • I think that once I also stopped nfslock and mounted the directories with the nolock option.
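
Before removing anything, it may help to see what the two state directories actually contain. The sketch below only lists files and deletes nothing; list_statd_state is a made-up name, and the assumption that these directories hold the statd monitor records should be checked against your distribution's rpc.statd documentation.

```shell
# List the files under the statd state directories without deleting
# anything. With no arguments, the two paths named above are used;
# other paths can be passed to inspect a copy instead.
list_statd_state() {
    if [ "$#" -eq 0 ]; then
        set -- /var/lib/nfs/sm /var/lib/nfs/sm.bak
    fi
    for d in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$d" ] || continue
        find "$d" -type f
    done
}

# The nolock variant of a mount would look like this, with the server
# name, export and mount point as placeholders:
#   mount -t nfs -o nolock server:/export /mnt/share
```

Service names differ between distributions, so the stop and restart steps are deliberately left out of the sketch.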

Hello

Here is the information:

the RAID is a hardware RAID 5

cat /proc/mdstat
cat: /proc/mdstat: Aucun fichier ou dossier de ce type

As for the other question:

smartctl -H /dev/sda1
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Smartctl open device: /dev/sda1 failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

How do I get around this smartctl error message in my case (I am not familiar with smartctl)?

I have just rebooted my server, and here is the load:

top - 09:23:21 up 13 min,  1 user,  load average: 0.09, 0.66, 0.74
Tasks: 140 total,   1 running, 137 sleeping,   0 stopped,   2 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.5%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.5%us,  0.2%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.2%us,  0.5%sy,  0.0%ni, 98.5%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.2%us,  0.2%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8178584k total,  2316784k used,  5861800k free,   119904k buffers
Swap: 12292384k total,        0k used, 12292384k free,   982824k cached

It is fine now, but for how long!!! The investigation has to continue.

edit: I have not applied any updates to this server recently

Thanks

For smartctl and MegaRAID controllers, you can look here:
thomas-krenn.com/en/wiki/Sm … Controller
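
As the smartctl error message itself suggests, each physical disk behind the controller is addressed with -d megaraid,N. A small sketch for reading the result: smart_suspects is a made-up helper, and the choice of attribute IDs (5, 196, 197, 198, 199) is an assumption about which counters matter most for ageing disks and cabling, not an official list.

```shell
# Keep only a few SMART attributes commonly watched on ageing disks
# (reallocated sectors, pending sectors, uncorrectable sectors, CRC
# errors), and only when their raw value (last column) is above zero.
# Reads `smartctl -A` or `smartctl -a` output on stdin.
smart_suspects() {
    awk 'NF >= 10 && $1 ~ /^(5|196|197|198|199)$/ && $NF + 0 > 0 {
        print $1, $2, $NF
    }'
}

# Example over several physical disks; the range of N is a guess and
# depends on how the controller numbers its drives:
#   for n in 0 1 2 3 4 5; do
#       echo "== megaraid,$n"
#       smartctl -A -d megaraid,$n /dev/sda | smart_suspects
#   done
```

A disk that prints nothing has zero raw values for all of these counters; any line that does appear is worth comparing between disks of the same model and age.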

[quote=“Minus”]
It is fine now, but for how long!!! The investigation has to continue.

edit: I have not applied any updates to this server recently

Thanks[/quote]

Have you walked the whole directory tree? Run

cd / && ls -lR

and see whether it gets stuck anywhere.

Has any client been updated?

The ls -lR went through without a problem on the first try, with no waiting, and so did a plain ls -l on the root, whereas before it took forever.

There were client updates, but nothing that could have affected the NFS mount.

Here are the smartctl tests:

smartctl -a -d megaraid,1  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_01] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF521738
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:20:18 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (5400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   185   185   021    Pre-fail  Always       -       3733
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       63
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68318
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       63
190 Airflow_Temperature_Cel 0x0022   078   034   045    Old_age   Always   In_the_past 22
194 Temperature_Celsius     0x0022   125   081   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       9
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 126 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 126 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:02:00.891  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:58.894  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT

Error 125 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:58.894  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT

Error 124 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:56.881  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT

Error 123 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:54.867  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4e 00 00   5d+18:01:50.300  READ VERIFY SECTOR(S) EXT

Error 122 occurred at disk power-on lifetime: 62215 hours (2592 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 00 00 00 4f 40  Error: AMNF at LBA = 0x004f0000 = 5177344

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 00 4f 00 00   5d+18:01:52.853  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4f 00 00   5d+18:01:50.848  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4e 00 00   5d+18:01:50.573  READ VERIFY SECTOR(S) EXT
  42 00 00 00 00 4e 00 00   5d+18:01:50.300  READ VERIFY SECTOR(S) EXT
  42 00 00 00 80 4d 00 00   5d+18:01:50.024  READ VERIFY SECTOR(S) EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -d megaraid,2  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_02] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF319433
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:21:13 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68327
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   075   048   045    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   122   095   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -d megaraid,3  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_03] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF543667
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:22:12 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68328
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   077   054   045    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   124   101   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -d megaraid,4  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_04] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF319434
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:22:56 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   185   185   021    Pre-fail  Always       -       3750
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68320
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   077   050   045    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   124   097   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -d megaraid,5  /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sda [megaraid_disk_05] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE Serial ATA family
Device Model:     WDC WD1600JS-75NCB3
Serial Number:    WD-WCANMF521499
Firmware Version: 10.02E04
User Capacity:    160 000 000 000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  8 15:23:30 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   186   186   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       68320
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
190 Airflow_Temperature_Cel 0x0022   076   052   045    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   123   099   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So I take it disk 1 really does have a problem, is that right? (Odd, because the server itself does not report it as failed, judging by the colour of that disk's LED.)

Disk 1 did have a problem (it looks like it overheated), but that was about 6,000 hours ago (roughly 250 days, so the better part of a year). Your disks look fairly old (7 to 8 years; for instance, 68320 Power_On_Hours is about 7.8 years) and they are all the same age (careful, they could well fail at the same time). But at first glance there are no serious recent errors.
The problem may come from there, but nothing lets me (personally) say so with certainty.
Do the syslogs for June 4 show an error such as r/w error, lseek error, I/O error, etc.? (See /var/log/syslog.3.gz, /var/log/syslog.4.gz and /var/log/syslog.5.gz.)
You can also look for NFS errors in /var/log/daemon.log or /var/log/daemon.log.1.
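
To make that log search concrete, here is a minimal sketch. The function name `scan_io_errors` and the exact patterns are my own assumptions, not a standard tool; adjust the patterns to whatever your kernel and daemons actually log. It relies on `gzip -cdf`, which decompresses `.gz` files and passes plain files through unchanged, so rotated and current logs can be mixed.

```shell
#!/bin/sh
# scan_io_errors FILE...
# Hypothetical helper: print log lines that look like disk I/O errors.
# gzip -cdf writes decompressed data to stdout and copies non-gzip
# files through unchanged, so .gz and plain logs can be given together.
# The pattern list is an assumption; extend it as needed.
scan_io_errors() {
    gzip -cdf -- "$@" | grep -iE 'i/o error|io error|read error|write error|lseek'
}

# Example calls (uncomment on the server; the file names are the ones
# mentioned above and depend on the local logrotate setup):
# scan_io_errors /var/log/syslog.3.gz /var/log/syslog.4.gz /var/log/syslog.5.gz
# scan_io_errors /var/log/daemon.log /var/log/daemon.log.1 | grep -i nfs
```

If nothing is printed, that only means none of these patterns matched; it does not prove the disks are healthy.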

Still, it's odd. On the client side, were no problems reported on June 3 or 4?