My server crashes regularly

Hello,
I have an Olimex Lime 2 server running Debian stable (YunoHost), self-hosted behind a Freebox (mini 4K).

admin@crust:~$ cat /etc/debian_version 
11.6
admin@crust:~$ 
admin@crust:~$ uname -a
Linux crust.ovh 5.10.105-olimex #072307 SMP Wed Oct 12 07:24:41 UTC 2022 armv7l GNU/Linux
admin@crust:~$ 
admin@crust:~$ sudo yunohost -v
yunohost: 
  repo: stable
  version: 11.0.10.2
    yunohost-admin: 
      repo: stable
      version: 11.0.11
    moulinette: 
      repo: stable
      version: 11.0.9
    ssowat: 
      repo: stable
      version: 11.0.9

admin@crust:~$ 
admin@crust:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            447M     0  447M   0% /dev
tmpfs           100M   12M   89M  12% /run
/dev/sda1       117G   75G   36G  68% /
tmpfs           500M     0  500M   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           100M     0  100M   0% /run/user/1007
admin@crust:~$ 
admin@crust:~$ free 
               total        used        free      shared  buff/cache   available
Mem:         1022188      200000      520396       14596      301792      783328
Swap:        2097148           0     2097148

It generally works fine, but it crashes roughly once a day: no more access to the mail server, to the web server, or via SSH.

I'm not sure where in the logs to look for clues to diagnose this crash.

admin@crust:~$ sudo dmesg | egrep 'fail|err|warn|firm'
[    0.858951] hw perfevents: no interrupt-affinity property for /pmu, guessing.
[    1.333268] sdhci: Copyright(c) Pierre Ossman
[    8.459322] EXT4-fs (sda1): re-mounted. Opts: commit=600,errors=remount-ro
[   11.163763] random: 7 urandom warning(s) missed due to ratelimiting
[   12.067790] sunxi_cedrus: module is from the staging directory, the quality is unknown, you have been warned.
[   12.724358] lcd_olinuxino 2-0050: error reading from device at 00
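As an aside, that egrep pattern is case-sensitive, so it would miss messages spelled `Error`, `WARNING`, or `FAILED`. A minimal sketch of a case-insensitive match, using a made-up sample file in place of the real dmesg output:

```shell
# Kernel messages mix cases; grep -i catches 'Error', 'WARNING', 'FAILED', etc.
# /tmp/dmesg.sample is a hypothetical stand-in for `sudo dmesg` output.
cat > /tmp/dmesg.sample <<'EOF'
[    1.000000] usb 1-1: device descriptor read/64, error -71
[    8.459322] EXT4-fs (sda1): re-mounted. Opts: commit=600,errors=remount-ro
[   42.000000] WARNING: CPU: 0 PID: 123 at kernel/sched/core.c
EOF
grep -iE 'fail|err|warn|firm' /tmp/dmesg.sample
```

On the real machine this would be `sudo dmesg | grep -iE 'fail|err|warn|firm'`.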

And smartctl:

admin@crust:~$ sudo smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [armv7l-linux-5.10.105-olimex] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven OEM SSDs
Device Model:     SPCC Solid State Disk
Serial Number:    115E07060CCD00047974
Firmware Version: SBFM61.3
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Dec 22 11:29:54 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(65535) seconds.
Offline data collection
capabilities: 			 (0x79) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
Conveyance self-test routine
recommended polling time: 	 (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       17412
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       79
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/60
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       60 (Average 24)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       78
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       593

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2069         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

When it crashes, there is no access at all. The RJ45 link to the box blinks from time to time; that is all I can tell. To reboot it, I unplug it and plug it back in.

Thanks in advance :wink:

So if anyone has leads, I'm all ears.

Maybe it's going to sleep :slight_smile:

I've never configured any power-management state like that.
What's even more surprising is that it can go a week without crashing, yet at the moment it crashes almost daily.

Have you checked for dust?
Maybe replace the CPU's thermal paste?

Not too dirty; it's cleaned regularly.

Is there really no way to find a log from the moment of the crash? That could put me on the right track.

So I left a

sudo dmesg -w

running in the background; we'll see what it reports at the next crash.
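One caveat with that approach: `dmesg -w` reads the in-memory kernel ring buffer, so everything in it is lost when the machine is power-cycled. If journald is running, persistent journal storage would keep logs across reboots (a sketch, assuming Debian's systemd defaults; requires root):

```shell
# Make the systemd journal persistent so it survives a hard reboot.
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
# After the next crash and power-cycle, inspect the end of the previous boot:
journalctl --list-boots
journalctl -b -1 -e
```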

You'd do better to study the syslog at the times of the crashes, and possibly kern.log as well.
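To illustrate, narrowing the syslog to the crash window can be as simple as a grep on the timestamp prefix. A sketch using a hypothetical sample file standing in for /var/log/syslog:

```shell
# Keep only syslog lines from the 09:17-09:18 window on Dec 22.
# /tmp/syslog.sample is a made-up stand-in for /var/log/syslog.
cat > /tmp/syslog.sample <<'EOF'
Dec 22 09:16:01 crust CRON[999]: (root) CMD (logrotate)
Dec 22 09:17:54 crust glances[441]: TypeError: unexpected keyword argument
Dec 22 09:18:20 crust ntpd[481]: Soliciting pool server 51.255.95.80
Dec 22 10:43:10 crust ntpd[481]: receive: Unexpected origin timestamp
EOF
grep -E '^Dec 22 09:1[78]' /tmp/syslog.sample
```

On the real machine: `grep -E '^Dec 22 09:1[78]' /var/log/syslog`, adjusting the date and minute range.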

Here is an excerpt from the syslog. It must have crashed around 09:18, but I'm not sure what to make of this information.

Dec 22 09:17:54 crust glances[441]:   File "/usr/lib/python3/dist-packages/glances/autodiscover.py", line 212, in __init__
Dec 22 09:17:54 crust glances[441]:     self.info = ServiceInfo(
Dec 22 09:17:54 crust glances[441]: TypeError: __init__() got an unexpected keyword argument 'address'
Dec 22 09:17:56 crust systemd[1]: glances.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 09:17:56 crust systemd[1]: glances.service: Failed with result 'exit-code'.
Dec 22 09:17:56 crust systemd[1]: glances.service: Consumed 9.615s CPU time.
Dec 22 09:18:20 crust ntpd[481]: Soliciting pool server 51.255.95.80
Dec 22 09:18:20 crust ntpd[481]: Soliciting pool server 212.83.158.83
Dec 22 09:18:21 crust ntpd[481]: Soliciting pool server 51.68.44.27
Dec 22 09:18:21 crust ntpd[481]: Soliciting pool server 162.159.200.1
Dec 22 09:18:22 crust ntpd[481]: Soliciting pool server 51.15.182.163
Dec 22 09:18:22 crust ntpd[481]: Soliciting pool server 51.15.182.163
Dec 22 09:18:22 crust ntpd[481]: Soliciting pool server 2001:41d0:701:1100::285d
Dec 22 09:18:23 crust ntpd[481]: Soliciting pool server 2001:41d0:801:2000::acb
Dec 22 09:18:23 crust ntpd[481]: Soliciting pool server 178.32.23.17
Dec 22 09:18:24 crust ntpd[481]: Soliciting pool server 2001:41d0:2:52a7::1:4
Dec 22 09:18:24 crust ntpd[481]: Soliciting pool server 51.38.186.51
Dec 22 09:18:25 crust ntpd[481]: Soliciting pool server 185.254.101.25
Dec 22 09:18:25 crust ntpd[481]: Soliciting pool server 2001:678:8::123
Dec 22 09:18:26 crust ntpd[481]: Soliciting pool server 129.250.35.250
Dec 22 09:18:26 crust ntpd[481]: Soliciting pool server 37.59.63.125
Dec 22 09:18:27 crust ntpd[481]: Soliciting pool server 178.249.167.0
Dec 22 10:43:10 crust ntpd[481]: Soliciting pool server 82.64.32.33
Dec 22 10:43:10 crust ntpd[481]: receive: Unexpected origin timestamp 0xe74e9f63.dd64b977 does not match :

It looks like the NTP configuration deserves a look.
It's probably in the /etc/ntp.conf file; you most likely have too many upstream servers, and some of them are probably not configured correctly.
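For illustration, a trimmed-down /etc/ntp.conf along those lines might look like this (a sketch only; each debian pool alias already resolves to several servers, so a single pool line still gives redundancy):

```
# /etc/ntp.conf - minimal sketch with fewer upstream pools
driftfile /var/lib/ntp/ntp.drift

# One pool directive; the alias expands to multiple servers.
pool 0.debian.pool.ntp.org iburst

# Default restrictions, as in the Debian stock file.
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict 127.0.0.1
restrict ::1
restrict source notrap nomodify noquery
```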

Hello, and thanks for the reply.

Here is my ntp.conf:

 cat /etc/ntp.conf 
# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

driftfile /var/lib/ntp/ntp.drift

# Leap seconds definition provided by tzdata
leapfile /usr/share/zoneinfo/leap-seconds.list

# Enable this if you want statistics to be logged.
#statsdir /var/log/ntpstats/

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable


# You do need to talk to an NTP server or two (or three).
#server ntp.your-provider.example

# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
pool 0.debian.pool.ntp.org iburst
pool 1.debian.pool.ntp.org iburst
pool 2.debian.pool.ntp.org iburst
pool 3.debian.pool.ntp.org iburst


# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html for
# details.  The web page <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.

# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

# Needed for adding pool entries
restrict source notrap nomodify noquery

# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.
#restrict 192.168.123.0 mask 255.255.255.0 notrust


# If you want to provide time to your local subnet, change the next line.
# (Again, the address is an example only.)
#broadcast 192.168.123.255

# If you want to listen to time broadcasts on your local subnet, de-comment the
# next lines.  Please do this only if you trust everybody on the network!
#disable auth
#broadcastclient

And a few more outputs, in case they help:

admin@crust:~$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.debian.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.002
 1.debian.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.002
 2.debian.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.002
 3.debian.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.002
*212.83.158.83   145.238.203.14   2 u  150 1024  377    7.311   +0.107   0.272
+51.68.44.27     51.75.17.219     3 u  372 1024  377   11.347   -0.092   0.176
-2001:41d0:701:1 77.168.255.129   2 u 1042 1024  377   16.026   +0.186   0.332
-2001:41d0:801:2 29.243.228.195   3 u  338 1024  377   14.087   -0.357   0.164
-2001:678:8::123 85.199.214.102   2 u  830 1024  377   14.619   +0.036   0.207
+178.249.167.0   202.70.69.81     2 u  122 1024  377   11.877   -0.229   0.196

And:

admin@crust:~$  sudo systemctl stop ntp && sudo ntpd -gq
23 Dec 07:46:42 ntpd[9092]: ntpd 4.2.8p15@1.3728-o Wed Sep 23 11:46:38 UTC 2020 (1): Starting
23 Dec 07:46:42 ntpd[9092]: Command line: ntpd -gq
23 Dec 07:46:42 ntpd[9092]: ----------------------------------------------------
23 Dec 07:46:42 ntpd[9092]: ntp-4 is maintained by Network Time Foundation,
23 Dec 07:46:42 ntpd[9092]: Inc. (NTF), a non-profit 501(c)(3) public-benefit
23 Dec 07:46:42 ntpd[9092]: corporation.  Support and training for ntp-4 are
23 Dec 07:46:42 ntpd[9092]: available at https://www.nwtime.org/support
23 Dec 07:46:42 ntpd[9092]: ----------------------------------------------------
23 Dec 07:46:42 ntpd[9092]: proto: precision = 1.291 usec (-19)
23 Dec 07:46:42 ntpd[9092]: basedate set to 2020-09-11
23 Dec 07:46:42 ntpd[9092]: gps base set to 2020-09-13 (week 2123)
23 Dec 07:46:42 ntpd[9092]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash signature
23 Dec 07:46:42 ntpd[9092]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, expire=2023-06-28T00:00:00Z last=2017-01-01T00:00:00Z ofs=37
23 Dec 07:46:42 ntpd[9092]: Listen and drop on 0 v6wildcard [::]:123
23 Dec 07:46:42 ntpd[9092]: Listen and drop on 1 v4wildcard 0.0.0.0:123
23 Dec 07:46:42 ntpd[9092]: Listen normally on 2 lo 127.0.0.1:123
23 Dec 07:46:42 ntpd[9092]: Listen normally on 3 eth0 192.168.0.46:123
23 Dec 07:46:42 ntpd[9092]: Listen normally on 4 lo [::1]:123
23 Dec 07:46:42 ntpd[9092]: Listen normally on 5 eth0 [2a01:e0a:82d:d0c0:321f:9aff:fed0:33ba]:123
23 Dec 07:46:42 ntpd[9092]: Listen normally on 6 eth0 [fe80::321f:9aff:fed0:33ba%2]:123
23 Dec 07:46:42 ntpd[9092]: Listening on routing socket on fd #23 for interface updates
23 Dec 07:46:43 ntpd[9092]: Soliciting pool server 195.83.132.135
23 Dec 07:46:44 ntpd[9092]: Soliciting pool server 178.249.167.0
23 Dec 07:46:44 ntpd[9092]: Soliciting pool server 162.159.200.123
23 Dec 07:46:45 ntpd[9092]: Soliciting pool server 51.15.175.180
23 Dec 07:46:45 ntpd[9092]: Soliciting pool server 151.80.168.4
23 Dec 07:46:45 ntpd[9092]: Soliciting pool server 2a01:cb19:896e:3d1d:c23f:d5ff:fe63:aa1
23 Dec 07:46:46 ntpd[9092]: Soliciting pool server 2a00:1080:807:200::5:1
23 Dec 07:46:46 ntpd[9092]: Soliciting pool server 95.81.173.74
23 Dec 07:46:46 ntpd[9092]: Soliciting pool server 178.33.111.49
23 Dec 07:46:46 ntpd[9092]: Soliciting pool server 94.23.215.121
23 Dec 07:46:47 ntpd[9092]: Soliciting pool server 37.59.63.125
23 Dec 07:46:47 ntpd[9092]: Soliciting pool server 2001:bc8:255e:200::1
23 Dec 07:46:47 ntpd[9092]: Soliciting pool server 51.15.182.163
23 Dec 07:46:48 ntpd[9092]: Soliciting pool server 5.39.92.42
23 Dec 07:46:48 ntpd[9092]: Soliciting pool server 2001:41d0:700:143f::
23 Dec 07:46:49 ntpd[9092]: Soliciting pool server 37.187.104.44
23 Dec 07:46:49 ntpd[9092]: Soliciting pool server 51.15.191.239
23 Dec 07:46:51 ntpd[9092]: ntpd: time slew -0.000237 s
ntpd: time slew -0.000237s

And finally:

admin@crust:~$ sudo service ntp status
● ntp.service - Network Time Service
     Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/ntp.service.d
             └─ynh-override.conf
     Active: inactive (dead) since Fri 2022-12-23 07:46:42 UTC; 2min 2s ago
       Docs: man:ntpd(8)
    Process: 453 ExecStart=/usr/lib/ntp/ntp-systemd-wrapper (code=exited, status=0/SUCCESS)
   Main PID: 481 (code=exited, status=0/SUCCESS)
        CPU: 30.670s

Dec 23 07:46:42 crust.ovh ntpd[481]: 212.83.158.83 local addr 192.168.0.46 -> <null>
Dec 23 07:46:42 crust.ovh ntpd[481]: 51.68.44.27 local addr 192.168.0.46 -> <null>
Dec 23 07:46:42 crust.ovh ntpd[481]: 2001:41d0:701:1100::285d local addr 2a01:e0a:82d:d0c0:321f:9aff:fed>
Dec 23 07:46:42 crust.ovh ntpd[481]: 2001:41d0:801:2000::acb local addr 2a01:e0a:82d:d0c0:321f:9aff:fed0>
Dec 23 07:46:42 crust.ovh ntpd[481]: 2001:678:8::123 local addr 2a01:e0a:82d:d0c0:321f:9aff:fed0:33ba ->>
Dec 23 07:46:42 crust.ovh ntpd[481]: 178.249.167.0 local addr 192.168.0.46 -> <null>
Dec 23 07:46:42 crust.ovh systemd[1]: Stopping Network Time Service...
Dec 23 07:46:42 crust.ovh systemd[1]: ntp.service: Succeeded.
Dec 23 07:46:42 crust.ovh systemd[1]: Stopped Network Time Service.
Dec 23 07:46:42 crust.ovh systemd[1]: ntp.service: Consumed 30.670s CPU time.