I run an old desktop mainboard as my homelab server. It runs Ubuntu smoothly at loads between 0.2 and 3 (whatever unit that is).
Problem:
Occasionally, the CPU load skyrockets above 400 (yes really), making the machine totally unresponsive. The only solution is the reset button.
Solution:
- I haven’t found what the cause might be, but I think that a reboot every few days would prevent it from ever happening. That could be done easily with a crontab line.
- Alternatively, I would like to have some dead-simple script running in the background that simply watches the CPU load and executes a reboot when it climbs over a given threshold.
-> How could such a CPU-load-triggered reboot be implemented?
edit: I asked ChatGPT to help me create a script that is started by crontab every X minutes. The script has a kill-threshold that does a kill -9 on the top process, and a higher reboot-threshold that … reboots the machine. Before doing either (or neither), it writes a log line. I hope this will keep my system running, and I will review the log file to see how it fares. Or it might inexplicably break my system. Fun!
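For the curious, the core of it looks roughly like this (the thresholds, log path and the way the top process is picked are just example choices, not necessarily what you'd want):

```bash
#!/bin/bash
# Load watchdog sketch, meant to be run from root's crontab every few minutes.
# Thresholds, log path and the process-picking logic are placeholder choices.
KILL_THRESHOLD=50      # kill -9 the top CPU process above this 1-min load
REBOOT_THRESHOLD=100   # reboot above this 1-min load
LOGFILE=/var/log/load-watchdog.log

# 1-minute load average from /proc/loadavg, integer part only
LOAD=$(cut -d ' ' -f1 /proc/loadavg)
LOAD_INT=${LOAD%.*}

# PID and name of the process currently using the most CPU
TOP_LINE=$(ps -eo pid,comm,%cpu --sort=-%cpu | sed -n 2p)

echo "$(date -Is) load=$LOAD top=[$TOP_LINE]" >> "$LOGFILE"

if [ "$LOAD_INT" -ge "$REBOOT_THRESHOLD" ]; then
    echo "$(date -Is) load over $REBOOT_THRESHOLD, rebooting" >> "$LOGFILE"
    /sbin/reboot
elif [ "$LOAD_INT" -ge "$KILL_THRESHOLD" ]; then
    TOP_PID=$(echo "$TOP_LINE" | awk '{print $1}')
    echo "$(date -Is) load over $KILL_THRESHOLD, killing PID $TOP_PID" >> "$LOGFILE"
    kill -9 "$TOP_PID"
fi
```

It gets started from root's crontab with something like `*/10 * * * * /usr/local/bin/load-watchdog.sh` (interval and path are whatever suits you).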
Load average of 400???
You could install sysstat (or similar) and use output from sar to watch for thresholds and reboot if exceeded.
The upside of doing this is that you may also be able to narrow down exactly what is going on when this happens, since sar records stats for CPU, memory, disk, etc. So you can go back after the fact and see whether it is just a CPU thing or more than that (unless the problem happens instantly rather than building up gradually).
PS: rather than using cron, you could run a script as a daemon that runs sar at 1 sec intervals.
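Something along these lines, untested; I'm reading the 1-minute load straight from /proc/loadavg here for simplicity, but you could just as well parse `sar -q 1 1`:

```bash
#!/bin/bash
# Always-running load watchdog sketch; the threshold is an arbitrary example.
THRESHOLD=100

while true; do
    # 1-minute load average, integer part only
    load=$(cut -d ' ' -f1 /proc/loadavg)
    if [ "${load%.*}" -ge "$THRESHOLD" ]; then
        logger -t load-watchdog "load $load over $THRESHOLD, rebooting"
        /sbin/reboot
    fi
    sleep 1
done
```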
Another thought is some kind of external watchdog. Curl webpage on server, if delay too long power cycle with smart home outlet? Idk. Just throwing crazy ideas out there.
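Something like this on a second machine, run from its cron every minute or so (both URLs are made up; the outlet endpoint would be whatever your smart plug's API looks like):

```bash
#!/bin/bash
# External watchdog sketch: if the server doesn't answer in time, power-cycle it.
SERVER_URL="http://homelab.local:8080/"       # any page the server normally serves
OUTLET_API="http://smartplug.local/cycle"     # hypothetical smart-outlet endpoint

if ! curl --silent --max-time 10 --output /dev/null "$SERVER_URL"; then
    curl --silent --max-time 10 "$OUTLET_API"
fi
```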
Thank you for these ideas, I will read up on sysstat and sar and give it a go.
Also smart to have the script always running, sleeping, rather than launching it at intervals.
I know all of this is a poor hack, and I must address the cause - but so far I have no clues what’s causing it. I’m running a bunch of Docker containers so it is very likely one of them painting itself into a corner, but after a reboot there’s nothing to see, so I am now starting with logging the top process. Your ideas might work better.
Crontab to just auto reboot daily is probably better - if your PC becomes unresponsive I doubt it would be able to execute another script on top of everything. Ideally though, you’d do some log diving and figure out the cause.
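Something like this in root's crontab would do it (time of day is arbitrary):

```
# reboot every night at 04:00
0 4 * * * /sbin/shutdown -r now
```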
This issue doesn’t happen very often, maybe every few weeks. That’s why I think a nightly reboot is overkill, and weekly might be missing the mark? But you are right in any case: regardless of what the cron says, the machine might never get around to executing it.
Have you tried turning your swap off?
Nope, haven’t. It says I have 2 GB of swap on a 16 GB RAM system, and that seems reasonable.
Why would you recommend turning swap off?
To check whether your problem is caused by excessive memory usage that forces constant swapping. If it is, turning swap off means the OOM killer will kill some process instead of the whole machine grinding to a crawl.
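For example (and `sudo swapon -a` turns it back on afterwards):

```bash
sudo swapoff -a   # disable all swap
free -h           # confirm: the Swap line should now show 0B
```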
The symptoms you describe are exactly what happens to my machine when it runs out of memory and starts swapping really hard. This is easy to check: see whether disk I/O also spikes when it happens, and whether memory usage is high.
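Something as simple as leaving vmstat running can show it, for example:

```bash
# si/so (swap in/out) and wa (I/O wait) climbing together with the load
# spike is the classic sign of thrashing
vmstat 5
free -h
```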
Run SMART short tests on your drives. Any "pending sectors" at all mean the drive is failing.
If the test shows any problems, especially pending sectors, replace the drive.
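With smartmontools installed, roughly (replace /dev/sda with your actual drive):

```bash
sudo smartctl -t short /dev/sda     # start a short self-test
# wait a couple of minutes, then:
sudo smartctl -a /dev/sda | grep -i -E 'pending|self-test'
```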
If your board supports it, use a hardware watchdog.
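On Debian/Ubuntu the `watchdog` package can do exactly this kind of load-triggered reboot, backed by the board's hardware watchdog timer. Roughly (the values here are examples):

```bash
sudo apt install watchdog
# then in /etc/watchdog.conf, uncomment/set for example:
#   watchdog-device = /dev/watchdog
#   max-load-1      = 100    # reboot if the 1-minute load goes above this
sudo systemctl enable --now watchdog
```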