Brought down by cronjobs – analysis of an outage today
Last night, there was a disruption in the network of the data center our servers run in. We do not yet know the source or exact nature of the disruption, but it probably caused increased latencies and error rates in the connection between our application servers and our database servers. This alone would not have been a problem, but it triggered a series of other problems that led to a full outage of our system starting at approximately 6:57 CET. I was alerted about the issue (and woken up) at 7:02 CET. By 7:04 CET, the system was available again, but three of our four application servers were still not working.
In this post I want to go into some more detail on what happened, why, and what we’re going to do about it.
What happened
In pretix, there are a number of tasks that need to run periodically in the background. Some of them are clean-up tasks, such as the removal of old ticket files; others interact directly with users, for example by sending out payment reminder emails or sending tickets to attendees shortly before the event.
pretix takes a very simple approach to these tasks: there is one command, python -m pretix runperiodic, that executes all of these operations. A few are triggered as background tasks, but most of them run synchronously within that command.
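To make this more concrete, here is a minimal sketch of the general pattern, assuming a signal-based registry. This is an illustration, not pretix's actual internals:

```python
# Minimal sketch (assumption: not pretix's actual internals). Periodic tasks
# are plain functions connected to a signal that a management command fires.
from django.dispatch import Signal, receiver

periodic_task = Signal()

@receiver(periodic_task)
def clean_up_ticket_files(sender, **kwargs):
    # e.g. delete generated ticket files older than a retention threshold
    ...

def runperiodic():
    # All connected receivers run synchronously, one after the other
    periodic_task.send(sender=None)
```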
For self-hosted installations, we recommend running this command as a cronjob every 30 minutes. On our hosted system, we run it even more often: to make sure it still runs if one of our application servers fails, we run it on every one of them at staggered intervals, six times per hour in total if all servers are healthy.
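For illustration, a single server's crontab entry could look like this (the minute offsets and the installation path are assumptions, not our actual configuration; the other servers would use the same entry with shifted minute values):

```
# m   h  dom mon dow  command
17,47 *  *   *   *    /var/pretix/venv/bin/python -m pretix runperiodic
```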
As we grow, these somewhat naively implemented tasks take longer and longer to complete on every run. We have been vaguely aware of this, but did not consider it an immediate problem. However, due to the unidentified network problems last night, the tasks took so long to complete that cron started new jobs before the old ones had finished. Over the course of a few hours, the accumulating jobs caused the servers to run out of memory and become unusable.
Immediate solution
After identifying the problem around 7:14 CET, I rebooted the application nodes one by one. Then I disabled the cronjob entirely on one of the servers to make sure at least that server could not be brought down by this issue again. After monitoring the system for a short while, I was able to go back to sleep around 7:28 CET.
Mid-term solution
Later today, I measured the performance of the individual tasks within each cronjob run. Under normal conditions, a single run took just over 14 minutes, which is dangerously close to the interval the jobs run at and means that multiple jobs were running somewhere in the cluster at almost any given time, creating unnecessary load on the system. Of these 14 minutes, 11 were spent just figuring out which orders should receive an expiry warning.
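The measurement itself did not need anything fancy; a small helper along these lines is enough to attribute run time to individual tasks (the task function below is a hypothetical stand-in):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name):
    # Measure and print the wall-clock duration of the wrapped block
    start = time.monotonic()
    try:
        yield
    finally:
        print(f"{name}: {time.monotonic() - start:.1f}s")

def send_expiry_warnings():
    # Hypothetical stand-in for one of the periodic tasks
    time.sleep(0.1)

with timed("send_expiry_warnings"):
    send_expiry_warnings()
```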
I optimized the code handling the slowest tasks in a series of changes and was able to reduce the run time of the command from 14 minutes to around one minute, provided there are no actual emails to send out. These changes went live at around 16:45 CET.
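I won't walk through the actual commits here, but most of the win came from the usual suspect: pushing filtering work into the database instead of iterating over candidate rows in Python. A simplified sketch of that kind of change, with hypothetical model fields and helpers:

```python
from datetime import timedelta
from django.utils.timezone import now

warning_period = timedelta(days=2)  # hypothetical value

# Before (slow): fetch every pending order and decide in Python.
# `Order` and `send_expiry_warning` are hypothetical stand-ins.
for order in Order.objects.filter(status=Order.STATUS_PENDING):
    if order.expires <= now() + warning_period and not order.expiry_warning_sent:
        send_expiry_warning(order)

# After (fast): let the database narrow the set down first
to_warn = Order.objects.filter(
    status=Order.STATUS_PENDING,
    expires__lte=now() + warning_period,
    expiry_warning_sent=False,
)
for order in to_warn:
    send_expiry_warning(order)
```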
Additionally, we switched from cron to systemd timers for scheduling these commands. Besides some nicer tooling, the main benefit of systemd timers is that a new run is never started while the previous one is still going: a timer only activates its service unit, and a unit that is already running is not activated a second time. This prevents a similar pile-up if the run time increases again in the future.
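For illustration, a timer/service pair for this could look roughly like the following (unit names and paths are assumptions, not our actual configuration):

```
# /etc/systemd/system/pretix-runperiodic.service
[Unit]
Description=pretix periodic tasks

[Service]
Type=oneshot
User=pretix
ExecStart=/var/pretix/venv/bin/python -m pretix runperiodic
```

```
# /etc/systemd/system/pretix-runperiodic.timer
[Unit]
Description=Run pretix periodic tasks regularly

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```

The timer is enabled with systemctl enable --now pretix-runperiodic.timer. Because the service is a oneshot unit that stays in the "activating" state for as long as the command runs, a timer tick that fires during that window simply does nothing, so overlapping runs cannot occur on a single machine.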
Long-term solutions
In theory, not all of the commands we run in the timer need to be executed every couple of minutes: once a day would be just fine for many of them. We're therefore considering a switch to a more elaborate scheduling setup for these tasks, either implemented within pretix or through a tool like celery beat. However, any such tool comes with its own challenges with regards to availability if we do not want to create a single point of failure, so we'll need to evaluate this very carefully.
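As a sketch of what that could look like with celery beat (the task names and intervals here are made up for illustration):

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("pretix")  # in a real deployment, the existing Celery app

# Hypothetical schedule; task names and intervals are illustrative only
app.conf.beat_schedule = {
    "send-expiry-warnings": {
        "task": "pretix.base.send_expiry_warnings",
        "schedule": 600.0,  # seconds: this one still needs to run often
    },
    "clean-up-ticket-files": {
        "task": "pretix.base.clean_up_ticket_files",
        "schedule": crontab(hour=3, minute=30),  # once a day is plenty
    },
}
```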