Our queues have not been processing data since 04:00 UTC last night. We have fixed the problem and are adding capacity to process the backlog of data.
Edit 09:17 Europe/Berlin:
The queues should be back up to date in 15-20 minutes.
Edit 09:41 Europe/Berlin:
Processing is taking a little longer than expected; the queues should be fully processed in 10-15 minutes.
Edit 09:55 Europe/Berlin:
All queues are now up to date.
We are very sorry this incident affected your use of Tideways, and for the long delay between the start of the incident and the fix. We have let you down in providing highly reliable application monitoring. An outage this long is not what we expect from ourselves, and in the future we aim to resolve incidents faster when data is not being processed and you cannot see current monitoring data. Our ongoing work on the stability and resiliency of our systems will make this kind of event less likely.
During the historical aggregation that runs after midnight, the workers processing the measurement data got stuck in a deadlock on a single transaction involving a table shared between aggregation and processing. Because this is literally the last transaction performed in measurement processing, every task waited 50 seconds for the deadlock to time out before its worker could pick up the next job. Since we process around 2,000 jobs of this type per minute, the queues built up a backlog of over 250,000 jobs by 09:00, when the issue was detected.
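As a rough back-of-envelope check, the numbers above are consistent with each other. Only the arrival rate (~2,000 jobs/min), the ~50 s lock-wait timeout, and the ~5 hour window come from this report; the worker count below is a purely illustrative assumption chosen to show how the degraded throughput produces a backlog in the reported range:

```python
# Back-of-envelope model of backlog growth during the incident.
# ARRIVAL_RATE, LOCK_WAIT_TIMEOUT_S and the 5 h window come from the
# post-mortem; WORKERS is a hypothetical value for illustration only.

ARRIVAL_RATE = 2000          # jobs per minute entering the queue
LOCK_WAIT_TIMEOUT_S = 50     # each stuck job blocks its worker for ~50 s
WORKERS = 1000               # assumed worker count (not from the report)

# With every job waiting out the 50 s timeout, each worker completes at
# most 60 / 50 = 1.2 jobs per minute.
degraded_throughput = WORKERS * 60 / LOCK_WAIT_TIMEOUT_S  # jobs per minute

minutes = 5 * 60             # roughly 04:00 to 09:00 UTC
backlog = max(0.0, (ARRIVAL_RATE - degraded_throughput) * minutes)

print(f"degraded throughput: {degraded_throughput:.0f} jobs/min")
print(f"backlog after {minutes} min: {backlog:,.0f} jobs")
# backlog comes out around 240,000 jobs, in line with the
# "over 250,000 jobs" observed in the incident.
```

Under these assumptions the model yields a backlog of about 240,000 jobs, close to the observed figure; the exact number depends on the assumed worker count.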
Our monitoring alerted the on-call engineer to this condition early on, but unfortunately, being only human, he did not wake up from the mobile phone alerts. We are working on improving our on-call notifications, but ultimately it is more important to make our systems self-healing and more resilient against this kind of failure.
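One common shape for such self-healing is a supervisor that restarts any worker that has not completed a job recently. This is only a minimal sketch of that idea, not Tideways' actual implementation; the class names, the `check_workers` helper, and the threshold are all hypothetical:

```python
import time

# Hypothetical supervisor sketch: flag workers whose last completed job is
# older than a threshold, so they can be restarted automatically instead of
# waiting for a human. Names and numbers are illustrative assumptions.

STALL_THRESHOLD_S = 120  # comfortably above the 50 s lock-wait timeout


class WorkerHandle:
    """Minimal stand-in for a handle on a queue-worker process."""

    def __init__(self, name: str):
        self.name = name
        self.last_job_finished_at = time.monotonic()

    def restart(self) -> str:
        # A real supervisor would kill and respawn the worker process here.
        self.last_job_finished_at = time.monotonic()
        return f"restarted {self.name}"


def check_workers(workers, now=None):
    """Return the workers that look stuck and should be restarted."""
    now = time.monotonic() if now is None else now
    return [w for w in workers
            if now - w.last_job_finished_at > STALL_THRESHOLD_S]
```

Run periodically (e.g. from a cron job or a monitoring loop), this would have restarted the deadlocked workers within minutes instead of hours, independent of whether an alert woke anyone up.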
After the problem was identified, a simple restart of all the workers resolved the deadlock, and the queues were processed at full speed again.
To prevent this in the future, we have added three fixes: