Monday 23rd January 2017

Data Collection Queues not processing since 04:00 UTC

Our queues have not been processing data since 04:00 UTC last night. We have fixed the problem and are adding capacity to process the backlog of data.

Edit 09:17 Europe/Berlin:

The queues should be back up to date in 15-20 minutes.

Edit 09:41 Europe/Berlin:

Processing is taking a little longer than expected; the queues are now estimated to be fully processed in 10-15 minutes.

Edit 09:55 Europe/Berlin:

All queues are now up to date.

Postmortem

We are very sorry this incident affected your use of Tideways and for the long delay between the start of the incident and the fix. We have let you down in providing highly reliable application monitoring. An outage this long is not what we expect from ourselves, and we want to get faster at resolving incidents where data is not processed and you cannot see current monitoring data. Our ongoing work on the stability and resiliency of our systems aims to make this kind of event less likely.

What happened?

During the historical aggregation that takes place after midnight, the workers processing measurement data got stuck in a deadlock on a single transaction involving a table shared between aggregation and processing. Because this is literally the last transaction performed in measurement processing, every task waited 50 seconds for the lock to time out before moving on to the next job. Since we process around 2,000 jobs of this type per minute, the queues built up a backlog of over 250,000 jobs by 09:00, when the issue was detected.
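For context, the 50-second delay per job corresponds to InnoDB's default lock wait timeout, assuming a MySQL-backed setup. As an illustration only, with hypothetical table and function names rather than our actual worker code, each job roughly behaved like this while aggregation held the lock on the shared table:

    # Illustrative Python sketch, not our production worker code.
    import mysql.connector

    ER_LOCK_WAIT_TIMEOUT = 1205  # MySQL: "Lock wait timeout exceeded"

    def finish_measurement_job(conn, job_id):
        cursor = conn.cursor()
        try:
            # The final transaction of measurement processing writes to a
            # table that the nightly historical aggregation also locks.
            # While that lock was held, this statement blocked for the full
            # innodb_lock_wait_timeout (50 seconds by default) before failing.
            cursor.execute(
                "UPDATE aggregation_state SET last_measurement_at = NOW() WHERE id = %s",
                (job_id,),
            )
            conn.commit()
        except mysql.connector.Error as err:
            if err.errno == ER_LOCK_WAIT_TIMEOUT:
                conn.rollback()  # 50 seconds lost, then on to the next job
            else:
                raise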

Our monitoring alerted the on-call engineer to this condition early on, but unfortunately, being only human, he did not wake up from the mobile phone alerts. We are working to improve our on-call notifications, but ultimately it is more important to make our systems self-healing and more resilient against this kind of failure.

After the problem was identified, a simple restart of all the workers resolved the deadlock, and the queues were processed at full speed again.

How can we prevent this from happening again?

For the future, we have implemented three fixes (a rough sketch follows the list):

  • On the macro level, we integrated deadlock detection into our workers and now restart them automatically if we run into the same problem again.
  • On the micro level, we changed the measurement worker to keep running even if the database is only available for reads. All database writes in the measurement processing workers are non-essential, so they can fail gracefully.
  • The deadlock timeout for workers was reduced from the default of 50 seconds to a maximum of 10 seconds.
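
As a rough sketch of how these three changes fit together, assuming a MySQL-backed worker (the function names, table names, and SQL below are hypothetical, not our production code):

    # Illustrative Python sketch of the three fixes; names are hypothetical.
    import sys
    import mysql.connector

    ER_LOCK_WAIT_TIMEOUT = 1205  # MySQL: lock wait timeout exceeded
    ER_LOCK_DEADLOCK = 1213      # MySQL: deadlock found when trying to get lock

    def run_worker(conn, fetch_next_job, process_job):
        cursor = conn.cursor()
        # Fix 3: lower this session's lock wait timeout from the 50-second
        # default to 10 seconds, so a blocked transaction fails fast.
        cursor.execute("SET SESSION innodb_lock_wait_timeout = 10")

        while True:
            job = fetch_next_job()
            try:
                process_job(job)               # essential processing
                record_bookkeeping(conn, job)  # non-essential writes
            except mysql.connector.Error as err:
                if err.errno in (ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK):
                    # Fix 1: on a deadlock or lock timeout, exit so the
                    # process supervisor restarts the worker with a fresh
                    # database connection.
                    conn.rollback()
                    sys.exit(1)
                raise

    def record_bookkeeping(conn, job):
        cursor = conn.cursor()
        try:
            cursor.execute(
                "UPDATE job_state SET processed_at = NOW() WHERE id = %s",
                (job["id"],),
            )
            conn.commit()
        except mysql.connector.Error:
            # Fix 2: these writes are non-essential, so if the database is
            # read-only or the write fails, roll back and carry on.
            conn.rollback()

With the shorter timeout and the automatic restart, the same kind of lock contention should cost each worker a few seconds at most instead of stalling the whole queue.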