Delay in data processing - Leading to short term loss of newly incoming measurement data

Incident Report for Tideways

Postmortem

Summary

All dates in this postmortem are specified in UTC.

On Wed, November 5, from 21:30 to 22:55, we experienced issues with our incoming measurement data processing.

Parts of your incoming measurement data were lost; for most of your Tideways Project(s) you will see a data gap from 22:24 to 22:55. For data from 21:30 to 22:24 some of the recorded request counts might be lower than what actually occurred, with the number of missing requests gradually increasing until the full loss starting at 22:24.

We want to deeply apologize for the disruption in your monitoring.

In this post we want to explain the details of what happened and which steps we are taking to prevent this from happening again in the future.

No other data was lost or corrupted, and all data processing is running smoothly again. If you have any questions or concerns, please contact us at support@tideways.com.

What happened

We want to give you some insight into what happened. The specific order of events is outlined in the timeline below.

Our alerting informed us about processing delays in our beanstalkd queue backend. During the investigation, we found that an unusually large number of expensive-to-process messages was being sent by a specific monitored Project.

Our worker processing speed dropped to 40% of normal throughput but didn’t stall completely. Investigation of stuck workers in this unfamiliar scenario took long enough for the queue to fill up and reject new messages.
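
To illustrate why a slowdown that never fully stalls can still fill a bounded queue, here is a rough back-of-the-envelope sketch. Only the 40% throughput ratio comes from this report; the arrival rate, normal processing rate, and queue capacity are invented for illustration and are not Tideways figures.

```python
def seconds_until_full(capacity, arrival_rate, normal_service_rate, throughput_ratio):
    """Estimate how long a bounded queue takes to fill once workers slow down.

    Rates are in messages per second. Returns None if the backlog never grows.
    """
    service_rate = normal_service_rate * throughput_ratio
    growth = arrival_rate - service_rate  # net backlog growth per second
    if growth <= 0:
        return None
    return capacity / growth


# Hypothetical numbers: 1,000 msg/s arriving, workers that normally handle
# 1,200 msg/s, and room for 1,560,000 messages in the queue.
eta = seconds_until_full(1_560_000, 1000, 1200, 0.40)
print(f"queue full after about {eta / 60:.0f} minutes")
```

The point of the sketch is that the time available for investigation depends only on the spare queue capacity and the gap between arrival and processing rates, so a larger buffer directly buys more time.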

To restore processing, the queue’s contents were backed up, and the queue was drained to stop it from rejecting new messages.

Still, more than half of our measurement data ingestion workers were stalled on these specific messages. Investigation into the codebases revealed an edge case that caused unnecessarily expensive MySQL queries.

After identifying the cause of these messages, we deployed a quick fix to reject them efficiently and resume normal operations.

Because of how Tideways is architected, our customers’ local Tideways Daemons buffer data and resend it when delivery fails. Due to these local buffers, the amount of flushed queue data will vary slightly from Project to Project.
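
The buffer-and-resend behaviour can be pictured with a minimal sketch like the one below. This is not the Tideways Daemon's implementation; the class name, the `send` callable, and the buffer size are all hypothetical and only illustrate the general pattern.

```python
from collections import deque


class BufferingSender:
    """Keeps payloads in a bounded local buffer until the backend accepts them."""

    def __init__(self, send, max_buffered=1000):
        # `send` is a hypothetical callable that returns True on acceptance.
        self._send = send
        # With maxlen set, the oldest payload is dropped once the buffer is full.
        self._buffer = deque(maxlen=max_buffered)

    def submit(self, payload):
        self._buffer.append(payload)
        self.flush()

    def flush(self):
        # Resend in order; stop at the first failure and keep the rest for later.
        while self._buffer:
            if not self._send(self._buffer[0]):
                return False
            self._buffer.popleft()
        return True

    def pending(self):
        return len(self._buffer)
```

Under this pattern, a sender with little traffic or a large buffer can ride out a longer backend outage than a busy one, which is one way the size of the visible gap can differ between Projects.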

After restoring operations and processing the backlog, we tried to restore the backed-up messages but found that the backup had been created incorrectly, due to our insufficient experience with this failure mode. Because the backup was unrecoverable, we could not process the pending data, which resulted in the data loss.

Timeline

  • 2025-11-05 21:33 Queue processing backlog starts increasing.
  • 2025-11-05 21:36 Engineer on call is alerted with a low-severity alert.
  • 2025-11-05 22:06 Alert is escalated to high severity, and the backup on-call engineer is notified.
  • 2025-11-05 22:08 Investigation is started by both engineers. Performance degradation in queue processing times is identified.
  • 2025-11-05 22:10 Queue workers are restarted, and processing improves. The backup on-call engineer signs off, with the main on-call engineer continuing to monitor the situation.
  • 2025-11-05 22:15 Backlog recovery is observed to be stalling, and a deeper investigation is started.
  • 2025-11-05 22:29 https://status.tideways.io is updated to inform our customers about the ongoing investigation.
  • 2025-11-05 22:48 Queue backend servers run out of memory and shut down.
  • 2025-11-05 22:49 Decision is made to back up and flush the queue’s pending jobs to restore operations.
  • 2025-11-05 22:49 Website becomes intermittently unavailable, and external monitoring alerts fire in addition to server monitoring.
  • 2025-11-05 22:50 Queue backend is restored and resumes operation. Backlog is building up again.
  • 2025-11-05 22:51 Direct cause of queue stalling is identified as specific messages with very long processing times.
  • 2025-11-05 23:05 Debugging of associated task processing code by the engineering team is started, investigating possible short-term fixes.
  • 2025-11-05 23:32 Possible fix is committed, reviewed, and deployed.
  • 2025-11-05 23:34 Deployment is successful and immediately resolves queue stalling.
  • 2025-11-05 23:51 Queue backlog is successfully processed, and the status page is updated.
  • 2025-11-06 00:25 Restoring the queue backup is unsuccessful. The process to communicate data loss is started.

Learnings and Improvements

What went well?

  • Our alerting proved very reliable and quick, informing us of every outage with the details we needed to act.
  • Response from the team was swift and collaborative with good communication throughout the process.
  • Hotfix deployment went smoothly due to well-established processes, practiced engineers, and automation.
  • After deploying the fix, processing the queue backlog went quickly because we had provisioned enough server capacity for data spikes.

What didn’t go well?

We have already identified a couple of improvement points, with more to follow over the next few weeks.

  • We lacked playbooks for managing our current queue infrastructure. Our unfamiliarity with operating the queues and the manual intervention needed to back up the queue data is what eventually led to the loss of customer data. We will create a process to handle this.
  • We will reevaluate our queuing technology and ensure we have a larger buffer that can store more data when processing lags behind. Because beanstalkd keeps the whole queue in memory, we couldn’t offload messages to disk, which would have given us an easier-to-maintain buffer.
  • Message processing will be improved to better handle the data ingest scenario that caused the issue.
  • We will develop the ability to turn off data ingestion on a per-project basis more easily, so that we can respond better to these types of scenarios, especially until we can fully ensure that a single project can’t saturate our processing system.
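
As a rough illustration of the last point, a per-project switch combined with a per-project cap could look like the sketch below. It is only one possible shape for such a control, not a description of what Tideways has built or will build; the class, its methods, and the in-flight limit are hypothetical.

```python
import threading


class IngestionGate:
    """Decides whether a message from a given project may be enqueued."""

    def __init__(self, max_in_flight_per_project):
        self._limit = max_in_flight_per_project
        self._disabled = set()   # projects switched off by an operator
        self._in_flight = {}     # project_id -> messages currently queued or processing
        self._lock = threading.Lock()

    def disable(self, project_id):
        with self._lock:
            self._disabled.add(project_id)

    def enable(self, project_id):
        with self._lock:
            self._disabled.discard(project_id)

    def try_acquire(self, project_id):
        """Return True and count the message if the project may ingest right now."""
        with self._lock:
            if project_id in self._disabled:
                return False
            current = self._in_flight.get(project_id, 0)
            if current >= self._limit:
                return False
            self._in_flight[project_id] = current + 1
            return True

    def release(self, project_id):
        """Call once a worker has finished (or rejected) the message."""
        with self._lock:
            current = self._in_flight.get(project_id, 0)
            if current > 0:
                self._in_flight[project_id] = current - 1
```

The switch lets an operator stop one project's ingestion by hand during an incident, while the cap bounds how much of the shared queue any single project can occupy without manual action.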
Posted Nov 06, 2025 - 16:53 CET

Resolved

The incident has been resolved. Some customers will see up to 30 minutes of data missing. We will update you in more detail during the next day. Current data processing is running stably and is being monitored.
Posted Nov 06, 2025 - 01:07 CET

Update

We are continuing to monitor for any further issues.
Posted Nov 06, 2025 - 00:51 CET

Monitoring

We have identified and resolved the issue and are monitoring the system.
Posted Nov 06, 2025 - 00:51 CET

Update

We are continuing to investigate this issue.
Posted Nov 06, 2025 - 00:15 CET

Update

We are continuing to investigate this issue.
Posted Nov 06, 2025 - 00:07 CET

Investigating

We are currently investigating an issue regarding delayed data processing.
Posted Nov 05, 2025 - 23:29 CET
This incident affected: Ingest / Data Collection and User Interface.