Due to high load and out-of-memory errors on both Beanstalk queueing servers between 17:17 and 18:12 Europe/Berlin time, we suffered our longest outage in the last few years.
Unfortunately, the Beanstalk servers reached their memory limit, so the monitoring, profiling and exception tracking data collected on your applications from 17:30 to 18:12 could not be stored in Tideways. You will see gaps in the graphs for this time frame. We are terribly sorry for this outage, its impact and its duration, and for how it limited your ability to investigate performance problems during this almost 60 minute period.
Over the next days and weeks we will analyse and learn from this outage and work on a solution that avoids these problems with our queueing and data ingestion servers in the future.
17:30: Alerting went off.
Edit 17:52: We found the problem: our Beanstalk queue servers were under high load from a monitoring process that was clogging the CPUs. This happened to both queue servers at the same time, so neither could act as a fallback. It looks like Tideways was not able to accept data reliably from 17:17 until around 17:50. We are investigating further.
Edit 18:05: After disabling the monitoring service, our queue services are not coming back up and we are investigating why.
Edit 18:15: While under load from the monitoring plugin running rampant, our queue servers kept ingesting jobs until they reached the machines' memory limit. We are using Beanstalk, which is an in-memory database backed by a log file. Restarting the service immediately led to the out-of-memory killer stopping the process again. We have now temporarily increased the RAM of both queue servers massively to process the pending data and are seeing a quick recovery.
Edit 18:26: Everything is processing again. We have a large backlog, so expect it to take some time until current data is displayed. We will now make a more thorough analysis and update this page when we know more.
Edit 18:36: We estimate everything will be up to date again in 15-25 minutes.
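The 18:15 update describes an in-memory queue that kept accepting jobs until the machine ran out of memory. As a purely illustrative sketch of the opposite behaviour (this is not Tideways code and not a Beanstalk API; the class and exception names are hypothetical), a queue can track the bytes it holds and refuse new jobs once a fixed budget is reached, so producers get an error instead of the process being killed:

```python
from collections import deque
from typing import Deque, Optional


class QueueFull(Exception):
    """Raised when accepting a job would exceed the memory budget."""


class BoundedJobQueue:
    """Hypothetical in-memory FIFO queue with a fixed byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._jobs: Deque[bytes] = deque()

    def put(self, payload: bytes) -> None:
        # Reject the job up front instead of growing until the OS steps in.
        if self.used_bytes + len(payload) > self.max_bytes:
            raise QueueFull(
                f"budget of {self.max_bytes} bytes would be exceeded"
            )
        self._jobs.append(payload)
        self.used_bytes += len(payload)

    def reserve(self) -> Optional[bytes]:
        # Hand out the oldest job and release its share of the budget.
        if not self._jobs:
            return None
        payload = self._jobs.popleft()
        self.used_bytes -= len(payload)
        return payload


if __name__ == "__main__":
    queue = BoundedJobQueue(max_bytes=10)
    queue.put(b"12345")
    queue.put(b"12345")
    try:
        queue.put(b"x")
    except QueueFull as exc:
        print("rejected:", exc)
```

The trade-off in this sketch is that rejected jobs are lost unless the producer retries or buffers them, so it only moves the problem to a place where it can be handled deliberately.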