Some systems are experiencing issues

About This Site

We share planned and unplanned service disruptions of the Tideways Performance Monitoring and Profiling service here. If you are experiencing problems or want to inquire about an ongoing incident, please write to

Past Incidents

Tuesday 13th April 2021

Data ingestion service is down

Due to high load and out-of-memory errors on both Beanstalk queueing servers between 17:17 and 18:12 (Europe/Berlin time), we suffered our longest outage in the last few years.

Unfortunately the Beanstalk servers hit their memory limit, so no monitoring, profiling, or exception-tracking data could be stored from 17:30 to 18:12, and monitoring data collected on your applications during that window was lost. You will see gaps in the graphs for this time frame. We are terribly sorry for this outage and for its impact on your ability to investigate performance problems during this almost 60-minute period.

Over the next days and weeks we will analyse and learn from this outage and work on a solution that prevents these problems with our queueing and data ingestion servers in the future.


17:30: Our alerting went off.

Edit 17:52: We found the problem: our Beanstalk queue servers were under high load from a monitoring process that was maxing out the CPUs. This happened on both queue servers at the same time, so neither could act as a fallback. It looks like Tideways was not able to accept data reliably from 17:17 until around 17:50. We are investigating further.

Edit 18:05: Even after disabling the monitoring service, our queue servers are not coming back up, and we are investigating why.

Edit 18:15: While the monitoring plugin was running rampant, our queue servers kept ingesting jobs until they reached the machines' memory limit. We are using Beanstalk, an in-memory database backed by a log file, so restarting the service immediately led the out-of-memory killer to stop the process again. We have now temporarily increased the RAM of both queue servers massively to process the pending data and are seeing a quick recovery.
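Beanstalkd exposes queue depth and related counters through its plain-text `stats` command, which is one way to watch for the kind of backlog that filled these servers before it hits a memory limit. A minimal sketch in Python; the host, port, threshold, and sample numbers below are illustrative assumptions, not a description of our actual setup:

```python
import socket


def parse_stats(payload: str) -> dict:
    """Parse the flat `key: value` body of a beanstalkd `stats` reply."""
    stats = {}
    for line in payload.splitlines():
        if ":" not in line:
            continue  # skips the leading "---" YAML document marker
        key, _, value = line.partition(":")
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


def fetch_stats(host: str = "127.0.0.1", port: int = 11300) -> dict:
    """Send `stats` to a live beanstalkd and parse the reply.

    The server answers with a header line like `OK <bytes>\r\n`
    followed by exactly <bytes> bytes of YAML-style body.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        reader = sock.makefile("rb")
        sock.sendall(b"stats\r\n")
        header = reader.readline()          # e.g. b"OK 912\r\n"
        size = int(header.split()[1])
        return parse_stats(reader.read(size).decode())


# Hypothetical sample body, for illustration only:
sample = """---
current-jobs-ready: 48231
current-jobs-reserved: 120
total-jobs: 99182048
"""

stats = parse_stats(sample)
if stats["current-jobs-ready"] > 10_000:
    print("backlog warning, ready jobs:", stats["current-jobs-ready"])
```

Alerting on `current-jobs-ready` growth would flag a consumer stall well before the in-memory queue exhausts the machine's RAM.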

Edit 18:26: Everything is processing again. We have a large backlog, so expect some delay until current data is displayed. We will now make a more thorough analysis and update this page when we know more.

Edit 18:36: We estimate everything will be up to date again in 15-25 minutes.

Monday 12th April 2021

No incidents reported

Sunday 11th April 2021

No incidents reported

Saturday 10th April 2021

No incidents reported

Friday 9th April 2021

No incidents reported

Thursday 8th April 2021

No incidents reported

Wednesday 7th April 2021

Problems processing incoming data

Our workers are currently not processing data and we are investigating why.

Edit 13:30: Everything has caught up and recovered now. The problem was related to adding a new node to our Elasticsearch cluster, combined with a restart of one of the nodes during a period of shard synchronization.
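Elasticsearch reports whether shard synchronization is still in progress through its `_cluster/health` endpoint, so one way to avoid restarting a node mid-sync is to gate the restart on that health response. A hedged sketch of such a pre-restart check; the default `localhost:9200` URL and the sample health snapshots are illustrative assumptions:

```python
import json
from urllib.request import urlopen


def is_stable(health: dict) -> bool:
    """Return True only when the cluster is green and no shards
    are still relocating or initializing."""
    return (
        health["status"] == "green"
        and health["relocating_shards"] == 0
        and health["initializing_shards"] == 0
    )


def cluster_is_stable(base_url: str = "http://localhost:9200") -> bool:
    """Fetch _cluster/health from a live node and evaluate it."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return is_stable(json.load(resp))


# Hypothetical health snapshots, for illustration only:
syncing = {"status": "yellow", "relocating_shards": 2, "initializing_shards": 1}
settled = {"status": "green", "relocating_shards": 0, "initializing_shards": 0}
```

Running such a check in a loop between node rotations, and proceeding only once `is_stable` returns True, would prevent a restart from interrupting an ongoing shard synchronization.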

Edit 13:40: As we plan to rotate more Elasticsearch nodes today, this issue will be kept open as a reference in case further processing problems occur.