We share planned and unplanned service disruptions of the Tideways Performance Monitoring and Profiling service here. If you are experiencing problems or want to inquire about ongoing issues, please write to email@example.com.
Due to a planned node migration in our Elasticsearch cluster, data processing will temporarily be slow. The migration should be completed shortly and data processing should catch up quickly. We will keep you posted.
The host migration is complete and our queues are clear again. Everything is operational.
Thursday 15th April 2021
No incidents reported
Wednesday 14th April 2021
Slow data processing
Processing of incoming data is slower than expected, and we are experiencing a delay of a few minutes between accepting data and making it visible in the UI.
Edit 12:40: Our Elasticsearch cluster has come under load from what appears to be a noisy neighbour in our hosting environment. The load recovered after 15 minutes and the backlog of tasks should be processed within the next 10-15 minutes.
Tuesday 13th April 2021
Data ingestion service is down
Due to high load and out-of-memory errors on both Beanstalk queueing servers between 17:17 and 18:12 Europe/Berlin time, we suffered our longest outage in the last few years.
Unfortunately the Beanstalk servers reached their memory limit, so no monitoring, profiling and exception tracking data could be stored from 17:30 to 18:12, and monitoring data collected on your applications during that window was lost. You will see gaps in the graphs for this time frame. We are terribly sorry for this outage and for its impact on your ability to investigate performance problems during this almost 60-minute period.
Over the next days and weeks we will analyse and learn from this outage and work on a solution that prevents these problems with our queueing and data ingestion servers in the future.
17:30: Our alerting went off.
Edit 17:52: We found the problem: our Beanstalk queue servers were under high load from a monitoring process that was maxing out the CPUs. This happened on both queue servers at the same time, so neither could act as a fallback. It looks like Tideways was not able to accept data reliably from 17:17 until around 17:50. We are investigating further.
Edit 18:05: After disabling the monitoring service, our queue servers are not coming back up and we are investigating why.
Edit 18:15: While under load from the runaway monitoring plugin, our queue servers kept ingesting jobs until they reached the machines' memory limit. We use Beanstalk, an in-memory database backed by a log file, so restarting the service immediately led to the out-of-memory killer stopping the process again. We have now temporarily increased the RAM of both queue servers significantly to process the pending data and are seeing a quick recovery.
Edit 18:26: Everything is processing again. We have a large backlog, so expect some time until current data is displayed. We will make a more thorough analysis now and update this page when we know more.
Edit 18:36: We estimate everything will be up to date again in 15-25 minutes.