We are sharing planned and unplanned service disruptions of the Tideways Performance Monitoring and Profiling service here. If you are experiencing problems or want to inquire about ongoing problems, please write to firstname.lastname@example.org.
Data Collection: Delays in Data Processing due to Kernel Upgrades
Our hosting provider is upgrading the Linux kernel on various application and database machines over the course of the night, and we expect delays in data processing as a result.
We have prepared our queue system to automatically recover from the problem that caused large delays over several hours during yesterday's kernel upgrades & reboots.
Edit 00:38 A queue server has been restarted already and workers have processed all items, so no backlog has occurred.
Edit 00:45 One of our Elasticsearch nodes has just restarted and we are watching it catch up on state with the other nodes in the cluster. Given the size, it will take a while before the next node is restarted. This is a standard operational scenario with Elasticsearch and causes no interruptions for data collection or the UI.
Edit 01:00 The database master is being restarted, causing a short downtime for the UI. Data collection should be unaffected.
Edit 01:11 The database master is back up. Because the reboots happened in quick succession and one of the cache servers was not primed before the database went down, a subset of data collection requests had errors between 00:57 and 01:11, and some of this data was sadly lost.
Edit 01:28 All critical services have now been rebooted. A few machines are left, but we don't expect any further interruptions. We will keep the incident open until our hosting provider SysEleven gives the "all ok", probably this morning at 08:00 Europe/Berlin time.
Edit 05:37 All clear.
Wednesday 17th January 2018
Data Collection: Processing large Queue Backlog
After a Spectre/Meltdown-related reboot of one of our worker queue servers, we are slow in processing a large backlog that has piled up since 4:00 Europe/Berlin time this morning. We are investigating whether we can speed up the recovery.
4:00 The reboot of a queue server due to a Linux kernel update led to an uneven distribution of unprocessed jobs, which our workers don't handle well. Processing the queue server with the large backlog is very slow.
Edit 11:30 We have noticed the backlog and are now investigating the cause and remedies.
Edit 12:15 We have processed all remaining queue jobs and are up to date again.