We share planned and unplanned service disruptions of the Tideways Performance Monitoring and Profiling service here. If you are experiencing problems or want to inquire about ongoing problems, please write to firstname.lastname@example.org.
Tonight all host machines that run our virtual machines, including single points of failure such as the primary database, are being rebooted. This can lead to short downtimes or delays.
A first analysis shows that the reboot caused a long restart of the primary database server, which in turn caused a three-hour stretch of ingested data to be lost, from 00:19 to 03:39 Europe/Berlin. We will analyse what problems our failsafes had that caused this long outage, as the system is designed not to fail this way.
The primary database server is up again and the workers are processing the backlog. This will take a while, and given the length of the primary's downtime we are not yet sure whether the data is complete.
Almost all machines have been restarted, but the reboot of the primary database server is still outstanding.
The database master is currently being rebooted, which is causing downtime for the UI and the workers. Data is still being ingested and will be processed once the primary is up again.