Reboot of all host machines in cluster

Incident Report for Tideways

Resolved

Tonight all host machines that run our virtual machines, including single points of failure such as the primary database, are being rebooted. This can lead to short downtimes or delays.

**Edit 03:36:** A first analysis shows that the reboot, which caused a long restart of the primary database server, led to the loss of a three-hour stretch of ingested data, from 00:19 to 03:39 Europe/Berlin. We will analyse which problems in our failsafes caused this long outage, as the system is designed not to fail this way.

**Edit 03:27:** The primary database server is up again and the workers are processing the backlog. This will take a while, and given the length of the primary's downtime we are not yet sure whether the data is complete.

**Edit 01:53:** Almost all machines have been restarted, but the restart of the primary database master is still outstanding.

**Edit 00:26:** The database master is currently being rebooted, which is causing downtime for the UI and the workers. Data is still being ingested and will be processed once the primary is up again.

Posted Oct 05, 2021 - 02:00 CEST