Hardware failure and nodes need migration
Friday 12th June 2020 12:45:00


A hardware node in our cluster failed, and the virtual machines hosted on it need to be migrated to new hardware nodes in the next few hours. This will affect the trace storage and time explorer components attached to the migrated nodes. These components will mostly disable themselves for the duration of the migration, or show 404s that will resolve once the migration is done.

Edit June 12th, 18:48: The migration was moved to next week as it was too close to the weekend.

Edit June 23rd, 09:00: The migration will be performed this morning (Berlin time). Affected components will disable themselves in the UI and some traces might show 404s during the migration. We expect it to take a few minutes.

Edit June 23rd, 09:18: It looks like the migration is causing more than just the affected components to go down. We are investigating.

Edit June 23rd, 09:45: Node migration is done. Everything is back to normal. We identified a high connection timeout of 5 seconds to the Redis server as the cause of the ripple effect of failures, and will deploy a fix to reduce the timeout to a much lower value.
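
To illustrate why the 5-second connection timeout from the 09:45 update matters: every request that touches an unreachable server blocks for the full timeout, so waiting requests pile up and unrelated components slow down too. The sketch below is not our production code. It uses only Python's standard `socket` module, and the names `CONNECT_TIMEOUT_SECONDS`, `can_connect` and `fetch_trace_index`, as well as the 0.2 second value, are assumptions made for the example.

```python
import socket

# Hypothetical value: the update only says "much lower" than 5 seconds.
CONNECT_TIMEOUT_SECONDS = 0.2


def can_connect(host: str, port: int,
                timeout: float = CONNECT_TIMEOUT_SECONDS) -> bool:
    """Return True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections as well as timeouts
        # (socket.timeout is a subclass of OSError).
        return False


def fetch_trace_index(host: str, port: int) -> str:
    """Hypothetical caller that degrades quickly instead of blocking."""
    if not can_connect(host, port):
        return "unavailable"
    return "ok"
```

With a short timeout, a caller such as `fetch_trace_index` gives up after a fraction of a second and can report the component as unavailable, rather than holding a worker for 5 seconds per request.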

Edit June 23rd, 09:52: The backlog of traces that were accepted during the downtime is still being processed. Expect the traces queue to be up to date in 10-15 minutes.