Thursday 18th January 2018

Delays in Data Collection and Data Processing due to Kernel Upgrades

Our hosting provider is upgrading the Linux kernel on various application and database machines over the course of the night, and we expect delays in data processing because of this.

We have prepared our queue system to automatically recover from the problem that caused large delays over several hours during yesterday's kernel upgrades and reboots.
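For readers curious what such an automatic recovery can look like, here is a minimal sketch of a worker that retries its queue connection with exponential backoff instead of failing hard while the queue server reboots. The function and queue names are hypothetical stand-ins for illustration, not our actual code:

```python
import random
import time

def connect_to_queue():
    # Hypothetical stand-in for connecting to the real queue server;
    # it fails randomly to simulate the server still rebooting.
    if random.random() < 0.5:
        raise ConnectionError("queue server still rebooting")
    return "connection"

def process_backlog(conn):
    # Stand-in for draining whatever items accumulated during the downtime.
    print(f"draining backlog via {conn}")

def recover_with_backoff(max_attempts=6, base_delay=1.0):
    """Retry the queue connection with exponential backoff, then drain the backlog."""
    for attempt in range(max_attempts):
        try:
            conn = connect_to_queue()
        except ConnectionError as exc:
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
        else:
            process_backlog(conn)
            return
    raise RuntimeError("queue server did not come back within the retry budget")

if __name__ == "__main__":
    recover_with_backoff()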

Edit 00:38 A queue server has already been restarted and the workers have processed all items, so no backlog has built up.

Edit 00:45 One of our Elasticsearch nodes has just been restarted and we are watching it catch up on state with the other nodes in the cluster. Given the size of the data, it will take a while before the next node is restarted. This is a standard operational scenario with Elasticsearch and causes no interruptions for data collection or the UI.
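The waiting itself is simple to automate: Elasticsearch exposes a cluster health API, and it is safe to move on to the next node once all shards have recovered and the status is back to "green". A small sketch of such a check (the cluster address is a placeholder, not our actual endpoint):

```python
import json
import time
import urllib.request

# Placeholder address; the real cluster endpoint is not part of this report.
ES_URL = "http://localhost:9200"

def cluster_status():
    """Return the status field from Elasticsearch's _cluster/health API."""
    with urllib.request.urlopen(f"{ES_URL}/_cluster/health") as resp:
        return json.load(resp)["status"]

def wait_until_green(poll_seconds=30):
    """Poll until all shards have recovered (status 'green')."""
    while True:
        status = cluster_status()
        print(f"cluster status: {status}")
        if status == "green":
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_until_green()
    print("safe to restart the next node")
```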

Edit 01:00 The database master is being restarted, causing a short downtime for the UI. Data collection should be unaffected.

Edit 01:11 The database master is back up. Due to the quick succession of reboots, and because one of the cache servers was not primed before the database went down, a subset of data collection requests failed with errors between 00:57 and 01:11, and some of this data was sadly lost.
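For context, "priming" here means warming the cache ahead of a planned database restart so that reads served during the downtime do not fall through to the unavailable database. A rough, purely illustrative sketch of that idea (all names and the data source are hypothetical):

```python
def load_hot_records():
    # Stand-in for reading the most frequently requested records from the database.
    return {"site:123:config": {"tracking": True}, "site:456:config": {"tracking": False}}

def prime_cache(cache):
    """Copy the hot records into the cache ahead of the maintenance window."""
    for key, value in load_hot_records().items():
        cache[key] = value
    return len(cache)

if __name__ == "__main__":
    cache = {}  # stand-in for a real cache client such as memcached or Redis
    print(f"primed {prime_cache(cache)} keys before the database restart")
```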

Edit 01:28 All critical services have now been rebooted. There are a few machines left, but we don't expect any further interruptions. We will keep the incident open until our hosting provider SysEleven gives the "all ok", probably this morning at 08:00 Europe/Berlin time.

Edit 05:37 All clear.