Thursday 31st January 2019

Data Collection: Restart of hardware nodes could lead to short downtime

We are very sorry that this routine maintenance task turned into a 2 hour 15 minute outage during the night. We have identified several improvements that we are now implementing to avoid this problem in the future.

Our hosting provider is restarting all hardware nodes for upgrades throughout the night (Europe/Berlin timezone), which could lead to short outages of a few seconds or at most a few minutes. We hope the work will be finished as early in the night as possible.

Edit 7:20 We discovered a major problem rebooting one of our workers that has led to a halt in processing. We have started investigating.

Edit 7:45 As a first step, we got processing working again and are seeing current monitoring and tracing data being imported. We now turn to understanding what went wrong and its impact.

Edit 8:15 After a first investigation we found that in our cluster of beanstalkd queue servers, one node has been stuck in a disk file check for hours and the second was unable to start the beanstalkd process. As a consequence, both queue servers were unavailable between 05:00 and 07:15 Europe/Berlin time, and no monitoring or tracing data could be stored during this period. We continue to investigate why both nodes of the cluster failed.
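
For illustration, a minimal health check along these lines can confirm whether each node in the cluster still answers beanstalkd's plain-text protocol. This is a sketch, not our actual tooling; the host names and port are assumptions.

    # Minimal beanstalkd health check over its plain-text protocol (sketch).
    import socket

    QUEUE_NODES = [("queue1.example.internal", 11300),
                   ("queue2.example.internal", 11300)]  # hypothetical hosts

    def beanstalkd_is_up(host, port, timeout=2.0):
        """Return True if beanstalkd answers the 'stats' command on host:port."""
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(b"stats\r\n")
                # A healthy daemon replies "OK <bytes>\r\n" followed by YAML stats.
                return sock.recv(64).startswith(b"OK")
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in QUEUE_NODES:
            print(host, port, "up" if beanstalkd_is_up(host, port) else "DOWN")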

Edit 10:15 We have identified and fixed a software bug that caused the workers to stop with an error under certain conditions when one of the connected servers was unavailable. This led to much slower processing of jobs between 2:45 Europe/Berlin, when one queue server was restarted, and around 05:00 Europe/Berlin, when the second queue server became unavailable as well. This does not yet explain why the second queue server suddenly crashed and stopped working; we will continue to investigate.
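
To illustrate the shape of the fix, here is a minimal sketch, not our production code, of a worker that skips an unreachable queue node instead of stopping with an error. The connect and reserve helpers are placeholders for whatever client library is in use.

    # Sketch of the failure mode behind the fix: skip an unreachable queue node
    # instead of letting the connection error stop the whole worker.
    # `connect`, `reserve_job` and `process_job` are placeholders passed in by
    # the caller; they stand for whatever beanstalkd client library is used.
    import logging
    import socket

    log = logging.getLogger("worker")

    def drain_jobs(nodes, connect, reserve_job, process_job):
        """Process jobs from every reachable queue node, skipping dead ones."""
        for host, port in nodes:
            try:
                client = connect(host, port)
            except (OSError, socket.timeout):
                # Before the fix an unhandled error here stopped the worker;
                # now the unreachable node is logged and the loop moves on.
                log.warning("queue node %s:%s unavailable, skipping", host, port)
                continue
            while True:
                job = reserve_job(client, timeout=1)
                if job is None:   # nothing waiting on this node right now
                    break
                process_job(job)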

Edit 11:00 We have pieced together that at 5:00 beanstalkd on the second, still active node came under memory pressure and the Linux out-of-memory killer shut the queue service down. monit was then unable to restart it until more memory was provided to the server during our recovery work at 7:34.
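
As an illustration of how such a kill can be confirmed after the fact, the following sketch, assuming a Linux host, scans the kernel log for the "Killed process" lines the out-of-memory killer emits. It is a diagnostic aid for illustration, not part of our stack.

    # Sketch: scan the kernel log for OOM kills of beanstalkd (Linux only).
    import subprocess

    def find_oom_kills(process_name="beanstalkd"):
        """Return kernel log lines showing the OOM killer terminating the process."""
        dmesg = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
        return [line for line in dmesg.stdout.splitlines()
                if "Killed process" in line and process_name in line]

    if __name__ == "__main__":
        for line in find_oom_kills():
            print(line)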