Tuesday 23rd February 2016

Data Collection Downtime Tonight

We suffered a 60-minute downtime tonight, during which no traces reached Tideways, before the system recovered automatically and all services were operational again. We are investigating why our automatic failovers and circuit breakers didn't prevent this from happening.

Update 11:26 At first glance, it looks like a network problem inside our Elasticsearch cluster was responsible for the downtime. We are looking into the specifics and why our failsafe shutdown of workers didn't handle this scenario automatically.

Update 12:32 We have identified a bug where our workers didn't react correctly to bad Elasticsearch cluster health and kept writing data to the cluster as if everything was ok. A fix is now in testing and will be rolled out over the course of the day.
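
To illustrate the kind of check involved, here is a minimal sketch of a worker-side guard that consults Elasticsearch's `_cluster/health` API before writing. It is not the actual Tideways worker code: the function names, the `WRITABLE_STATUSES` policy, and the choice to treat any health-check error as unhealthy are assumptions made for this example.

```python
import json
import urllib.request

# Hypothetical policy: cluster states considered safe for writes.
WRITABLE_STATUSES = {"green", "yellow"}


def fetch_cluster_status(base_url, timeout=2.0):
    """Return the 'status' field reported by Elasticsearch's _cluster/health API."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["status"]


def should_write(get_status):
    """Return True only if the cluster reports a writable status.

    Any error while checking health (timeout, refused connection, malformed
    response) is treated as unhealthy, so the worker pauses instead of
    writing as if everything was ok.
    """
    try:
        return get_status() in WRITABLE_STATUSES
    except Exception:
        return False
```

A worker could call `should_write(lambda: fetch_cluster_status("http://localhost:9200"))` before each batch and back off while it returns `False`. The URL here is a placeholder.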

Update 14:04 The incident is closed. We have found fixes to prevent this from happening in the future. We are very sorry for any inconvenience this has caused you.