We suffered a 60-minute downtime tonight, during which no traces reached Tideways, before the system recovered automatically and all services were operational again. We are investigating why our automatic failovers and circuit breakers did not prevent this.
At first glance, a network problem inside our Elasticsearch cluster appears to have been responsible for the downtime. We are looking into the specifics and into why the failsafe shutdown of our workers did not handle this scenario automatically.
We have identified a bug where our workers did not react correctly to bad Elasticsearch cluster health and kept writing data to the cluster as if everything was OK. A fix is now in testing and will be rolled out over the course of the day.
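For readers who want a picture of the kind of check involved, here is a minimal sketch of a health-gated write loop. It is an illustration only and not our actual worker code: the names `cluster_is_writable`, `drain_batches`, `fetch_health` and `write_batch` are hypothetical. The only real interface assumed is the Elasticsearch cluster health API, whose response carries a `status` of `green`, `yellow` or `red` and a `timed_out` flag.

```python
from typing import Callable, Iterable, List

# Statuses under which this sketch allows writes (assumption: yellow is acceptable).
WRITABLE_STATUSES = {"green", "yellow"}


def cluster_is_writable(health: dict) -> bool:
    """Return True only if the reported health allows writes.

    A missing or unknown status counts as unhealthy (fail closed).
    """
    status_ok = health.get("status") in WRITABLE_STATUSES
    return status_ok and not health.get("timed_out", False)


def drain_batches(
    batches: Iterable[list],
    fetch_health: Callable[[], dict],
    write_batch: Callable[[list], None],
) -> List[list]:
    """Write each batch only while the cluster is healthy.

    Returns the batches that were held back so the caller can retry them.
    """
    held_back: List[list] = []
    for batch in batches:
        try:
            healthy = cluster_is_writable(fetch_health())
        except Exception:
            # An unreachable health endpoint is treated like a red cluster.
            healthy = False
        if healthy:
            write_batch(batch)
        else:
            held_back.append(batch)
    return held_back
```

The point of the sketch is that the health check fails closed: an unknown status, a timed-out response or an unreachable endpoint all stop writes, and the affected batches are kept for a later retry instead of being written as if everything was OK.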
The incident is closed. We have found fixes to prevent this from happening in the future. We are very sorry for any inconvenience this has caused you.