There are a number of lessons to learn from this outage, and we are now working on measures to ensure it does not happen again. Here are some of the steps we are taking:
When making large changes, we need to do more realistic testing that better simulates real workloads. This is tricky to get right, but if we had done better load testing, we would likely have caught this issue earlier (a minimal load-testing sketch follows this list).
We have set up a public status page (status.screenly.io), which shows the status of our systems in real time based on third-party monitoring. The page also includes information about outages.
We have already started collecting more application metrics (using Prometheus) so that we can spot trends earlier and find bottlenecks more quickly (see the instrumentation sketch after this list).
Configure better health checks and resource limits on our back-end components, so that even if a traffic spike hits, the system copes with it automatically (see the health-check sketch below).
Set up automatic scaling of the cluster, so that the system can add more resources whenever they are needed (see the scaling sketch below).
Ensure that our alerting system works, so that an engineer is alerted (or even woken up) if there is an issue.
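To make the load-testing point concrete, here is a minimal sketch of the kind of test we have in mind: spin up many concurrent simulated devices against a staging endpoint and watch timeout counts and tail latency. The endpoint, concurrency numbers, and the choice of aiohttp are illustrative assumptions, not our actual test harness.

```python
import asyncio
import time

import aiohttp

# Hypothetical staging endpoint; substitute your own environment.
TARGET = "https://staging.example.com/api/v1/devices/heartbeat"
CONCURRENCY = 500    # simulated devices checking in at once
REQUESTS_EACH = 20   # requests each simulated device sends

async def device(session: aiohttp.ClientSession, results: list) -> None:
    """Simulate one device repeatedly phoning home."""
    for _ in range(REQUESTS_EACH):
        start = time.monotonic()
        try:
            async with session.get(
                TARGET, timeout=aiohttp.ClientTimeout(total=10)
            ) as resp:
                await resp.read()
                results.append((resp.status, time.monotonic() - start))
        except (asyncio.TimeoutError, aiohttp.ClientError):
            results.append((None, time.monotonic() - start))

async def main() -> None:
    results: list = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(device(session, results)
                               for _ in range(CONCURRENCY)))
    failures = sum(1 for status, _ in results if status is None)
    latencies = sorted(t for _, t in results)
    print(f"requests={len(results)} failures={failures} "
          f"p95={latencies[int(len(latencies) * 0.95)]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```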
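For the Prometheus point, a typical instrumentation pattern looks like the sketch below, using the standard prometheus_client library. The metric and endpoint names are hypothetical, and the handler is a stand-in for a real one.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric and endpoint names; adjust to your own services.
REQUESTS = Counter(
    "api_requests_total", "API requests served", ["endpoint", "status"])
LATENCY = Histogram(
    "api_request_duration_seconds", "API request latency", ["endpoint"])

def handle_heartbeat() -> None:
    """Stand-in for a real request handler."""
    with LATENCY.labels(endpoint="/heartbeat").time():
        time.sleep(random.uniform(0.01, 0.05))  # simulated work
        REQUESTS.labels(endpoint="/heartbeat", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes :8000/metrics
    while True:
        handle_heartbeat()
```

With counters and latency histograms in place, both trend-spotting and bottleneck hunting become queries against the scraped data rather than guesswork.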
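On health checks: as the footnote at the end explains, a TCP load balancer that lacked one kept routing a percentage of traffic to a backend that could only time out. A minimal HTTP health endpoint (Flask here purely as an example; database_reachable is a hypothetical dependency check) lets the balancer take an unhealthy instance out of rotation instead.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable() -> bool:
    """Hypothetical dependency check; in practice, ping the real
    dependency with a short timeout so the probe itself cannot hang."""
    return True

@app.route("/healthz")
def healthz():
    """Probe endpoint for the load balancer. Returning a non-200 status
    takes this instance out of rotation, instead of letting the balancer
    forward traffic to a backend that will only time out."""
    if not database_reachable():
        return jsonify(status="unhealthy"), 503
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    app.run(port=8080)
```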
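And on automatic scaling, the core decision is usually a simple proportional rule, the same idea behind Kubernetes' Horizontal Pod Autoscaler. A sketch, with the target utilization and replica cap as illustrative parameters:

```python
import math

def desired_replicas(current: int, utilization: float,
                     target: float = 0.6, max_replicas: int = 20) -> int:
    """Proportional scaling rule: grow or shrink the replica count by the
    ratio of observed to target utilization."""
    if utilization <= 0:
        return current  # no signal; hold steady
    return max(1, min(max_replicas,
                      math.ceil(current * utilization / target)))

# 4 replicas at 90% CPU against a 60% target -> ceil(4 * 0.9 / 0.6) = 6
print(desired_replicas(4, 0.90))  # 6
```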
The good news, however, is that even though our back-end was inaccessible for a few hours, the players were not impacted at all. We designed Screenly from day one to survive outages, so while users were unable to make changes to their content, there was no impact on actual playback.
** We don't know exactly what N was, as those requests never hit our load balancers. All we could see was that a good amount of traffic was still flowing through our load balancers during this period. The configuration difference was a missing health check on a TCP load balancer, which is why a percentage of the traffic was incorrectly routed and timed out.