OHSHIT REPORT: meh.com site outages at midnight launch


Background

When something goes terribly wrong here at Mediocre Laboratories we draft internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to share details across the company about what went wrong, coordinate work across teams on how we're going to fix it, and dive deep into root causes to help make sure that once the issue is fixed it doesn't happen again.

The entire company pays careful attention to these OHSHIT REPORTS, partly because they're sometimes hilarious (in a face-palm way), but also because everyone gets to learn from other employees' mistakes and misfortunes.

Examples of recent OHSHIT REPORT titles include:

I thought I'd publicly share a recent OHSHIT REPORT I wrote up about the ongoing site performance issues we've been seeing during midnight launches as we go through some growing pains. Is it crazy to reach this level of transparency? Dunno, but Mediocre Laboratories is all about testing things out. So, here goes...

Summary

As of 11/14/2014, 4 of the past 7 event launches on meh.com have experienced increased response times, increased error rates, and site outages.

Details

Here is a graph that represents the average meh.com application response time over the past 7 days. The x-axis is represented in Eastern Time. The y-axis is measured in milliseconds. Each pink bar represents a site outage.

Here is a graph of the meh.com error rate over the past 7 days. Again, the x-axis is represented in Eastern Time. The y-axis is the percentage of requests made to our site that ended up in an error. Each pink bar represents a site outage.

You'll notice the error rates for the 11/10, 11/12, and 11/13 outages are significantly smaller than for the 11/14 (Fukubukuro) outage. One thing to point out is that the meh.com application has to actually be in a running state in order to report errors to the New Relic Application Performance Monitoring service we use. The site was mostly offline on 11/12 and 11/13. It was more available, but also reporting more errors, on 11/14.

So what is the application doing around these site outages?

The green bumps you see are external service calls to our Analytics Service which is responsible for keeping track of the number of unique visitors each day and other related stats you'll find at the bottom of the home page.

The big purple spike is external service calls to our Order Service which is responsible for, well, placing orders... which obviously fell over during the Fukubukuro deal.

Analytics Service

The meh.com application logs data about each request to our InfluxDB time series database using the influx-udp module we wrote and open sourced (https://github.com/mediocre/node-influx-udp). The Analytics Service is a REST service that sits in front of InfluxDB to perform queries and manages downsampled variants of large amounts of data. InfluxDB writes are very fast. Its reads, you'll see from the following graphs, not so much. We try to cache these reads heavily.
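To illustrate the kind of caching we mean (this is a hypothetical sketch, not our actual code; the function names and TTLs are made up): a simple read-through cache in front of a slow Analytics Service query might look like this:

```javascript
// Hypothetical sketch of a read-through cache in front of a slow
// InfluxDB read. Names and TTLs are illustrative, not production code.
var cache = {};

function cachedQuery(key, ttlMs, queryFn, callback) {
    var entry = cache[key];

    // Serve from cache while the entry is still fresh.
    if (entry && Date.now() - entry.timestamp < ttlMs) {
        return callback(null, entry.value);
    }

    // Otherwise hit the (slow) backing store and cache the result.
    queryFn(function(err, value) {
        if (err) {
            return callback(err);
        }

        cache[key] = { value: value, timestamp: Date.now() };
        callback(null, value);
    });
}
```

Something like `cachedQuery('unique-visitors:2014-11-14', 60000, fetchFromInflux, done)` would then only pay for the InfluxDB read once a minute, no matter how many visitors hit the home page.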

Order Service

The Order Service is responsible for placing orders. Credit card transactions are handled by Stripe and order status, order history, and tracking numbers are stored in MongoDB. We obviously crashed pretty hard when we put up Fukubukuro #3.
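To sketch the flow (hypothetical names and fields; our actual Order Service schema differs): the service charges the card through Stripe, then persists an order document to MongoDB. A stripped-down version of the record might be built like this:

```javascript
// Hypothetical sketch of the Order Service flow: charge the card with
// Stripe, then store the order in MongoDB. Field names are illustrative,
// not our actual schema.
function buildOrderDocument(charge, items) {
    return {
        stripeChargeId: charge.id,
        amount: charge.amount,     // in cents, as Stripe reports it
        currency: charge.currency,
        items: items,
        status: 'placed',
        createdAt: new Date()
    };
}

// In the real flow, something along these lines would run after Stripe
// responds (ordersCollection being a MongoDB collection):
//
// stripe.charges.create({ amount: 500, currency: 'usd', source: token }, function(err, charge) {
//     if (err) { return callback(err); }
//     ordersCollection.insert(buildOrderDocument(charge, items), callback);
// });
```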

Action Items

Here's where we list a few of the things we've been working on to improve these performance issues.

This is really just the tip of the iceberg. Behind the scenes we're constantly tweaking timeout values, TCP socket keep-alive settings, and cache key expiry values. We're checking database indexes, upgrading various dependencies, and combing through error logs.

Summary

Whew, if you've made it through all that technical mumbo jumbo, congratulations. This is the touchy-feely part of the post where I ask you to stick with us through these growing pains. I hope the details above help you appreciate that there are a lot of moving pieces in play here. Don't oversimplify this into us just trying to be cheap and running the site on shitty hardware to save a buck (although we are also cheap).

I can't tell you how rewarding it is to work on a thing that so many people are fans of. These are challenging technical hurdles to overcome but it helps keep me going knowing that once we get these problems solved @Stallion will no longer be yelling at us to GET MOAR SERVERS, NOW!