OHSHIT REPORT: meh.com site outages at midnight launch


Background

When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes to help make sure once the issue is fixed it doesn't happen again.

The entire company pays careful attention to these OHSHIT REPORTS partly because they're sometimes hilarious (in a face-palm way) but also because everyone gets to learn from other employees' mistakes and misfortunes.

Examples of recent OHSHIT REPORT titles include:

  • OHSHIT REPORT: Future deal posted to Facebook and Twitter with the price of $Infinity
  • OHSHIT REPORT: We shipped all our customers the Turtle Beach Call of Duty MW3 headsets instead of the Razer Tiamat 2.2 Gaming Headset
  • OHSHIT REPORT: Increased error rates on meh.com @ 10/16/2014 12:00:00 AM ET

I thought I'd publicly share a recent OHSHIT REPORT I wrote up about the ongoing site performance issues we've been seeing during midnight launches as we go through some growing pains. Is it crazy to reach this level of transparency? Dunno, but Mediocre Laboratories is all about testing things out. So, here goes...

Summary

As of 11/14/2014, 4 of the past 7 event launches on meh.com have experienced increased response times, increased error rates, and site outages.

Details

Here is a graph that represents the average meh.com application response time over the past 7 days. The x-axis is represented in Eastern Time. The y-axis is measured in milliseconds. Each pink bar represents a site outage.

Here is a graph of the meh.com error rate over the past 7 days. Again, the x-axis is represented in Eastern Time. The y-axis is the percentage of requests made to our site that ended up in an error. Each pink bar represents a site outage.

You'll notice the error rates for the 11/10, 11/12, and 11/13 outages are significantly smaller than for the 11/14 (Fukubukuro) outage. One thing to point out is that the meh.com application has to actually be in a running state in order to report errors to the New Relic Application Performance Monitoring service we use. The site was mostly offline on 11/12 and 11/13. It was more available, but also reporting more errors, on 11/14.

So what is the application doing around these site outages?

The green bumps you see are external service calls to our Analytics Service, which is responsible for keeping track of the number of unique visitors each day and other related stats you'll find at the bottom of the home page.

The big purple spike represents external service calls to our Order Service, which is responsible for, well, placing orders... which obviously fell over during the Fukubukuro deal.

Analytics Service

The meh.com application logs data about each request to our InfluxDB time series database using the influx-udp module we wrote and open sourced (https://github.com/mediocre/node-influx-udp). The Analytics Service is a REST service that sits in front of InfluxDB to perform queries and manage downsampled variants of large amounts of data. InfluxDB writes are very fast. Its reads, as you'll see from the following graphs, not so much. We try to cache these reads heavily.
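
To make that concrete, here's a rough sketch of what that write path can look like using Node's built-in dgram module (node-influx-udp wraps this kind of fire-and-forget write; the series name, columns, host, and port below are made up for illustration):

    var dgram = require('dgram');
    var socket = dgram.createSocket('udp4');

    function logRequest(path, durationInMs) {
        // Payload roughly in the JSON shape InfluxDB's UDP listener accepts.
        // The series name and columns here are hypothetical.
        var payload = JSON.stringify([{
            name: 'requests',
            columns: ['path', 'duration'],
            points: [[path, durationInMs]]
        }]);

        var message = new Buffer(payload);

        // UDP is fire-and-forget: a dropped point never slows down a request.
        socket.send(message, 0, message.length, 4444, 'influxdb.example.com');
    }

Writes like this stay cheap; it's querying that data back out that needs the heavy caching.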

Order Service

The Order Service is responsible for placing orders. Credit card transactions are handled by Stripe and order status, order history, and tracking numbers are stored in MongoDB. We obviously crashed pretty hard when we put up Fukubukuro #3.
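
In rough strokes, the flow looks something like this (the field names, collection name, and test values below are made up for illustration, not our actual code):

    var stripe = require('stripe')('sk_test_your_key');
    var MongoClient = require('mongodb').MongoClient;

    function placeOrder(db, order, callback) {
        // Stripe handles the credit card; amounts are in cents.
        stripe.charges.create({
            amount: order.totalInCents,
            currency: 'usd',
            card: order.stripeToken,        // token created client-side with Stripe.js
            description: 'meh.com order'
        }, function(err, charge) {
            if (err) return callback(err);  // declined card, network hiccup, etc.

            // Order status, history, and (later) tracking numbers live in MongoDB.
            db.collection('orders').insert({
                chargeId: charge.id,
                status: 'placed',
                createdAt: new Date()
            }, callback);
        });
    }

    MongoClient.connect('mongodb://localhost/meh', function(err, db) {
        if (err) throw err;

        placeOrder(db, { totalInCents: 500, stripeToken: 'tok_visa' }, console.log);
    });

When everyone checks out at the same moment, all of that work piles up at once.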

Action Items

Here's where we list a few of the things we've been working on to improve these performance issues.

  • Double-checked locking around cache fetches (http://en.wikipedia.org/wiki/Double-checked_locking)
    We've been seeing a stampede of requests to the Analytics Service on a cache miss when loading stats. To ensure only one call is made to the Analytics Service on a cache miss, we added a double-checked lock to our open source cache library: petty-cache (https://github.com/mediocre/petty-cache). A sketch of the pattern appears after this list. We deployed this change on 11/12/2014.

  • Prime cache with tomorrow's stats.
    We know what tomorrow's data is going to be, and we already prime the cache with tomorrow's deal, video, poll, etc. We uncovered a couple of stats that weren't being primed, which became cache misses when a new deal launched. We deployed this fix on 11/12/2014 and then discovered an additional and more subtle bug that we fixed on 11/16/2014.

  • Allow page to render with incomplete data.
    If the site can't load non-critical data (poll, video, forum topics, stats, meh faces) we still want to allow the home page to render without those sections. We introduced a regression that prevented the page from loading when there are no order stats. We deployed a fix on 11/15/2014.

  • Anonymous output cache.
    For anonymous users (about 67% of our overall traffic) we don't need to render different HTML for the home page. We don't need to show them whether they clicked the meh button, which meh buttons they've clicked over the past five months, or whether they've voted in the poll, and we don't need to change the account icon in the header. Rather than fetch all the home page data over and over again, we can bypass that and cache the rendered HTML for the next anonymous user. A sketch of this appears after the list as well. We deployed this on 11/15/2014.

  • Lazy load deal photo gallery and meh face images.
    This isn't necessarily related to midnight launch performance on our servers, but should help the homepage appear to load slightly faster in your browser. We deployed this on 11/15/2014.
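
Here's a rough sketch of the double-checked locking pattern from the first action item (this isn't petty-cache's actual code; it assumes Redis via the node_redis client, and the key names, timings, and TTLs are made up):

    var redis = require('redis');
    var client = redis.createClient();

    function fetch(key, fetchFromService, callback) {
        // First check: is the value already cached?
        client.get(key, function(err, value) {
            if (err) return callback(err);
            if (value) return callback(null, JSON.parse(value));

            // Cache miss: take a short-lived lock so only one caller hits the
            // upstream service. SET NX fails if someone else already holds it.
            client.set('lock:' + key, '1', 'NX', 'PX', 5000, function(err, locked) {
                if (err) return callback(err);

                if (!locked) {
                    // Someone else is fetching; wait a beat and start over,
                    // which re-checks the cache -- the "double check".
                    return setTimeout(fetch.bind(null, key, fetchFromService, callback), 100);
                }

                // We hold the lock: fetch once, populate the cache, release the lock.
                // (If the fetch fails, the lock simply expires after 5 seconds.)
                fetchFromService(function(err, data) {
                    if (err) return callback(err);

                    client.setex(key, 30, JSON.stringify(data), function() {
                        client.del('lock:' + key);
                        callback(null, data);
                    });
                });
            });
        });
    }

The point is that on a cache miss only the lock holder talks to the Analytics Service; everyone else waits a beat and finds the value in cache on their next check.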
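
And here's a rough sketch of the anonymous output cache idea (Express-style middleware shown for illustration; the cookie check, cache TTL, and in-memory storage are simplifications, not our actual implementation):

    var cachedHtml = null;
    var cachedAt = 0;

    // Mount this only on the home page route, e.g.:
    // app.get('/', anonymousOutputCache, renderHomePage);
    function anonymousOutputCache(req, res, next) {
        var isAnonymous = !req.headers.cookie || req.headers.cookie.indexOf('session=') === -1;

        // Logged-in users get personalized HTML, so skip the cache entirely.
        if (!isAnonymous) return next();

        // Serve the previously rendered HTML if it's still fresh.
        if (cachedHtml && Date.now() - cachedAt < 60 * 1000) {
            return res.send(cachedHtml);
        }

        // Otherwise capture whatever HTML gets rendered for the next anonymous visitor.
        var send = res.send.bind(res);
        res.send = function(html) {
            cachedHtml = html;
            cachedAt = Date.now();
            return send(html);
        };

        next();
    }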

This is really just the tip of the iceberg. Behind the scenes we're constantly tweaking timeout values, TCP socket keep-alive settings, and cache key expiry values. We're checking database indexes, upgrading various dependencies, and combing through error logs.

Summary

Whew, if you've made it through all that technical mumbo jumbo, congratulations. This is the touchy-feely part of the post where I ask you to stick with us through these growing pains. I hope you can take away from all the details above that there are a lot of moving pieces in play here. Please don't write this off as us just trying to be cheap and run the site on shitty hardware to save a buck (although we are also cheap).

I can't tell you how rewarding it is to work on a thing that so many people are fans of. These are challenging technical hurdles to overcome but it helps keep me going knowing that once we get these problems solved @Stallion will no longer be yelling at us to GET MOAR SERVERS, NOW!