OHSHIT REPORT: meh.com site outages at midnight launch
Background
When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes to help make sure once the issue is fixed it doesn't happen again.
The entire company pays careful attention to these OHSHIT REPORTS partly because they're sometimes hilarious (in a face-palm way) but also because everyone gets to learn from other employees' mistakes and misfortunes.
Examples of recent OHSHIT REPORT titles include:
- OHSHIT REPORT: Future deal posted to Facebook and Twitter with the price of $Infinity
- OHSHIT REPORT: We shipped all our customers the Turtle Beach Call of Duty MW3 headsets instead of the Razer Tiamat 2.2 Gaming Headset
- OHSHIT REPORT: Increased error rates on meh.com @ 10/16/2014 12:00:00 AM ET
I thought I'd publicly share a recent OHSHIT REPORT I wrote up about the ongoing site performance issues we're seeing during midnight launches as we go through some growing pains. Is it crazy to reach this level of transparency? Dunno, but Mediocre Laboratories is all about testing things out. So, here goes...
Summary
As of 11/14/2014, 4 of the past 7 event launches on meh.com have experienced increased response times, increased error rates, and site outages.
Details
Here is a graph that represents the average meh.com application response time over the past 7 days. The x-axis is represented in Eastern Time. The y-axis is measured in milliseconds. Each pink bar represents a site outage.
Here is a graph of the meh.com error rate over the past 7 days. Again, the x-axis is represented in Eastern Time. The y-axis is the percentage of requests made to our site that ended up in an error. Each pink bar represents a site outage.
You'll notice the error rates for the 11/10, 11/12, and 11/13 outages are significantly smaller than for the 11/14 (Fukubukuro) outage. One thing to point out is that the meh.com application has to actually be in a running state in order to report errors to the New Relic Application Performance Monitoring service we use. The site was mostly offline on 11/12 and 11/13. It was more available but also reporting more errors on 11/14.
So what is the application doing around these site outages?
The green bumps you see are external service calls to our Analytics Service which is responsible for keeping track of the number of unique visitors each day and other related stats you'll find at the bottom of the home page.
The big purple spike is external service calls to our Order Service which is responsible for, well, placing orders... which obviously fell over during the Fukubukuro deal.
Analytics Service
The meh.com application logs data about each request to our InfluxDB time series database using the influx-udp module we wrote and open sourced (https://github.com/mediocre/node-influx-udp). The Analytics Service is a REST service that sits in front of InfluxDB, performs queries, and manages downsampled variants of large amounts of data. InfluxDB writes are very fast. Its reads, you'll see from the following graphs, not so much. We try to cache these reads heavily.
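To make the write path a little more concrete, here is a rough sketch of fire-and-forget request logging over UDP using Node's built-in dgram module. It is not the influx-udp module's actual API. The function names, the series name, the port, and the JSON payload shape (series name, columns, points) are all assumptions for illustration. The part that holds for any UDP write is that the sender never waits for a reply, so logging adds almost nothing to the request.

```javascript
const dgram = require('dgram');

// Assumed payload shape: one series with column names and a single row of values.
// The series name and fields are made up for illustration.
function buildPayload(seriesName, fields) {
  const columns = Object.keys(fields);
  const point = columns.map((column) => fields[column]);
  return JSON.stringify([{ name: seriesName, columns, points: [point] }]);
}

// Fire-and-forget: send one datagram, then close the socket.
// The host and port defaults are hypothetical.
function logRequest(fields, host = '127.0.0.1', port = 4444) {
  const socket = dgram.createSocket('udp4');
  const message = Buffer.from(buildPayload('requests', fields));
  socket.send(message, port, host, () => socket.close());
}
```

A call such as logRequest({ path: '/', status: 200, ms: 87 }) returns immediately whether or not anything is listening on the other end, which is the trade-off: cheap writes, no delivery guarantee.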
Order Service
The Order Service is responsible for placing orders. Credit card transactions are handled by Stripe, and order status, order history, and tracking numbers are stored in MongoDB. We obviously crashed pretty hard when we put up Fukubukuro #3.
Action Items
Here's where we list a few of the things we've been working on to improve these performance issues.
Double-checked locking around cache fetches (http://en.wikipedia.org/wiki/Double-checked_locking)
We've been seeing a stampede of requests to the Analytics Service on a cache miss when loading stats. In order to ensure only one call is made to the Analytics Service on a cache miss we added a double-checked lock to our open source cache library: petty-cache (https://github.com/mediocre/petty-cache). We deployed this change on 11/12/2014.
Prime cache with tomorrow's stats
We know what tomorrow's data is going to be and we already prime the cache with tomorrow's deal, video, poll, etc. We uncovered a couple stats that weren't being primed and those become a cache miss when a new deal launches. We deployed this fix on 11/12/2014 and then discovered an additional and more subtle bug that we fixed on 11/16/2014.
Allow page to render with incomplete data
If the site can't load non-critical data (poll, video, forum topics, stats, meh faces) we still want to allow the home page to render without those sections. We introduced a regression that prevented the page from loading when there are no order stats. We deployed a fix on 11/15/2014.
Anonymous output cache
For anonymous users (about 67% of our overall traffic) we don't need to render different HTML for the homepage. We don't need to show them if they clicked the meh button, or which meh buttons they have clicked over the past five months, or if they've voted in the poll, or change the account icon in the header. Rather than fetch all the home page data over and over again we can bypass that and cache the rendered HTML for the next anonymous user. We deployed this on 11/15/2014.
Lazy load deal photo gallery and meh face images
This isn't necessarily related to midnight launch performance on our servers, but should help the homepage appear to load slightly faster in your browser. We deployed this on 11/15/2014.
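For the curious, the double-checked lock from the first action item can be sketched roughly like this. It is a minimal, single-process illustration and not petty-cache's actual code or API. The names (cache, locks, fetchWithLock) are made up, and a real deployment across several servers would presumably need a shared lock rather than an in-memory one. The idea: check the cache, take a per-key lock on a miss, then check the cache again before doing the expensive call, because another request may have filled it while we waited.

```javascript
// Hypothetical in-memory stores; a real setup would likely use a shared cache.
const cache = new Map(); // key -> { value, expiresAt }
const locks = new Map(); // key -> promise that settles when the holder releases

function readCache(key) {
  const entry = cache.get(key);
  return entry && entry.expiresAt > Date.now() ? entry : undefined;
}

// Wait until nobody holds the lock for this key, then take it.
async function acquire(key) {
  while (locks.has(key)) {
    await locks.get(key);
  }
  let release;
  locks.set(key, new Promise((resolve) => { release = resolve; }));
  return () => {
    locks.delete(key);
    release();
  };
}

async function fetchWithLock(key, ttlMs, loader) {
  // First check: a cache hit needs no lock at all.
  let entry = readCache(key);
  if (entry) return entry.value;

  const release = await acquire(key);
  try {
    // Second check: someone else may have loaded the value while we waited.
    entry = readCache(key);
    if (entry) return entry.value;

    const value = await loader(); // the one expensive call per miss
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  } finally {
    release();
  }
}
```

With this in place, a burst of simultaneous requests for the same key results in a single loader call. Everyone else waits on the lock and then reads the freshly cached value.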
This is really just the tip of the iceberg. Behind the scenes we're constantly tweaking timeout values, TCP socket keep-alive settings, and cache key expiry values. We're checking database indexes, upgrading various dependencies, checking error logs.
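As one more illustration, the anonymous output cache from the action items might look something like the sketch below. This is a guess at the general shape and not our production code. The function name, the 5 second expiry, and the render callback are all assumptions. Signed-in users always get a fresh, personalized render, while anonymous users share one cached copy of the HTML until it expires.

```javascript
// Hypothetical module-level cache holding one rendered page for all anonymous users.
let cachedHtml = null;
let cachedUntil = 0;

// user is null for anonymous visitors; render(user) builds the full HTML.
// ttlMs and now are parameters so the expiry behaviour is easy to reason about.
function renderHomePage(user, render, ttlMs = 5000, now = Date.now()) {
  if (user) {
    // Personalized page: never served from the shared cache.
    return render(user);
  }
  if (cachedHtml === null || now >= cachedUntil) {
    // Cache is empty or stale: render once and keep the result.
    cachedHtml = render(null);
    cachedUntil = now + ttlMs;
  }
  return cachedHtml;
}
```

The expiry is the main knob here. A short one keeps stats reasonably fresh while still absorbing most of a midnight burst of anonymous traffic.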
Summary
Whew, if you've made it through all that technical mumbo jumbo, congratulations. This is the touchy-feely part of the post where I ask you to stick with us through these growing pains. I hope you can take away from all the details above that there are lots of moving pieces in play here. Don't oversimplify and assume that we're just trying to be cheap and run the site on shitty hardware to save a buck (although we are also cheap).
I can't tell you how rewarding it is to work on a thing that so many people are fans of. These are challenging technical hurdles to overcome but it helps keep me going knowing that once we get these problems solved @Stallion will no longer be yelling at us to GET MOAR SERVERS, NOW!
- 37 comments, 51 replies
I love a good acronym:
Organizational
Heuristic
Statistical
Holistic
Investigative
Transcription
@dave I went with
Optimized
Horror
Servers:
Harrison's
Inducing
Terror
@dave At our school district we have the "Informational Test Checking Heuristics Yearly Building Academic Learning Levels Standardized Across Childrens' Classes". Needless to say they're VERY touchy about everyone getting the acronym right, and using ITCHYBALLSACK instead of the correct ITCHYBALLSACC gets one a good stern talking-to.
Very, very nice. I approve.
@RicoSuave didn't get the memo?
Fun activity: compare the peaks on the response time graphs with the times provided for noise complaints involving "unbridled screaming" on the Dallas police blotter
@harrison It's @cengland0 's fault
@kadagan Odd.. The embedded video doesn't respect the "Start at 0:23" portion of the link...
First, awesome to see this much transparency about this. Every company has problems, some are better at handling them than others, but it's extremely rare to see this level of honesty.
Also, this may be a dumb idea, but how big an improvement would it be if only the past week's Meh faces showed up on the main page instead of the full calendar? Maybe move the calendar to the account page or its own page instead? Not sure if that would help enough to be worth bothering with, but I figured I'd toss it out there.
@Starblind not a dumb idea. i'm working on somewhat of a similar theory to try and optimize our meh face loading. i'd like to keep the calendars around.
@shawn I dig the monthly calendar, but would be more than okay with having previous months not visible on the landing page.
this is way over my head. so, i present this random Simpsons image:
I won't even pretend to understand any of that. But thanks for sharing it, and thanks for working hard to fix the problem.
I just tested the Anonymous output cache. First access took 2067 milliseconds, second access took 666 milliseconds. That is pretty devilish technology you got there.
I will play Mr. Obvious Man and suggest some more things you have probably already started doing.
Use one of those web page monitoring services, or set up a few hosts at remote sites on the internet. Have them periodically time how long it takes to fetch the home page for an anonymous user, a cached logged in user, and an uncached logged in user (choose one randomly from a pool of a few hundred test accounts). Produce cool reports.
It's not all that hard to figure out what's behind the data. Basically when the RSS requests route through the FOREX put calls during the server's somnambulant periods, there is a negative effect on the load-bearing flux capacitor which offsets the ordering system's specular highlights during quiet moments of repose.
This can be solved either by utilising trilinear mip-mapping across the baptismal font, Heidrich–Seidel anisotropic distribution on a Beaufort scale, or a good hard push against the dorsal architraves until one hears a tiny "yay" sound from inside the server.
@Starblind don't forget to invert the polarity of the warp coils
@Starblind quit yer fancy talkin' and go find me a left-handed smoke bender.
@Shawn I love purple.
This seems like a good place to put a UDP joke.
Me: ...
You: I didn't get the joke.
@dashcloud Me: SYN You: ...
@Homechicken That joke makes me want to say ACK with one "A"!
@SirLouie @dashcloud @Homechicken You guys certainly have a NACK for coming up with these jokes.
@shawn Don't forget to do a bit of voodoo, a rain dance, and burn sage… oh yeah and beer - drink lots of beer… that all might help too.
@Kidsandliz beer always helps, more is better.
Seeing the Staff at Meh working hard to fix issues and being open about it as well. Makes me Proud to be a VMP here [wipes tear]
@Foxborn just don't get the tear in my beer.
Thank you for putting all of that out there to us; but also for saying it in as close to layman's terms as you can get, without dumbing it down to uselessness.
But you did totally ruin my mental image of you running around, head aflame, screaming, "Something's terribly WRONG!"
@Thumperchick we never said that doesn't also happen
@Thumperchick I just pictured a server room littered with empty liquor bottles and @shawn in the corner anxiously counting down til midnight
@Kleineleh @Thumperchick i'm ashamed to admit i kinda cracked the other night and yelled at my laptop for a little bit (if that makes you feel any better)
@Thumperchick You know the situation is really in the shitter when showers of sparks come out of the blinking control panels and someone falls from the upper railing with a Wilhelm scream.
@shawn hopefully no more yelling tonight and you can get some sleep. Congrats! :)
@Thumperchick that's what Happens when @jonT goes into the server room
@Foxborn wow, that's awesome! Super accurate, I'm not technically allowed in the server room anymore. Thanks for sharing that :)
@JonT Is this your first Fan Art???
@Foxborn Wow, the first mediocre fan art!
@Foxborn now that's truly an awesome T-shirt! When does it go on sale?
@2many2no Oh that shirt is just for employees, see the beaker???
@Thumperchick I wanted to be sure you didn't lose that awesome mental image
@Foxborn It is my first! I don't count the last 2 photoshop contests where people shooped my face into weird places.
That's awesome of you guys to be so transparent. I know it's been frustrating on my end when the site won't load, but then I remember it's not like it's the end of the world. This site is fun for me, and if it doesn't load for a few minutes, so be it. I'll just check back in 20. Thanks for all you do!
This. This is awesome. (Imagine an animated gif of Orson Welles clapping, but in a totally not-sarcastic way right here.)
@neuromancer pics or it didn't happen
@shawn
Great information - thank you!
Meh has done something wrong, something terribly wrong to my head.
Looking at headlines tonight one read Cops : Men tried to ship baby body parts.
It took 4 times before I stopped seeing Cops : Meh tried to ship baby body parts.
What has Meh done to me?
@Outofmymind I thought they kept the baby arm
TL;DR: Everything went to crap when we added the clickable logo theme song.
@phatmass correction. Everything went to crap SO we added the clickable logo theme song, to distract you.
@phatmass I think the musical logo was added on Friday the 14th - the crap started well before that.
@hallmike hence why they put it up. To deter us away from realising that we use a Mediocure Web Site for daily entertainment, and really cheap shit we pass by at Walmart and Laugh
Thanks for the detailed explanation! It's nice to know that you guys are making this site better in response to our bitching :)
Nice, score some points with customers for addressing the issues and also possibly getting some free advice without having to pull the old "job interview" ploy that others do. I must applaud.
Please provide the link to the change record for these fixes. Additionally, I hope you related the changes to the problem record that was created based on the incidents from screaming users about meh.com sucking.
I am very relieved that there was a reason for the weirdness... but still utterly heartbroken because I look so forward to see what insanely stupid shit you pull out of your asshole, place in an oversized and half tore up box, kick down the stairs until some imbecile laborer picks it up and shot puts it into a truck, and decides to take the most rutted, 6 inch bumpy, back wood road, then dropped from 85,000 ft, where it lands on my doorstep, a shell of package it used to be, Fukubukuro #3 for $5. But Alas, it was not meant to be... THANKS MEH.COM FOR MY NEW ANTIDEPRESSANT ADDICTION!
Site worked just fine tonight, so thanks for that!
BOOM GOES THE DYNAMITE
@shawn Good job, everything was super smooth tonight. Loaded, ordered, confirmed, no problems or noticeable slowness.
@Starblind ... I'm waitin' for the next fukubukuro to agree with that statement. Recently all things have worked well but what about the next fuku run? I'll have to see before I present them with the exalted flying spaghetti monster award.
@unkabob spoiler alert: the site's almost certainly gonna crash on the next fukubukuro. but we'll learn some new things and adapt.
@shawn ... K...
@shawn when did you say the next fukubukuro was again?
@TaRDy next time we want to do a load test
@harrison I feel like you should try tonight....you know just to make sure you have everything working
It's all mumbo jumbo to me- but thanks a bunch for fixing it! You guys rock! :)
my head exploded reading all that but i certainly appreciate all the hard work and head banging you mehsters are doing to ensure that our experience here is, well, meh.
OMG! On that fuku graph I think I saw my personal tic.. Never got through that night but at least you guys saw my footprint and knew I were friggin' trying.
As a dev, I absolutely love the transparency. Keep it up! And great job with the improvements, site was smooth as butter tonight.
@Shawn..thank you!, site loaded quickly last night, no problems with ordering.
Did anyone else scroll past all of the geek speak to get to the good part? Thanks for putting in all of the extra graph time so that I can buy shit I don't need faster! URTHEBEST!
@kevin8er The geek speak WAS the good part!!
@kevin8er Actually, I was thinking, "oh, when he breaks it all down like that, it actually makes sense"
@Shawn so many good gifs and videos above, anything else I could post would be lame.
Cool to see you guys, while cheap, are also clever and skilled.
Oops/Officially/etc. (fill in appropriate word with O)
Here
Stands
How
I
Transgressed/Triumphed/Toopided/etc.
Thanks for the shout out, glad to see the problem is fixed.....or at least for now....We shall see what the future holds. And hey you get a pass on Fukubukuro days , servers can run like a slug but only then! :)
ROAR!!! ME LIKE MOAR SERVERS!!!!:)
You can add servers, streamline content, push stuff to CDN...or you could randomize your schedule.
When you sell out at 8:12am...stroke your egos, puff up your chests and then put up the next Meh deal. Start the next cycle to expire at 8:12am, or whenever you have a sell out.
If the Meh deal is really "Meh", and you are barely selling any...ring the gong, start selling the new Meh and send the crap back to the marketing team for improved copy. You may also have to change the price, or combine with another deal. Call it Meh, too...or maybe Meh two or Meh squared...or maybe just 'eh'.
The point is, randomize your release, to keep from hitting that dreaded peak.
@mdbirnbaum ^ = Woot-Off mode. I think we'll have that mode someday but I don't like it everyday. It may not seem like it but it lowers the bar for us to do our job. Trivia: the Woot-off itself was a community suggestion around September 2004
@mdbirnbaum What they really should do is change the launch time to 4am eastern. Just as inconvenient to get up at 4 est or stay up till 1 pst.
@neuromancer as a ESTer NNNNNNNNNNOOOOOOOOOOOOOOOOOOOOOOOOOOOO!
@Foxborn I'm sure the support staff will have the same reaction.
@snapster It looks like the 10th anniversary came and went without much notice... https://web.archive.org/web/20041022075353/http://www.woot.com/
@editorkid it was the only time I was able to snag a boc
OHSHIT REPORT: Microsoft sucks. http://thenextweb.com/microsoft/2014/11/19/microsoft-azure-suffering-widespread-outage/
@Collin1000 SOLUTION find another Um..... Er ...... Find another Azure Provider to deal with
I love purple.
@Barney Yes Purple should be a good substitute for Azure