OHSHIT REPORT: Fukubukuro 4

When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes.

I thought I'd publicly share an OHSHIT REPORT with some details about our recent Fukubukuro deal. This is going to get pretty technical but I'll wrap it up with a summary at the end.

Background

At a high level, the websites we create at Mediocre Laboratories follow a microservices-based architecture. We like to poke fun at Groupon, but their engineering team did a good job talking about the advantages of microservices over a monolithic architecture. Lots of sites are moving toward this architectural pattern. Amazon's doing it. I took Woot in this direction after the Amazon acquisition. We started Mediocre Laboratories from the beginning using microservices.

So you can think of meh.com as an application that relies on several other microservices to do its work. We have services for things like managing account creation and authentication, storing payment settings, displaying deals, showing polls and videos, placing orders, managing forum content, calculating stats, etc. Each one of these services runs on multiple servers behind a load balancer to increase performance and ensure if one server goes down the service continues to be available.
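To make the "multiple servers behind a load balancer" idea concrete, here's a minimal sketch of round-robin balancing that skips servers marked as down. This is not our actual load balancer (in production that job is done by infrastructure, not application code); the class name, method names, and server names are all hypothetical and exist only for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Stop routing traffic to a server that failed a health check.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Resume routing traffic to a known server once it recovers.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call, in rotation order.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["account-1", "account-2", "account-3"])
lb.mark_down("account-2")
print([lb.next_server() for _ in range(4)])
```

With one server marked down, requests keep rotating across the remaining two, which is the availability property described above.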

To put things into perspective, we typically run about 14 servers for our microservices. Since we're running in the cloud on Microsoft Azure we scaled up to 36 microservice servers the day before we sold Fukubukuro 4. This isn't counting the 8 servers running meh.com, 4 servers running mediocre.com, plus our database clusters and caching servers.

So, did we have enough servers? Looks like it. Here's some performance data from one of our meh.com servers. CPU not much above 50% and the memory, disk, and network all look good.

Ok, we should have enough servers. So what happened during Fukubukuro 4?

The Fukubukuro deal started at midnight ET, we got slammed with traffic as expected, and we quickly took 259 orders. Then the meh.com error rate spiked to 15%.

Orders stopped coming in immediately. It's not that they were coming in slowly because the site got slow. They just stopped completely. So what the hell? We can't take more than 259 orders? Where were the errors coming from?

The purple part of that graph above is our Account Service. It's the service that handles account sign ups, sign ins, and storage of payment settings. The average response time of this service spiked to insane numbers.

We began investigating what was causing the Account Service response time to increase. I fully expected this to be related to retrieval of existing payment settings, or doing credit card postal code and security code checks when entering a new payment setting. Instead we identified a huge performance bottleneck when looking up thousands and thousands of OAuth tokens for users who were signing in and signing up for Mediocre Laboratories accounts using Amazon or Google. This is something we didn't encounter last time we had a Fukubukuro deal. It's somewhat related to the increased number of accounts in our database but mostly due to some horribly inefficient code we had handling third-party authentication.
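For a feel of how a lookup like this can go wrong, here's a minimal sketch contrasting a linear scan with a keyed lookup. It is an illustration of the general class of problem, not the actual Account Service code or the fix we shipped; `ExternalLogin`, its fields, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalLogin:
    provider: str          # e.g. "amazon" or "google"
    provider_user_id: str  # ID issued by the provider (hypothetical field)
    account_id: int


def find_by_scan(logins, provider, provider_user_id):
    """O(n) per sign-in: walks every stored record until one matches."""
    for login in logins:
        if login.provider == provider and login.provider_user_id == provider_user_id:
            return login
    return None


def build_index(logins):
    """Built once: maps (provider, provider_user_id) to its record."""
    return {(l.provider, l.provider_user_id): l for l in logins}


def find_by_index(index, provider, provider_user_id):
    """O(1) average-case dictionary lookup per sign-in."""
    return index.get((provider, provider_user_id))
```

The scan costs more with every account added, which matches the "somewhat related to the increased number of accounts" symptom. A keyed lookup (or a database index on the same pair of columns) stays roughly flat as the table grows.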

It took us a bit to identify a viable performance optimization, but we coded it up and deployed it to all 8 servers running the Account Service at 1:05am ET. So we're good, right? Order Service, how you doing?

Yikes. Again, we investigate what's causing the Order Service response time to increase. Again, I fully expect this to be related to creating orders and processing credit cards. Instead we find a bottleneck looking up previous orders to ensure customers are limited to only buying one Fukubukuro. Again, somewhat related to the increased number of orders in our system but mostly due to some horribly inefficient code.
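One common way to enforce a one-per-customer limit without scanning order history is to let the database do it with a unique constraint. Here's a minimal sketch using Python's built-in `sqlite3` module. It is an assumption-laden illustration, not what the Order Service actually does: the table, the column names, and `place_order` are all hypothetical.

```python
import sqlite3


def create_schema(conn):
    # The UNIQUE constraint creates an index on (customer_id, deal_id),
    # so the duplicate check is an index lookup, not a scan of past orders.
    conn.execute(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            deal_id INTEGER NOT NULL,
            UNIQUE (customer_id, deal_id)
        )
        """
    )


def place_order(conn, customer_id, deal_id):
    """Return True if the order was recorded, False if this customer already bought this deal."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (customer_id, deal_id) VALUES (?, ?)",
                (customer_id, deal_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
create_schema(conn)
print(place_order(conn, customer_id=1, deal_id=4))  # first purchase
print(place_order(conn, customer_id=1, deal_id=4))  # second attempt, same deal
```

A side benefit of pushing the check into a constraint is that it still holds when several servers try to insert the same customer's order at the same moment.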

We coded up some performance optimizations and deployed them to all 8 servers running the Order Service at 1:28am ET. We waited for the deployment to complete before moving on to the next bottleneck and then... wait, what? We're sold out already?

Yeah, the performance optimizations worked and the next 870 orders were placed in 145 seconds.
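For scale, here's the arithmetic on those numbers. The total assumes no orders landed between the first 259 and the final 870, which is how it looked on our end but is an assumption in this sketch.

```python
orders_before_stall = 259
orders_after_fix = 870
seconds_after_fix = 145

# Sustained order rate once both fixes were deployed.
rate = orders_after_fix / seconds_after_fix
# Total orders, assuming nothing was placed in between.
total = orders_before_stall + orders_after_fix

print(f"{rate:.1f} orders/second")      # 6.0 orders/second
print(f"{rate * 60:.0f} orders/minute")  # 360 orders/minute
print(f"{total} orders total")           # 1129 orders total
```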

So we're good for the next Fukubukuro right?

Let's not get carried away. I'm encouraged that the homepage benefited from some performance tuning we did after our last OHSHIT REPORT. Even though the Account Service and Order Service had issues, the homepage was generally available while we were fixing other performance bottlenecks. This wasn't the case for Fukubukuro 3.

@snapster and I were working on a supermarket analogy today. For Fukubukuro 3 we had trouble letting everyone in the supermarket front doors. For Fukubukuro 4 we fixed it so that everyone made it through the front doors, but this time they all stampeded the meat counter and overwhelmed the butcher. For Fukubukuro 5 we'll have fixed the butcher but you'll still need to make your way to the cash register, hand over a coupon, pay for your meat, make it back out the front doors without getting trampled, then hop in your car and hope you don't get into an accident in the parking lot.

Or to put it another way: fixing the next bottleneck helps you identify the next bottleneck.
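Here's a tiny sketch of that idea: end-to-end throughput is capped by the slowest stage, so speeding up one stage just hands the title to the next slowest. The stage names echo this post, but the capacity numbers are made up purely for illustration and are not measurements from our systems.

```python
def bottleneck(capacities):
    """Return (stage, capacity) for the slowest stage, which caps end-to-end throughput."""
    return min(capacities.items(), key=lambda item: item[1])


# Hypothetical requests/second per stage, for illustration only.
stages = {"homepage": 500, "account": 5, "order": 20, "payment": 50}
print(bottleneck(stages))  # ('account', 5)

# "Fix" the account stage and the order stage becomes the new limit.
stages["account"] = 400
print(bottleneck(stages))  # ('order', 20)
```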

Are we satisfied with the way Fukubukuro 4 turned out? Not even close. We're going to continue to work on making Fukubukuro run smoother and smoother. Fukubukuro 5 will be better than 4, Fukubukuro 6 will be better than 5. My goal is to keep improving them one after the other.

Oh, and what's the deal with the CAPTCHA stuff?

We recently released a developer API and @DJP519 was concerned this would lead to bots auto-purchasing Fukubukuro. We also really don't want to see bots buying Fukubukuro. We really, really want to see them go to our human customers. We'd rather do CAPTCHA than make you jump through a bunch of silly hoops to find secret Facebook links to prove you're a human. We'll likely be using Google's new reCAPTCHA on the checkout form, but only for Fukubukuro or other special events.

tl;dr

The site didn't work the way it was supposed to during Fukubukuro 4. We didn't enjoy that. We're going to continue to improve things. We're looking forward to Fukubukuro 5.