OHSHIT REPORT: Increased response times at Mehrathon launch and Mercatalyst site outages


At 10:57pm CT we started seeing elevated error rates on meh.com. This led to a slight increase in response times starting at 10:59pm CT, followed by a significant increase at 11:14pm CT that disrupted the Mehrathon launch for most customers until 12:13am CT.

You can see that represented in the Transaction Duration and Transaction Error Rate graphs here.

During this time, our logs were filled with errors from our Redis cluster.

Redis (redis.io) is a popular in-memory data store used by millions of developers as a database, cache, streaming engine, and message broker. We use it for lots of things at Mercatalyst and host several Redis clusters on Microsoft Azure using their managed Azure Cache for Redis service (https://azure.microsoft.com/en-us/products/cache).

This particular Redis cluster had been humming along without any issue for at least the past 7 days (we don’t keep Redis metrics beyond that range of time). At 10:57pm CT the cluster experienced a dramatic increase in CPU usage and memory usage.

At 11:03pm CT we attempted to scale up the Redis cluster to give it additional CPU and memory. By 11:23pm CT the Redis cluster had scaled up and CPU and memory usage had recovered to normal operating levels.

This was enough for some meh.com customers to start seeing signs of recovery. However, response times were still elevated.

We were also seeing Redis-related errors on most of mercatalyst.com’s other sites (mediocritee.com, sidedeal.com, etc.).

At 11:54pm CT, after numerous attempts to resolve connectivity issues with the newly scaled-up Redis cluster, we forced the cluster to reboot.

After the reboot, at 12:13am CT, meh.com had fully recovered for all customers.
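For anyone who wants the timeline at a glance, the timestamps reported so far can be lined up as offsets from the first errors. This is only a rough sketch for tallying the times stated in this report: the calendar date is a placeholder (the report gives only times of day), and the helper names `at`, `minutes_between`, and `timeline` are made up for illustration.

```python
from datetime import datetime, timedelta

# Placeholder date (assumption): the report gives only times of day in CT,
# so any date works here. Only the differences between events matter.
DAY = datetime(2000, 1, 1)


def at(hour, minute, next_day=False):
    """Return a datetime on the placeholder day, or the day after it."""
    return DAY.replace(hour=hour, minute=minute) + timedelta(days=1 if next_day else 0)


# Events and times exactly as stated in the report (24-hour clock, CT).
EVENTS = [
    ("first errors, Redis CPU and memory spike", at(22, 57)),
    ("slight increase in response times", at(22, 59)),
    ("scale-up attempted", at(23, 3)),
    ("significant increase in response times", at(23, 14)),
    ("scale-up complete", at(23, 23)),
    ("forced reboot", at(23, 54)),
    ("meh.com fully recovered", at(0, 13, next_day=True)),
]


def minutes_between(start_label, end_label):
    """Whole minutes elapsed between two named events."""
    times = dict(EVENTS)
    return int((times[end_label] - times[start_label]).total_seconds() // 60)


def timeline():
    """List of (label, minutes since the first event) pairs."""
    start = EVENTS[0][1]
    return [(label, int((t - start).total_seconds() // 60)) for label, t in EVENTS]


if __name__ == "__main__":
    for label, offset in timeline():
        print(f"T+{offset:>2} min  {label}")
```

By this tally, the scale-up took 20 minutes to complete, and meh.com took 76 minutes to go from first errors to full recovery.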

However, the reboot did little to fix connectivity issues between the Redis cluster and many of mercatalyst.com’s other sites.

We took the remaining affected sites offline so we could focus on ensuring stability of the Mehrathon event. As you can see, the transaction duration on meh.com has been much improved over the past couple hours.

After determining meh.com was stable, we brought the rest of our affected sites back online. The whole saga put a pretty big dent in what’s been a good streak over the past 90 days.

Here’s a good place to plug that we’re hiring DevOps Engineers. If you’ve enjoyed this OHSHIT REPORT and would like to help us keep our Redis cluster healthy and meh.com up and running, check out our job postings over at mercatalyst.com/jobs.