OHSHIT REPORT: Increased response times at Mehrathon launch and Mercatalyst site outages
At 10:57pm CT we experienced elevated error rates on meh.com. This led to a slight increase in response times starting at 10:59pm CT. Eventually this caused a significant increase in response times at 11:14pm CT and disrupted the Mehrathon launch for most customers until 12:13am CT.
You can see that represented in the Transaction Duration and Transaction Error Rate graphs here.
During this time, our logs were filled with errors from our Redis cluster.
Redis (redis.io) is a popular in-memory data store used by millions of developers as a database, cache, streaming engine, and message broker. We use it for lots of things at Mercatalyst and host several Redis clusters on Microsoft Azure using their managed Azure Cache for Redis service (https://azure.microsoft.com/en-us/products/cache).
This particular Redis cluster had been humming along without any issue for at least the past 7 days (we don’t keep Redis metrics beyond that range of time). At 10:57pm CT the cluster experienced a dramatic increase in CPU usage and memory usage.
At 11:03pm CT we attempted to scale up the Redis cluster to give it additional CPU and memory. By 11:23pm CT the Redis cluster had scaled up and CPU and memory usage had recovered to normal operating levels.
This was enough for some of our meh.com customers to start seeing some signs of recovery. However, response times were still at elevated levels.
In addition, we were experiencing Redis-related errors on most of mercatalyst.com’s other sites (mediocritee.com, sidedeal.com, etc.).
At 11:54pm CT, after numerous attempts to resolve connectivity issues with the newly scaled-up Redis cluster, we forced the Redis cluster to reboot.
After the reboot, at 12:13am CT, meh.com had fully recovered for all customers.
However, the reboot did little to fix connectivity issues with the Redis cluster and many of mercatalyst.com’s other sites.
We took the remaining affected sites offline so we could focus on ensuring stability of the Mehrathon event. As you can see, the transaction duration on meh.com has been much improved over the past couple hours.
After determining meh.com was stable we eventually brought the rest of our affected sites back online. The whole saga put a pretty big dent in what’s been a good streak over the past 90 days.
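For anyone keeping score, the clock times in this report can be turned into rough durations. The snippet below is only a small sketch built from the timestamps stated above; the helper names (`at`, `minutes_between`) and the placeholder calendar date are assumptions for illustration, not part of our actual tooling.

```python
from datetime import datetime, timedelta

# Placeholder date (an assumption): only the clock times come from the report.
LAUNCH_NIGHT = datetime(2000, 1, 1)


def at(clock: str) -> datetime:
    """Turn a clock time such as '10:57pm' into a datetime on launch night.

    Times marked 'am' are treated as falling after midnight, i.e. the next day.
    """
    hhmm, suffix = clock[:-2], clock[-2:].lower()
    hour, minute = (int(part) for part in hhmm.split(":"))
    hour = hour % 12 + (12 if suffix == "pm" else 0)
    stamp = LAUNCH_NIGHT.replace(hour=hour, minute=minute)
    return stamp + timedelta(days=1) if suffix == "am" else stamp


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Events and times (CT) exactly as stated in the report.
first_errors = at("10:57pm")
scale_up_started = at("11:03pm")
major_slowdown = at("11:14pm")
scale_up_finished = at("11:23pm")
forced_reboot = at("11:54pm")
full_recovery = at("12:13am")

print("first errors to full recovery:", minutes_between(first_errors, full_recovery), "min")
print("major slowdown to full recovery:", minutes_between(major_slowdown, full_recovery), "min")
print("scale-up duration:", minutes_between(scale_up_started, scale_up_finished), "min")
print("reboot to full recovery:", minutes_between(forced_reboot, full_recovery), "min")
```

The only wrinkle is the midnight rollover: the 12:13am recovery has to land on the following day, or the durations come out negative.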
Here’s a good place to plug that we’re hiring DevOps Engineers. If you’ve enjoyed this OHSHIT REPORT and would like to help us keep our Redis cluster healthy and meh.com up and running, check out our job postings over at mercatalyst.com/jobs.
- 24 comments, 28 replies
/showme a healthy Redis cluster
@heartny glad to see the /showme command made it through tonight’s saga.
What politician is to blame for this? Perhaps it is the state mine inspector or superintendent of schools!
Cluster is right!
@shawn made such a beautiful RFO it became a post
Maybe if you wouldn’t have sent e-mails to everyone about the mehrathon and let us be surprised like most times, this cluster fuck wouldn’t have happened.
@Felton10 i got an IRK
@therealjrn I didn’t; maybe that is why I am complaining.
@therealjrn Update: I got mine on the first IRK of the morning at 8:40 AM so I can stop complaining.
damn reddit clusters
@carl669 you forgot your opportunity to add to the (cluster) fuck count. There. Fixed it for you. (grin).
Huh. The old ohshit-dot-report domain is gone. Even though it had been pretty much abandoned and redirected to Meh, its existence was a measure of Mediocre’s differentiation from the soulless commercial norms.
I haven’t even read it yet but @shawn and @snapster I love the OHSHIT reports and that is some classic meh.
I did miss those and I know it has to be way less common now that it’s built out. But still. When it happens and a technical share of why. I still really like.
If the DevOps role ever becomes 100% remote, call me.
@jo2y ah, yes. Forgot to mention 100% remote.
so-- you never figured it out, just did the Microsoft thing and rebooted everything??
@caffeineguy It’s on an Azure server. What else are you going to do when it wonks too much?
@caffeineguy @werehatrack AWS
@caffeineguy @capnjb Yeah, but not in ten minutes on a Thursday night.
@caffeineguy @werehatrack Agreed. This is why I don’t sleep too much
Really appreciate including the technical information. I am decent with computers but not the server side of things. Cool to hear a blip of what happens and what is going on.
And as such, everyone who comments on this thread gets a free IRK…
@hammi99 somehow, I don’t think so. That would be too easy. Irks aren’t supposed to be easy.
@hammi99 I think this happened because I actually got through on an IRK… it’s never happened before so I prolly broke the whole system.
@hammi99 @kimberlyc8 You and me both!
/showme a borked Redis cluster
@blaineg Cool! I think I see the problem… Someone left half a Matchbox car in there!
Enjoy the reports. But as soon as I see microsoft named in a ‘productive’ role in the process, all the mystery about what could have gone wrong simply disappears.
/showme a borked mercatalyst
@OldCatLady That poor thing is borked, alright!
So buying monster fingers five minutes into the mehrathon was a huge win.
/giphy I’m a winner
I usually use my PC when I buy things from meh with my phone as a backup. Both were stuck endlessly trying to connect last night. Then I remembered my laptop was on the desk next to me so I fired that up and it connected instantly. When the IRK showed up at 1am I was able to get one using the laptop while the other devices were still stuck on the finger monster page from an hour earlier. Moral of the story: the Internet is dark magic not to be understood by mere humans.
I appreciate this post :) I don’t understand the technical (I’m a biologist, computers are a GUI with magic run by you smart folks in the background), but the graphs and narrative tell a nice story and give some humanity to the “ugh, meh isn’t working, why isn’t it working” frustrations.
I feel like I have a better appreciation for how these issues come up and what the magicians in the background do to resolve it.
@rinrinrin @shawn I agree with @rinrinrin appreciation, and also that someone from Meh was working at that hour! It was a good notification system. What it doesn’t seem to do, though, is help target why or reasons for the glitches. Thus, the solutions are trial and error until the decision was to reboot the system and take the other sites offline (a more drastic approach). Did they figure out how to minimize the issues to prevent future similar issues?
@DTominator @rinrinrin @shawn In my decades of dealing with computers …
@DTominator @narfcake @rinrinrin @shawn
no matter how critical the system or what company there is a “pull the plug” fulcrum and where it lives on the spectrum will depend on numerous factors and who you can get on the phone.
Mostly I’m on the side of “who cares why, it’s stuck or getting worse, get it back online and we will analyze later”. Just reboot the server already.
But it’s usually a slow initial roll and I check my entire subsystem while checking the server.
Then we have to make sure the boss knows or is on. Etc etc. By the time it happens there’s 20 people on a bridge and we are looking for a boss to say go.
When Cluster describes both the service and the state it’s in.
I of course know what Mehrathon is, but what’s Mehcatalyst?
@SylvreKat The parent company. They changed their name from mediocre to that.
@Kidsandliz Ah. Thanks for explaining. I missed the name change, and still thought they were mediocre.
@SylvreKat Well, they are still mediocre, but not not Mediocre.
And while I don’t understand the details of this oh shit report other than something went terribly terribly wrong, it also meant it took 45 seconds for an irk to sell out. As a result I got one because normally they are gone in under 15 seconds and it usually takes me about 15 seconds to load so they are already sold out. Thank you for crashing??? Yes defiantly. Oh wait. You are trying to sell stuff other than irks.
On the serious side, glad you got it fixed. Just offer a pile of irks in a row. That will give you a good test of how well the system can handle the load and if anything will break. We will be glad to “test” that for you under those circumstances.
/showme Redis Clusters breakfast cereal
I appreciate the update. I didn’t know you still worked for this racket.
@Shawn Is this why I’m unable to order a product?
I’ve been trying to order the photo frame offered earlier and keep getting this error message:
YOU DON’T HAVE PERMISSION TO ORDER THIS
Others are having the same issue. Can we please get this fixed before the Mehithon ends??? Thank you!
@Kerig3 that’s the error message you’d receive if you tried to order from the members-only offers of the Mehrathon. We’ll try to make this more clear for the next one.
@Kerig3 @shawn FYI - The member banner disappeared at some point. It was there when it first started but when I looked later it was gone.
@Kerig3 @shawn @speediedelivery Yeah it disappears after member hour is over but the purchase is still limited. Seems like an oh shit thing to me.