OHSHIT REPORT: The emails will continue until morale improves
When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes. Here's a copy of a recent internal OHSHIT REPORT we created.
tl;dr
We recently gave all our VMP members free socks. @mediocrebot placed thousands and thousands of free orders into our Order Service on their behalf. Unfortunately, some customers received more than one order confirmation email. If you were lucky enough to have this happen to you, you're only getting one pair of free socks. #sorrynotsorry
If you want the nitty gritty technical details about what happened, then read on...
Background
In order to optimize throughput when lots of orders are being placed at the same time, we defer some parts of the order pipeline to be processed after an order has been created. This technique allows us to receive more orders in a shorter amount of time than if we attempted to combine all portions of the order pipeline together when a customer places an order on the website. This is obviously super important when a Fukubukuro deal goes up for sale.
Critical portions of the order pipeline are processed synchronously (payment authorization, available inventory checks) and the remaining work (updating reports, updating internal dashboards, sending transactional emails) is queued for asynchronous processing by other applications and services.
The publish–subscribe pattern
To perform asynchronous processing we utilize the publish-subscribe pattern. The website creates an order using the Order Service. Upon success, the Order Service publishes an order-created message to the Microsoft Azure Service Bus. Any of our websites or micro-services are free to subscribe to order-created messages if they are interested in being notified when a new order has been created. Subscribers get their own queue of messages and are responsible for processing these messages asynchronously and independently.
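For the curious, here is a minimal sketch of that fan-out, assuming a toy in-memory topic rather than the real Service Bus. The Topic class, the subscriber names, and the order id are all made up for illustration; the point is only that each subscriber gets its own copy of every order-created message.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a service bus topic (not the Azure API)."""

    def __init__(self):
        self.subscriptions = {}  # subscriber name -> its own queue

    def subscribe(self, name):
        # Each subscriber gets an independent queue of messages.
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, message):
        # Every subscription receives its own copy of the message.
        for queue in self.subscriptions.values():
            queue.append(dict(message))


order_created = Topic()
email_queue = order_created.subscribe("transactional-email")
report_queue = order_created.subscribe("reports")

order_created.publish({"type": "order-created", "order_id": 1001})

# Draining one subscriber's queue leaves the other untouched.
email_queue.popleft()
```

Because the queues are independent, a slow or broken subscriber does not hold up the others.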
Asynchronous message processing
There are two modes a subscriber can choose from to process its queue of messages. We use both in different scenarios.
The fire-and-forget mode is the simplest. The subscriber is delivered a message from its queue in a single-shot. The subscriber is then expected to process the message but the queue does not care if the processing is successful or not. We use this mode in scenarios that can tolerate if there is an error processing a message (as triggers to update some realtime stats, updating dashboards, etc).
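A rough sketch of fire-and-forget, assuming a plain in-memory queue (receive_and_forget and update_dashboard are hypothetical names, not anything from our code or the Azure library). The message leaves the queue before the handler runs, so a failure simply loses it.

```python
from collections import deque


def receive_and_forget(queue, handler):
    """Deliver each message once; failures are counted and the message is gone."""
    failures = 0
    while queue:
        message = queue.popleft()  # removed from the queue before processing
        try:
            handler(message)
        except Exception:
            failures += 1  # nothing is redelivered
    return failures


dashboard_total = {"orders": 0}


def update_dashboard(message):
    if message["order_id"] == 2:
        raise RuntimeError("dashboard unavailable")  # simulated failure
    dashboard_total["orders"] += 1


queue = deque({"order_id": n} for n in (1, 2, 3))
lost = receive_and_forget(queue, update_dashboard)
```

Order 2 never reaches the dashboard, which is acceptable for stats but not for payments or emails.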
The two-phase commit mode is more robust. The subscriber is delivered a message from its queue, processes the message, and then is expected to notify the queue that the processing was successful. If the queue is not notified that the subscriber has successfully processed the message it is redelivered to the subscriber after a specified timeout period. We use this mode in scenarios where we need to be absolutely sure that we've successfully processed every message (final captures of credit card payments, transactional emails, etc).
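The redelivery behavior can be sketched like this, assuming a toy queue with a fake clock passed in as `now`. AckQueue, lock_timeout, and complete are illustrative names only; the real service bus does the same job with its own API.

```python
class AckQueue:
    """In-memory queue where a message stays until it is acknowledged."""

    def __init__(self, lock_timeout):
        self.lock_timeout = lock_timeout
        self.messages = {}      # message id -> body
        self.locked_until = {}  # message id -> time the lock expires

    def send(self, message_id, body):
        self.messages[message_id] = body
        self.locked_until[message_id] = 0

    def receive(self, now):
        # Hand out the first message whose lock has expired, and lock it again.
        for message_id, body in self.messages.items():
            if self.locked_until[message_id] <= now:
                self.locked_until[message_id] = now + self.lock_timeout
                return message_id, body
        return None

    def complete(self, message_id):
        # The acknowledgement: only now is the message removed.
        del self.messages[message_id]
        del self.locked_until[message_id]


queue = AckQueue(lock_timeout=300)  # 5 minutes, in seconds
queue.send("m1", {"order_id": 1001})

first = queue.receive(now=0)          # delivered and locked
during_lock = queue.receive(now=60)   # still locked: nothing to deliver
redelivered = queue.receive(now=300)  # never acknowledged, so delivered again
queue.complete("m1")
after_ack = queue.receive(now=1000)   # acknowledged: gone for good
```

The message only disappears once complete is called, so a subscriber that crashes or never acknowledges will see it again after the timeout.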
Root cause
The transactional email subscriber of order-created messages responsible for sending order confirmation emails was able to receive and process messages just fine. Under the load of @mediocrebot placing thousands and thousands of free orders, when trying to notify the queue that the processing was successful the Microsoft Azure Service Bus was occasionally responding with an error message of: The server has terminated the request (server busy). Please wait a few seconds and try again.
The queue for this subscriber was configured with a 5 minute timeout. Because the queue never received the success notification, after 5 minutes the message was redelivered to the subscriber and successfully processed again, resulting in a duplicate email.
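To make the failure concrete, here is a small simulation with two common mitigations bolted on. This is a sketch under assumptions, not the fix we shipped: ServerBusyError, complete_with_retry, and the already_emailed set are all hypothetical, and a real record of sent emails would have to live somewhere durable. The first mitigation retries the success notification when the bus says it is busy; the second skips the send if the order was already emailed.

```python
import time


class ServerBusyError(Exception):
    """Stand-in for the 'server busy' response quoted above."""


def complete_with_retry(complete, attempts=3, delay_seconds=0.0):
    """Call complete() until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            complete()
            return True
        except ServerBusyError:
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)  # wait a little longer each time
    return False


sent_emails = []
already_emailed = set()  # hypothetical record of order ids; a real one must be durable


def handle_order_created(message, complete):
    order_id = message["order_id"]
    if order_id not in already_emailed:  # skip the send on a redelivery
        sent_emails.append(order_id)
        already_emailed.add(order_id)
    return complete_with_retry(complete)


def always_busy():
    raise ServerBusyError("server busy")


calls = {"count": 0}


def busy_once():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ServerBusyError("server busy")


message = {"order_id": 1001}
acked_first = handle_order_created(message, always_busy)  # ack fails, message will come back
acked_second = handle_order_created(message, busy_once)   # redelivery: no second email
```

Without the already_emailed check, the second delivery would send the confirmation again, which is exactly the duplicate described above.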
So, what are we going to do about it?
Well, on Wednesday we coded up a quick fix that we pushed out to production so that we could get the remainder of those free sock orders placed. On Thursday we worked on cleaning up that quick fix and making it something more permanent that we can add to our lower-level code libraries. On Friday we hope to start patching up some of our service bus code. We have a lot of it so it's going to take some time.
We truly are sorry about the extra emails. I know it's pretty annoying but we hope you enjoy the free socks anyway.
P.S.
If you followed along and understood all this mumbo jumbo, we have all sorts of tech jobs open: https://meh.com/jobs
- 30 comments, 53 replies
Definitely tl;dr for these eyes.
But all those fancy charts and words say to me: "you folks jammed up yourselves with your own elegant design cleverness."
Now what about the nonsense of getting an email for every single response in the thread?
THAT was grade A bullshit.
The feeling of hundreds of emails of incredibly banal comments felt something like this.
@WilhelmScreamer That's just the way the forums work over at Mediocre and has nothing to do with the above issue. If you post on the (slower) forum over there you become subscribed to the thread and will receive email notifications when others post. It's really easy to unsubscribe, though.
@PurplePawprints It's never happened to me before, I've only gotten emails for posts specifically replying to my own
@WilhelmScreamer I got a few random email notifications earlier in the day. It was a bit odd and made me wonder if I had been playing in my email settings and totally blanked on doing so. You know, some days the blonde really goes all the way to my brain.
@WilhelmScreamer it happens automatically on mediocre.com forums, not meh forums. We should probably make the email design more distinct due to this confusion. cc @dave
@PurplePawprints actually it's not. I've given up on trying, it times out every time. And no, my brand new right red midlife crisis gaming computer is not old and slow.
@Cerridwyn What exactly is timing out? I've never had a problem clicking the unsubscribe button on the threads at Mediocre. Or, subscribing when I want to follow a thread I haven't posted in, for that matter.
@PurplePawprints I actually had the same problem. Clicked unsubscribe and it wouldn't connect through. I gave up and woke up the next morning to find over 2gb of data downloaded by my phone from all of the emails. Not anyone's fault, but I don't have unlimited data.
I love purple.
@snapster Have you tested the unsubscribe for those on mobile? I had issues unsubscribing on my android phone, but since I wasn't expecting emails in the first place I figured the entire thing was FUBAR
But the real worst part of it all? I wear size 14 shoes, so the socks probably won't even fit!
@Jocosity Oh god, this. Size 13 sucks.
@Jocosity I'm sorry. I know how hard it is to get size 13 or more specifically 12.5-13.5 wide dependent on the brand and cut of the shoe. I sometimes envy the size 14-15 crowd as my size seems to be very popular at local retail shoe shopping outlets. I always see tons of 14 and 15 with almost anything in 12.
I know the sock thing sucks. Size 12 socks are slightly small for me and the xl socks are slightly larger than required.
So out of curiosity do you have a hard time buying shoes?
@sohmageek I am also a size 13 shoe. I have a hard time finding shoes and shop by size first, then style, then price. No point in looking for a style you like first when chances are they won't have it. Not many choices when you're looking for something that will fit, not be $150, and not look like a couple of colorful ski boats from the '80s.
@medz @Jocosity @sohmageek Some people think the USA shoe sizes have something to do with inches but they don't. There is only a 1 inch difference between a size 9 (10.5") and a size 14 (11.5").
Socks usually stretch a little so I'm sure if you normally take a size 14 shoe, you can easily fit into a size 13 sock with less than 3/16" difference (interpolated). http://www.i18nguy.com/l10n/shoes.html
@Jocosity Meh... whiners, the lot of you. Cry a river to the guy with size 18 shoes -> this guy.
@WilhelmScreamer 13 here too @jocosity
@Jocosity I get over-the-calf socks. They come up to the point normal socks go on normal people. They're great.
Oh, all right then. I assumed I got two emails because yall were either giving out 1 sock or 2, and if you got 2 emails like me it meant a pair of socks, instead of just 1 singular mediocre sock.
@Tiamat114 I also assumed that they would send the socks one at a time.
@imppersonal @Tiamat114
But...but....it doesn't show up in my order....
@eeterrific You probably are looking here and not over there. Try this: https://mediocre.com/orders
Speaking of job openings, that sounds great! Do you have any openings for a hard working virtual janitor? I can accomplish many things remotely and have tons of trash-to-cloud experience.
@unkabob Just gotta say that I LOVE that guy - the Ancient Aliens "scientist" on the History Channel - such greatness.
@pepsiwine @unkabob He is named The Hair Guy at our house.
@Pamtha @unkabob Yes! It takes a very confident guy to pull off a hairdo like that!
@pepsiwine.. @pamtha ... (sorry, but he's not a scientist)..You ready for this? Meet: Giorgio A. Tsoukalos, Publisher: Legendary Times Magazine.. (I call him the hair guy too).. Always a joy to watch and listen to. He definitely loves his job.
@unkabob He's got a new show, "In Search of Aliens." It's brilliantly bad, but in a good way. It's one of my guilty pleasures.
@LaVikinga ... Mine too. Much more interesting hearing about aliens (from another planet of course) than the last six years of dirty politics. They're both unbelievable yet one could be true.
@unkabob Dirty politics - that is true - it is unbelievably dirty… LOL. With respect to aliens (and their space craft)… if NSA found an alien or space craft I have it on good authority from someone in the extended family who is high up in NASA that they'd parade it up and down the DC mall because of the budget bonanza they'd get from that. I am sure the military would get in on it too - must. destroy. space. aliens. must. protect. our. budget.
@Kidsandliz ... Maybe NASA is the aliens (OMG!!) Are you sure your extended family member is from this planet?☺.. (I know some of my in-laws are suspect).
@unkabob Reasonably certain - if not they are stuck here now as the rockets we always shoot off the car battery at the farm, even with an official NSA count down, did not reach the moon, the top of the big hill, and sometimes not even as high as the old apple tree LOL. Houston they have a problem.
@Kidsandliz ...
So I could follow that. But I don't think following all that qualifies for any of the jobs you have listed. They seem to want a bit more experience than being able to follow along with the OHSHIT REPORTS. Also I don't think that it said we could be remote and Texas is one of the states furthest away from me. However it sounds like you guys have fun with your work. Also thank you. And seems like you guys may need another test to work out the bugs. But sending another product would cost more money.. Why not cancel all the orders and send them again? :)
@sohmageek I followed it and live mere blocks from Meh! But I'm happy with my current job.
Best OHSHIT REPORT title yet. +1 for this failure.
@shawn how was this related to my 5 emails about batteries shipping? Were they just sent at the same time you were entering tons of free sock orders? Or were there tons of battery shipping emails that should have been sent simultaneously? I only got one (pair of) sock email, but 5 battery ones.
@kadagan Same problem but different place in the code. Our upcoming fix should work for both cases.
@shawn I understand message queuing, etc but I'm not specifically familiar with MS Azure. I assume that you communicate with it using AMQP (something that I've been interested in but have no need for at this time) and that your application is not .NET but node.js? Using rabbitmq maybe?
Seems to me that you are building a platform that should be scalable to millions of users. Nice. Let me know when the IPO is.
@Headly The Microsoft Azure Service Bus supports the AMQP protocol but we use their officially supported Node.js library: https://github.com/Azure/azure-sdk-for-node
@shawn Did I just see official Microsoft code on GitHub? Runs to look out window... Nope, no flying pigs. My how times have changed.
If the service is producing an error and still sending the message, are you certain the service isn't going to error and not send the message? It sounds kind of like the error is in an unreliable 3rd party service, and you're just ignoring the error message.
@smartr I read it as the server not sending the acknowledgements within the timeout window. I could be wrong.
@smartr @Headly the service bus is producing an error when trying to delete the message (so it doesn't get processed again) -- we only try to delete the message after we successfully send the email.
Oh yeah? Well I got FIVE confirmation emails! I was expecting to be drowning in socks soon. Alas, just one pair. Yay free socks!
I only got one confirmation e-mail. I don't feel loved enough.
I'm not getting any socks because I ordered the batteries for a friend in Boston and that became my primary address so the socks were sent to that address instead of mine. Contacted CS and was told address cannot be changed.
The exact same thing happened with the TOCCS connector. I will never get either of them but I'm sure my friends in Boston will enjoy my free gift. She already claimed the TOCCS never arrived and I'm sure I'll get the same story on the nice socks too. boo!
@cengland0 I've also ordered stuff for friends who were too cheap to pay 5 bucks shipping or for VMP or had something better to do on Christmas day. But they are in the same town, so no address change. Tell your cheap-ass buddy to either move or get their own meh acct!
P.S.: You're probably already aware you didn't miss much with the TOCCS connectors.
@RedOak The batteries were a gift and she doesn't even know they are coming so it didn't make sense for them to get their own meh account for something I expected to pay for.
Regarding the TOCCS connectors, I actually bought some when they were offered on August 23 (https://meh.com/deals/tocc-s-snap-cable). I like them and make a good connection between my phone and emergency external battery. I would like to have received the extra one but it was sent to the wrong person.
The fix I have for this problem now is order the gift and then immediately delete that address so meh doesn't inadvertently send freebies to it again.
@cengland0 I feel your pain. I think deleting the address should work from here forward, but it also seems like an unnecessary extra step that one could easily forget. Plus you have to re-enter secondary addresses all of the time.
@joelmw Yeah all they need to do is have a space for shipping address… not rocket science.
I thought it was just a separate notice for each sock.
@MrGlass Same here. I was excited to be getting a pair, instead of just the one sock I am expecting now. :(
I'm gonna claim this wouldn't have happened if you were using RabbitMQ or Celery.
'snot true, but I'm gonna claim it anyway, because I'm irrationally allergic to Azure. ;)
@pwinn I love ochre.
I love purple.
I'm getting free socks-so what if I got multiple emails?
On the other hand, if I'd placed an actual order and got multiple confirmations I might have had a heart attack thinking I was going to be charged multiple times. But this was free, so no worries.
So, basically, we helped meh with QA testing and will be paid in socks. Works for me.
Oh man, I didn't know you had job openings! I would have sent you my resume when I was looking for a new job (just started a new role this week).
Other than the whole relocating to Texas thing, which I don't think I'd ever do, but when the weather for Chicago was like this past week, it's kind of tempting...
I think we're all Bozos on this Microsoft Azure Service Bus.
http://firesigntheatre.com/media/audvid/bozos1.mp3
If I'm reading this right we sometimes have a similar issue with credit/debit cards. When a merchant runs a debit/credit card it goes through our service provider to us for approval. We then send that approval back through the service provider to the merchant. Sometimes that approval doesn't get back in time and the merchant is timed out. So we have the merchant seeing a denial while we place the funds on hold as an approval. The merchant swipes the card again and again until the card is maxed out (credit card) or the customers entire checking account (Debit card) is placed on a merchant hold. That's when the fun really starts.
@Mehrocco_Mole this would be why I generally use cash… yah know that green and white stuff that is generally wrinkled, filthy and germ laden.
I love purple.
@Kidsandliz Don't forget the (sniff sniff) white powder stuff.
Is the fix supposed to be working? Because I got 5 shipment notification emails this evening.
@jqubed I got 3 this afternoon. looks like they like you better.
@ekw I got 4 this afternoon, so they like me better than you, but not as much as @jqubed
@jqubed I got none, so they must not like me at all. Maybe I should have guessed that from the giant bubblegum machine.
@shawn Bad news and good news. Bad news we are getting multiple shipment emails. Good news you get to do another deep dive oh shit report into the root cause..
@readnj ha. "good" news.
Fuck morale.
I read these reports every time. Yet I understand nothing reported in them. I do find the content interesting though. Thanks for the explanation.
@jimmyd103 I read them every time too… As far as I can tell, translating to English (my knowledge of technogeek is limited to two programming classes many, many years ago) is: "Our programmers screwed up originally and/or the program has a bug. We are now trying to figure out what is wrong to fix it." Then come suggestions from the peanut gallery: "Have you tried [untranslatable gibberish] to fix it?" and "We use X commercial program which has already been beta tested before being rolled out and it works/can handle the load…" (underlying text "why are you so cheap to be using the shit you are using?") and "Would you like us to try to break it again for you?" and "I love purple".
I love purple.
@jimmyd103 I read them too, then get excited when I understand half of it.
I read them every time and understand them, and it's indeed very interesting.
@shawn No idea if it helps in your specific scenario or if you're using BrokeredMessage, but I use a lock renewal thread for processing messages that will take longer than the SB queue timeout. It stops the messages from going back into the queue. The basic pattern I use is to spawn a background thread that calls message.RenewLockAsync every 30 seconds (half my message timeout of 60 seconds). After that thread is started and we get the cancel token, the processing of the message itself is wrapped in a try-finally and as part of the finally, we cancel the lock renewal thread to make sure the message will be abandoned/deadlettered in the case of an exception.
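A minimal sketch of the lock renewal pattern described in that comment, assuming a generic renew_lock callable in place of message.RenewLockAsync. Everything here (process_with_lock_renewal, fake_renew_lock, the tiny intervals) is hypothetical and only shows the shape of the pattern: renew in the background, and always stop renewing in a finally block.

```python
import threading
import time


def process_with_lock_renewal(process, renew_lock, renew_interval):
    """Run process() while a background thread keeps calling renew_lock()."""
    stop = threading.Event()

    def renew_loop():
        # wait() returns False on timeout (renew) and True once stop is set (exit).
        while not stop.wait(renew_interval):
            renew_lock()

    renewer = threading.Thread(target=renew_loop, daemon=True)
    renewer.start()
    try:
        return process()
    finally:
        # Always stop renewing, so a failed message's lock can expire
        # and the broker can redeliver or dead-letter it.
        stop.set()
        renewer.join()


renewals = []


def fake_renew_lock():
    renewals.append(time.monotonic())  # stand-in for a real lock renewal call


def slow_processing():
    time.sleep(0.2)  # pretend the work outlasts the lock
    return "sent"


result = process_with_lock_renewal(slow_processing, fake_renew_lock, renew_interval=0.02)
```

In the comment's setup the interval would be 30 seconds against a 60 second lock; the values above are shrunk so the sketch runs quickly.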
Yay! Called it.
https://meh.com/forum/topics/free-socks#54e532f2831a2bf4091e43f4
Cool stuff. (Yes, I understood it.) Not interested in Dallas tho and doubt I'm really looking for a new job... more likely just a general malaise with life, the universe and everything.
Happy-Happy, Joy-Joy.......I am ECSTATIC about the new socks en route! Since my still-worthless/upgraded/best-Verizon-internet-connection-available out here where we live in the sticks/country continues to render us inevitably unable to ever score a bag of Fuku, I can only rejoice in this, AND BEG @snapster for yet another event of unmitigated joy akin to the recent neoprene event, which was thoroughly enjoyed by all!
@scfd0766 I'd love some more neoprene... I just finally got rid of most of mine... How funny would it be if it was on a tuesday? :)