OHSHIT REPORT: The emails will continue until morale improves


When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes. Here's a copy of a recent internal OHSHIT REPORT we created.

tl;dr

We recently gave all our VMP members free socks. @mediocrebot placed thousands and thousands of free orders into our Order Service on their behalf. Unfortunately, some customers received more than one order confirmation email. If you were lucky enough to have this happen to you, you're only getting one pair of free socks. #sorrynotsorry

If you want the nitty gritty technical details about what happened, then read on...

Background

In order to optimize throughput when lots of orders are being placed at the same time, we defer some parts of the order pipeline to be processed after an order has been created. This technique allows us to receive more orders in a shorter amount of time than if we attempted to combine all portions of the order pipeline together when a customer places an order on the website. This is obviously super important when a Fukubukuro deal goes up for sale.

Critical portions of the order pipeline are processed synchronously (payment authorization, available inventory checks) and the remaining work (updating reports, updating internal dashboards, sending transactional emails) is queued for asynchronous processing by other applications and services.
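To make the split concrete, here is a minimal sketch of the idea. It is not our production code: place_order, authorize_payment, the order fields and the in-memory deferred_work queue are all hypothetical names made up for illustration.

```python
from collections import deque

# Hypothetical in-memory stand-in for the queue of deferred work.
deferred_work = deque()

def authorize_payment(order):
    # Placeholder: a real implementation would call a payment processor.
    return True

def place_order(order, inventory):
    """Run only the critical steps inline and queue everything else."""
    # Critical, synchronous steps: the customer waits for these.
    if inventory.get(order["sku"], 0) < order["qty"]:
        return False
    if not authorize_payment(order):
        return False
    # Remaining work is queued so that the order can be accepted quickly.
    for task in ("update_reports", "update_dashboards", "send_confirmation_email"):
        deferred_work.append((task, order["id"]))
    return True
```

The customer only waits for the two checks at the top; the three queued tasks are picked up later by something else.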

The publish–subscribe pattern

To perform asynchronous processing we utilize the publish-subscribe pattern. The website creates an order using the Order Service. Upon success, the Order Service publishes an order-created message to the Microsoft Azure Service Bus. Any of our websites or micro-services are free to subscribe to order-created messages if they are interested in being notified when a new order has been created. Subscribers get their own queue of messages and are responsible for processing these messages asynchronously and independently.
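As a rough illustration of the pattern (a toy in-memory model, not the Azure Service Bus API), each subscriber gets its own queue and every published message is copied into all of them. The Topic class and the subscriber names are invented for this sketch.

```python
from collections import defaultdict, deque

class Topic:
    """Toy topic: every subscriber gets its own copy of each message."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def subscribe(self, name):
        # Accessing the key creates an empty queue for a new subscriber.
        return self.queues[name]

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(dict(message))  # independent copy per subscriber

order_created = Topic()
email_queue = order_created.subscribe("transactional-email")
stats_queue = order_created.subscribe("realtime-stats")
order_created.publish({"order_id": 1001})
```

Because the queues are independent, the email subscriber falling behind (or failing) has no effect on what the stats subscriber sees.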

Asynchronous message processing

There are two modes in which a subscriber can process its queue of messages. We use both in different scenarios.

The fire-and-forget mode is the simplest. The queue delivers a message to the subscriber exactly once and then forgets about it. The subscriber is expected to process the message, but the queue does not care whether the processing succeeds. We use this mode in scenarios that can tolerate an occasional error processing a message (triggers to update some real-time stats, updating dashboards, etc).

The two-phase commit mode is more robust. The subscriber is delivered a message from its queue, processes the message, and then is expected to notify the queue that the processing was successful. If the queue is not notified that the subscriber has successfully processed the message, the message is redelivered to the subscriber after a specified timeout period. We use this mode in scenarios where we need to be absolutely sure that we've successfully processed every message (final captures of credit card payments, transactional emails, etc).
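The difference between the two modes is easier to see in a toy model. The sketch below is an in-memory simulation, not Service Bus code (Service Bus itself calls these receive modes ReceiveAndDelete and PeekLock). The SubscriberQueue class, its method names, and the explicit now argument (seconds, passed in so the example is deterministic) are assumptions made for illustration.

```python
from collections import deque

class SubscriberQueue:
    def __init__(self, lock_timeout):
        self.lock_timeout = lock_timeout
        self.ready = deque()
        self.locked = {}  # message id -> (message, lock expiry time)

    def send(self, msg_id, body):
        self.ready.append((msg_id, body))

    def receive_and_delete(self):
        """Fire-and-forget: the message is gone once it is handed out."""
        return self.ready.popleft() if self.ready else None

    def receive_with_lock(self, now):
        """Hand out a message but keep it until complete() is called."""
        self._release_expired(now)
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.locked[message[0]] = (message, now + self.lock_timeout)
        return message

    def complete(self, msg_id):
        """Subscriber reports success; the message is removed for good."""
        self.locked.pop(msg_id, None)

    def _release_expired(self, now):
        # Any message whose lock has expired becomes deliverable again.
        for msg_id, (message, expiry) in list(self.locked.items()):
            if now >= expiry:
                del self.locked[msg_id]
                self.ready.append(message)
```

In this model a message received with a lock and never completed comes back after lock_timeout seconds, which is exactly the guarantee (and the hazard) of the second mode.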

Root cause

The transactional email subscriber responsible for sending order confirmation emails was able to receive and process order-created messages just fine. However, under the load of @mediocrebot placing thousands and thousands of free orders, when the subscriber tried to notify the queue that the processing was successful, the Microsoft Azure Service Bus occasionally responded with this error message: "The server has terminated the request (server busy). Please wait a few seconds and try again."

The queue for this subscriber was configured with a 5-minute timeout. Because the queue never received the success notification, after 5 minutes it redelivered the message to the subscriber, which successfully processed it again, resulting in a duplicate email.
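The failure sequence can be replayed with a small simulation. This is a simplified model rather than our actual subscriber: run_email_subscriber and the busy_on parameter are hypothetical, and the 5-minute wait is collapsed into simply putting the message back at the end of the list.

```python
def run_email_subscriber(order_ids, busy_on):
    """Simulate a subscriber whose success notification sometimes fails.

    busy_on: order ids whose first success notification gets "server busy".
    Returns the confirmation emails sent, in the order they went out.
    """
    pending = list(order_ids)   # messages waiting to be delivered
    already_failed = set()
    emails_sent = []
    while pending:
        order_id = pending.pop(0)
        emails_sent.append(order_id)  # processing succeeds: the email goes out
        if order_id in busy_on and order_id not in already_failed:
            # The success notification is rejected, so the queue never learns
            # the message was handled and redelivers it after the timeout.
            already_failed.add(order_id)
            pending.append(order_id)
        # Otherwise the notification succeeds and the message is removed.
    return emails_sent
```

Calling run_email_subscriber([1, 2, 3], busy_on={2}) sends two emails for order 2: the email itself never failed, only the notification that it had been sent.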

So, what are we going to do about it?

Well, on Wednesday we coded up a quick fix that we pushed out to production so that we could get the remainder of those free sock orders placed. On Thursday we worked on cleaning up that quick fix and making it something more permanent that we can add to our lower-level code libraries. On Friday we hope to start patching up some of our service bus code. We have a lot of it so it's going to take some time.
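This report doesn't spell out what the fix looks like. Purely as an illustration of one common way to tolerate this kind of failure (and not necessarily what we shipped), a subscriber can retry the success notification a few times and remember which orders it has already emailed. Every name below (ServerBusyError, complete_with_retry, handle_order_created, the sent_orders set) is hypothetical.

```python
import time

class ServerBusyError(Exception):
    """Hypothetical stand-in for the "server busy" error quoted above."""

def complete_with_retry(complete, attempts=3, delay_seconds=0.0):
    """Call complete(), retrying with exponential backoff when the bus is busy."""
    for attempt in range(attempts):
        try:
            complete()
            return True
        except ServerBusyError:
            if attempt < attempts - 1:
                time.sleep(delay_seconds * (2 ** attempt))
    return False

def handle_order_created(order_id, sent_orders, send_email, complete):
    """Send at most one email per order, even if the message is redelivered."""
    if order_id not in sent_orders:
        send_email(order_id)
        sent_orders.add(order_id)
    return complete_with_retry(complete)
```

In a real system sent_orders would need to live somewhere durable and shared, such as a database table, so that a redelivery handled by a different process still sees the earlier send.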

We truly are sorry about the extra emails. We know it's pretty annoying, but we hope you enjoy the free socks anyway.

P.S.

If you followed along and understood all this mumbo jumbo, we have all sorts of tech jobs open: https://meh.com/jobs