OHSHIT REPORT: Fukubukuro 4
When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes.
I thought I'd publicly share an OHSHIT REPORT with some details about our recent Fukubukuro deal. This is going to get pretty technical but I'll wrap it up with a summary at the end.
Background
At a high level, the websites we create at Mediocre Laboratories follow a microservices-based architecture. We like to poke fun at Groupon, but their engineering team did a good job talking about the advantages of microservices over a monolithic architecture. Lots of sites are moving toward this architectural pattern. Amazon's doing it. I took Woot in this direction after the Amazon acquisition. We started Mediocre Laboratories from the beginning using microservices.
So you can think of meh.com as an application that relies on several other microservices to do its work. We have services for things like managing account creation and authentication, storing payment settings, displaying deals, showing polls and videos, placing orders, managing forum content, calculating stats, etc. Each one of these services runs on multiple servers behind a load balancer to increase performance and ensure if one server goes down the service continues to be available.
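To make the "multiple servers behind a load balancer" idea concrete, here is a minimal Python sketch of round-robin balancing that skips servers marked as down. It is an illustration only, not our actual Azure setup: the class name, the server names, and the health-marking methods are all made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any marked unhealthy.

    Hypothetical sketch: a real load balancer also does health probes,
    connection draining, and so on.
    """

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


# Illustrative server names, not real hostnames.
balancer = RoundRobinBalancer(["account-1", "account-2", "account-3"])
balancer.mark_down("account-2")
picks = [balancer.pick() for _ in range(4)]
print(picks)
```

With one of the three servers marked down, requests keep flowing to the other two, which is the availability property described above.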
To put things into perspective, we typically run about 14 servers for our microservices. Since we're running in the cloud on Microsoft Azure we scaled up to 36 microservice servers the day before we sold Fukubukuro 4. This isn't counting the 8 servers running meh.com, 4 servers running mediocre.com, plus our database clusters and caching servers.
So, did we have enough servers? Looks like it. Here's some performance data from one of our meh.com servers. CPU not much above 50% and the memory, disk, and network all look good.
Ok, we should have enough servers. So what happened during Fukubukuro 4?
The Fukubukuro deal started at midnight ET, we got slammed with traffic as expected, and we quickly took 259 orders. Then the meh.com error rate spiked to 15%.
Orders stopped coming in immediately. It's not that they were coming in slowly because the site got slow. They just stopped completely. So what the hell? We can't take more than 259 orders? Where were the errors coming from?
The purple part of that graph above is our Account Service. It's the service that handles account sign ups, sign ins, and storage of payment settings. The average response time of this service spiked to insane numbers.
We began investigating what was causing the Account Service response time to increase. I fully expected this to be related to retrieval of existing payment settings, or doing credit card postal code and security code checks when entering a new payment setting. Instead we identified a huge performance bottleneck when looking up thousands and thousands of OAuth tokens for users who were signing in and signing up for Mediocre Laboratories accounts using Amazon or Google. This is something we didn't encounter last time we had a Fukubukuro deal. It's somewhat related to the increased number of accounts in our database but mostly due to some horribly inefficient code we had handling third-party authentication.
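Why a lookup like this gets slow as accounts pile up can be sketched in a few lines of Python. This is not our service code: the record layout and field names (`provider`, `provider_user_id`) are assumptions for illustration, and a plain dictionary stands in for a database index. The first function checks every record, the way an unindexed query does; the second goes straight to the match.

```python
# Hypothetical account records; the field names are illustrative only.
accounts = [
    {
        "id": n,
        "provider": "amazon" if n % 2 else "google",
        "provider_user_id": f"user-{n}",
    }
    for n in range(50_000)
]


def find_by_scan(provider, provider_user_id):
    """Checks every account in turn, like a query with no usable index."""
    for account in accounts:
        if (
            account["provider"] == provider
            and account["provider_user_id"] == provider_user_id
        ):
            return account
    return None


# Build the lookup structure once: (provider, provider_user_id) -> account.
index = {(a["provider"], a["provider_user_id"]): a for a in accounts}


def find_by_index(provider, provider_user_id):
    """One keyed lookup, no matter how many accounts exist."""
    return index.get((provider, provider_user_id))


scanned = find_by_scan("amazon", "user-49999")
indexed = find_by_index("amazon", "user-49999")
print(scanned is indexed)
```

Both functions return the same record, but the scan's cost grows with the number of accounts while the keyed lookup's does not. Multiply the scan by thousands of simultaneous sign-ins and the database has very little time left for anything else.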
It took us a bit to identify a viable performance optimization, but we coded it up and deployed it to all 8 servers running the Account Service at 1:05am ET. So we're good, right? Order Service, how you doing?
Yikes. Again, we investigate what's causing the Order Service response time to increase. Again, I fully expect this to be related to creating orders and processing credit cards. Instead we find a bottleneck looking up previous orders to ensure customers are limited to only buying one Fukubukuro. Again, somewhat related to the increased number of orders in our system but mostly due to some horribly inefficient code.
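A one-per-customer check does not have to walk through a customer's order history. The Python sketch below shows the general idea, assuming a keyed set of (customer, deal) pairs; the class and method names are hypothetical, and this is not the fix we deployed. In a real multi-server service the same guarantee would have to live in the shared database rather than in one process's memory.

```python
class PurchaseLimiter:
    """Tracks which (customer, deal) pairs have already been claimed.

    Hypothetical in-memory sketch; it only works within a single process.
    """

    def __init__(self):
        self._purchased = set()

    def try_reserve(self, customer_id, deal_id):
        """Returns True the first time a customer claims a deal, False after."""
        key = (customer_id, deal_id)
        if key in self._purchased:
            return False
        self._purchased.add(key)
        return True


limiter = PurchaseLimiter()
first = limiter.try_reserve("alice", "fukubukuro-4")
second = limiter.try_reserve("alice", "fukubukuro-4")
other = limiter.try_reserve("bob", "fukubukuro-4")
print(first, second, other)
```

The membership test stays cheap however many orders are already in the system, which is the property the slow lookup was missing.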
We coded up some performance optimizations and deployed them to all 8 servers running the Order Service at 1:28am ET. We waited for the deployment to complete before moving on to the next bottleneck and then... wait, what? We're sold out already?
Yeah, the performance optimizations worked and the next 870 orders were placed in 145 seconds.
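For a sense of scale, the throughput after the fix works out as follows. This small Python snippet uses only the two figures quoted above.

```python
orders_after_fix = 870
seconds_after_fix = 145

# Average order rate once both bottlenecks were cleared.
orders_per_second = orders_after_fix / seconds_after_fix
minutes_elapsed = seconds_after_fix / 60

print(f"{orders_per_second:.1f} orders per second over {minutes_elapsed:.1f} minutes")
```

That is an average of 6 orders every second, sustained for just under two and a half minutes.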
So we're good for the next Fukubukuro right?
Let's not get carried away. I'm encouraged that the homepage benefited from some performance tuning we did after our last OHSHIT REPORT. Even though the Account Service and Order Service had issues, the homepage was generally available while we were fixing other performance bottlenecks. This wasn't the case for Fukubukuro 3.
@snapster and I were working on a supermarket analogy today. For Fukubukuro 3 we had trouble letting everyone in the supermarket front doors. For Fukubukuro 4 we fixed it so that everyone made it through the front doors, but this time they all stampeded the meat counter and overwhelmed the butcher. For Fukubukuro 5 we'll have fixed the butcher but you'll still need to make your way to the cash register, hand over a coupon, pay for your meat, make it back out the front doors without getting trampled, then hop in your car and hope you don't get into an accident in the parking lot.
Or to put it another way: fixing the next bottleneck helps you identify the next bottleneck.
Are we satisfied with the way Fukubukuro 4 turned out? Not even close. We're going to continue to work on making Fukubukuro run smoother and smoother. Fukubukuro 5 will be better than 4, Fukubukuro 6 will be better than 5. My goal is to keep improving them one after the other.
Oh, and what's the deal with the CAPTCHA stuff?
We recently released a developer API and @DJP519 was concerned this would lead to auto-purchasing Fukubukuro bots. We also really don't want to see bots buying Fukubukuro. We really, really want to see them go to our human customers. We'd rather do CAPTCHA than make you jump through a bunch of silly hoops to find secret Facebook links to prove you're a human. We'll likely be using Google's new reCAPTCHA on the checkout form, but only for Fukubukuro or other special events.
tl;dr
The site didn't work the way it was supposed to during Fukubukuro 4. We didn't enjoy that. We're going to continue to improve things. We're looking forward to Fukubukuro 5.
- 104 comments, 156 replies
Thanks and good job, that is some serious on the fly fixing when you release the details of it.
This shit wouldn't happen if everyone would go back to GeoCities.
The transparency is one of the best things about this site. Truly. Much as I'm sure it was a headache and panic moment, bug hunts in code are always a thrill. :) Good hunting, sirrah!
Was Account Services why Labs didn't recognize my (correct) password as valid?
Sometimes, purple really sucks, when it's spiky.
@G1 yes
@shawn Thanks! Educational. I'm pretty sure I'm now smart enough to start a multi-national sales website. Can I get by with ONE Raspberry Pi, or will I need 2?
@G1 start with 1 and when it crashes buy another. in my experience customers don't mind the downtime.
@G1 @shawn Holy crap!! I didn't notice the Y scale when I first read the article. 300 million milliseconds is a 300,000 second response time. That's over three days!
@shawn Jolly good show.. had a cute cartoon but I will be darned if I have learned how to link one!
@mikibell You almost had it. I did manage to make it to it.
I actually will dread the day you guys fix all the bottlenecks. It won't be as special or as much fun when the Fukus sell out in the first half second and everybody just goes back to sleep.
@kuoh if I continue to suck at my job, I don't think we'll ever get there
@shawn Then may the suck be with you.
@kuoh if we ever reach that point, I'll hide some bugs in a feature addition. I got your back!
@kuoh
@harrison @shawn Please, God, no.
I jumped on you guys after the last Fuku, but that was when you were having problems almost every night for about a week. But then you posted a report like this one and I could see you were actively working on it, and you were displeased as well. That really helped temper my temper. This time I held back on expressing my feelings in hopes you'd post another report, and you did. Thank you for these. I find them interesting and well worth reading.
@hallmike BTW: big shout out to @harrison and @lukeduff who identified the bottlenecks. for a while there I was wondering how we were ever going to sell 1000 of these things.
@shawn not to mention @katylava talking through fixes with us!
@harrison ha yeah. hopefully i'll learn more about performance and scalability over time and actually be able to identify problems someday. or we find a dev ops person and i don't have to worry about it?
Thanks for the writeup. I love knowing exactly what went terribly, terribly wrong.
Shawn appreciate the honesty and detail. If your expectations were all met, the site wouldn't have had issues, Most ORT or UAT would never have uncovered these unfortunately.
@readnj I think we should have caught the issue with the Order Service through some straightforward load testing. But to uncover the issue with the Account Service we would need something like tens of thousands of Amazon and Google accounts trying to sign in and sign up. So hard to simulate Fukubukuro traffic.
@shawn We can help you test again. Just offer another Fukubukuro tomorrow and we will help isolate those bugs for you.
@cengland0 @shawn I am sure all the VMP members would love an email inviting us to try to break the servers buying another fuku.
@Kidsandliz OK with me- I couldn't even get to the order page....
@shawn I used to project manage a team that did performance testing on FedEx.com. Couldn't you just fake the Amazon and Google services and their response on your test servers?
@Kidsandliz I know I would.
I really appreciate this. I sadly gave up before the last bottleneck gave way but somehow it feels better. Thank you.
@smoo99 Same. Thanks for all those bottles you popped, team. @shawn @harrison @lukeduff @katylava
@shawn It seems like the email notifications of @ mentions is broken. I haven't gotten any notifications since Fukubukuro 4: The Accounting. When I go to mediocre.com/account and try to change that setting, I get Something Went Terribly, Terribly Wrong. Of course, since it's broken you might not get an email notifying you of this mention.
@SSteve looking into this
@shawn Just got one. And it was from @unixrab. Awesome!
@SSteve Ditto.
@SSteve Yeah, I haven't gotten any notifications since, well, I don't know, because I haven't gotten any notifications. I just assumed everyone hated me.
@shawn Subscriptions to a thread wasn't working earlier either. The Daily Steals RUSH thread on Mediocre.com should have sent me an email even though I wasn't mentioned specifically but it didn't.
@cengland0 and I think we ALL know why... even the nanites deplore you!
@mossygreen Because notifications are awesome!
@SSteve I haven't been getting notifications lately either. I wasn't too sad about it because I forgot that I wasn't getting them - if that makes sense.
@SSteve Me toos!!
"...the performance optimizations worked and the next 870 orders were placed in 145 seconds." The analogy for this? Using the plunger.
Thank you for this. Amazing to see such transparency in a business. Oh, and I am proud to be one of the 259!
It's really cool that you post this stuff. It's actually incredibly interesting to read about what goes into this. And yeah, I did think it was pretty impressive the homepage kept loading instantly. It does say a lot about how well cloud services can help with flash sales like this. It's impressive that we can't cripple your page when stuff like this comes up. Really well deployed infrastructure, all things considered.
@Bingo Also very cool that they can scale up from, what, 14 to 36 servers, basically just for one day. I presume they scaled back down this afternoon, so they only have to pay for the excess capacity for the peak period.
@DJP519 We did scale back down today, but here's a cool thing I didn't mention: we automatically scale up for midnight traffic every night and then automatically back down at 2am.
@shawn That is pretty awesome. Behold, the power of the cloud. Or something.
@shawn Can we set it up so that when my account attempts to make a purchase you automatically scale up as well?
I saw @shawn tagged me in that summary, and I thought... Oh, God... it's my fault. I don't want to be the goat! Thankfully, it sounds like the CAPTCHA process wasn't the bottleneck and wasn't what broke. Thank God!
Everyone vote @DJP519 for February goat
Captcha broke for me at least once. After the first bottleneck cleared around 1:40(Eastern) I made it through to the cart, clicked the captcha checkbox and it timed out. Then I got a browser error that the google captcha service failed to reload with invalid parameters. And reload took me away from my carted fuku and back to the home page. Can you capture the failed captchas from last night?
That "App server historical average response time" graph showing 900 second response times must have been pretty heart-stopping.
@SSteve for reference here's the past 60 minutes (what I'm used to seeing)
@shawn When you need log scale to show high and low response times you know you're screwed!
If you haven't read 'The Goal', you should. It's all about queueing theory and the nature of moving bottlenecks, batch sizes, and the like. You've got a few 'Herbies' you need to take care of.
Great report BTW. Thanks for posting it. I too was one of the lucky 259!
@ACraigL Indeed, this is a great book. If you guys don't have it and still read paper books I would be happy to loan it to you...I can even drop it off.
@Shawn thanks for the report, looking forward to the next Rampage
I love purple.
@Barney Purple indicated badness in this graph, but we forgive purple because we love it. Besides, it's not purple's fault that someone chose it to indicate badness.
I'm all about the stats, 'bout the stats.
@FredGeekPoop ...no bottles.
Thanks for the thorough explanation @shawn. Glad that we can be your capacity and performance testing team :) You can send payment to us via "we'll send you that thing we forgot" :D
Great... NOW the fuku's will be gone in 200 seconds.. ...
@unixrab Like I said during Fuku #2: Never again will it be this easy to buy a fuku. Looks like I was right. https://meh.com/forum/topics/fukubukuro-craps-of-innocence#541909ee575f266c085a2462
@unixrab it is acceptable. Many other entities shall still be mistaken as a robot, allowing us human not robots to acquisitionize the items in positive timing. We who are not robots submit greetings. /end_xmission 46427
@unixrab in 2011 one of the bag of craps sold out in 8 seconds .
@communist well... I have to say:
@unixrab I actually imagine I can see him saying "hashtag"
@chellemonkey Actually I said "hashtag don apostrophe tuh like it" yelled of course for all caps... then I realized I had my literal verbal compiler stuck in the "on" position. Fixed it. hashtagawesome
Thanks for the transparency and thorough break down. You don't really get that much anymore. Here's to Fuku 5!
Thank you for the explanation. Just knowing where the faults lay was good; to resolve it too? @shawn, @harrison, @lukeduff, @katylava, and all the other meh staffers - huge kudos!
On the bright side, I really enjoyed the "Terribly Wrong" song.
@bradandcoffee The first 5 times!
@bradandcoffee It got so bad for about an hour that the song didn't even load for me :) Although... It was funny, My wife hates that my alarm goes off at midnight to check Meh... (I used to check woot too, but given that they are 1 AM I don't want to stay up for the hour...) The alarm didn't wake her, but the first time of terribly wrong did... I muted it after IrK had the first 3 words... She said something like Dan, stop it.. then when it was silent, she continued to sing it in her sleep :)
@shawn, I was told that node.js and mongodb are webscale.
Joking aside, did you see issues with blocking the event loop with your inefficient code that caused the throughput to completely halt, or was it something else?
@joneholland It's deeper than I wanted to get in my post above, but the event loop is something we need to do a better job paying attention to. Need to find some good Azure tools for this. In both cases above our Restify services were sitting around waiting for MongoDB to return data. Having an index that covered both our first-party and third-party authentication paths was one of the fixes. I think we were likely getting by with "fast-enough" table scans on OAuth tokens when the number of accounts was much smaller.
@shawn I assumed there was a non blocking mongo driver? Sounds like your code blocks the loop while waiting on mongo IO?
@shawn Brings back the days when the prof would hit me on the head yelling "Always have a way to exit the loop BEFORE you start it, you idiot!" as I crashed all his shit. This is why I went on to sales and not computer science . . .
@joneholland Mongoose and the native Node.js MongoDB driver are non-blocking. Our code doesn't block the event loop while waiting on Mongo I/O. I can't say for sure but I don't think we had blocking event loop problems. What we did have was a MongoDB server that was taking minutes to return queries once it hit 97% CPU. After some new indexes were built it immediately reduced CPU load to 3%.
I like purple.
@Barney whoah, we broke Barney, now you only LIKE purple. Trying to fly low on the "are you a robot radar" now, huh?
@TaRDy Barney often switches between like & love. I'm not sure if there's a pattern.
as a guy who received a work item two months and about ten distinct redesigns ago, I appreciate the hell out of your honesty and transparency. here's hoping your day to day gains keep up.
So, Fukubukuro 5 hits and Irk gets to sing the "Something went terribly right!" song?
@Pavlov SBTRKT. Good pull.
RIP Sam the Butcher!
Thanks my ninjas.
Am I the only one worried about Meh fixing a butcher? If the butcher wants to have kids, then let him. Fuck...
@LindyNC73 Yeah, let him fuck.
As a fellow IT geek, but not dealing with selling stuff, thanks for putting the OHSHIT reports out publicly. I know how heart stopping it can be to see a massive spike like that and the pressure to quickly solve the problem, especially on a production system.
The unfortunate thing is that if the future fukubukuru's are also limited to 1,000 then the sellout will be approx 4 minutes, based on historical data. Might be fun to release them in smaller batches in windows of time...say 100 per hour or so. Without that kind of limiting you soon will have fixed the big bottlenecks but the smaller/flatter ones will still persist...you need to simulate fukubukuru traffic more often.
@tightwad No
@tightwad No. They would literally sell out in 5 seconds each time they released a batch. At least with a 3-4 minute window, if you're on the ball right at midnight you have a good chance.
I'm still sad
And yeah, that's dangerous @lindync73, very dangerous
I was on at 9, i could meh, i could do the poll, i could not order, well 9 pacific time a good time for a FU
me cries in her coffee, and if there was coffee in the FU, someone will have to die.
So those of us that repeatedly got the Captcha illegible, Cyrillic writing that we had to decipher are just shit out of luck in the future?
Oh well, I had a good run of Fukus and not snagging one doesn't mean I won't keep trying in the future.
Thanks for the explanation and keep on keepin' on guys!
thanks for sharing that. also, great job not only finding the source of the bottlenecks, but also quickly fixing and deploying solutions. glad to see the backend dev team is far from mediocre.
@gato for the record backend dev team and frontend dev team are the same people, for the most part.
Thanks for the write up. And thanks for working to fix all this stuff.
So does this mean that the reason I got a fukubukuro at 12:01 Eastern is because I'm logged in with a Mediocre account and don't have my Amazon or Google accounts linked? Or did I merely get in before the OHSHIT hit the fan?
@SSteve Same here... except I was logged in with Google. I think our advantage was being already logged in and quick on the draw.
@jseay65 @Ssteve I believe both of these statements to be true
@SSteve So, lesson learned today: log in a few minutes before midnight. Son of a...
I was just sad because I wanted fukubukuro 4 to be my first. It was going to be special: I was going to wine and dine my special red sack and show it the wonders of this great city before inviting it into my apartment for some revealing fun.
Things just won't be the same when the next fukubukuro comes around. I don't know if it will appeal to me the way 4 did. Only time for this broken heart will reveal my feelings when the next one graces Meh's front page.
I like purple.
@JadenKale Dude, this is a family site. Don't talk about your sack like that.
@hallmike There was nothing bad said... Sacks are made for putting things in, carrying said items, and taking things out.
I chose to be humorous, not lewd. Marginally suggestive, without being indecent? That's much more fun to get a rise out of people. Admit it... You really wanted one too so that you could see what treasures could be found within.
Just sayin' guys... There would have been a photo montage. Food, candles, a walk on the beach... My first Fukubukuro would have been magical.
Thanks much for the heads-up! I have to get Azure certified for work (sigh) so its interesting to see even fairly generic info about it in use.
I like purple.
Great write-up and explanations for the crazy that was Fuku 4. I gave up at 1:05 after getting hung up on the checkout screen and now I know why. I really need to learn more about Node.js built infrastructures. I'm so used to .NET and I feel so behind on it. Sadly all I use it for right now is to minimize code and images. I should be right at home with it too, I love me some javascript.
I'd almost rather read your (clear, concise and coherent) ohshit report than actually get to buy a fuku, because I don't have to write reports like that any more. It's a delight to read about someone else's work problems. Keep up the excellent communications. Thanks for sharing the fun.
Thanks! It is so awesome to read about what the real people are doing behind the scenes, and what is really going on when things go a bit awry. Most of us have absolutely no idea how challenging all of this is, and what you need to do to make even the simplest of transactions go through seamlessly. I, for one, appreciate all that you accomplish in the name of Meh!
Don't listen to them! It's a cleverly disguised diversion for their REAL inner workings!
@Jarf WHAT PART OF THAT MESSAGE DID YOU NOT UNDERSTAND?
@JonT YOUR FACADE OF STATISTICS IS A CARDBOARD WALL OF LIES AND SHAME. Just add the picture to the "ohshit report"
@shawn Can those of us in the 259 club get our very own badge???
@jseay65 That would be very cool
@shawn Ooohhhhh yes that would be lovely. The 259 club has quite the exclusive ring to it. It could be a crown with 259 written across it. And gold colored. Yes that would be suitable. LOL
@jseay65 259s are lame, the 870 club is where the cool kids are.
I too am a member of the 259 club. I feel very honored.
@BillLehecka All you smug bastiges can shut your collective pieholes. Although I did have a good time reminiscing about the good ol days
@maderagirl Like this?
@Kidsandliz Yup! How'd you know?
Fwiw, WOW. impressed by your Kung fu and yeah, 800+ fuku orders in less than 3 minutes sounds about right. Jeez oh petes!
This is great. Thank you.
Thanks to all the Meh staffers- that's amazing being able to come up with, code, and deploy in such a short time frame. You rock!
I'm curious what the performance bottleneck was with 3rd party signin, and how you fixed it. I'm in the thick of making a service that handles all that for my company.
Also, thanks for staying up late and coding under pressure to get everything fixed in < 2 hours.
259 AND east coast ftw. Thanks for all the info guys. As someone who was once a Comp Sci guy, I'm still intrigued by the processes behind the scene.
@Cinoclav If you're on the east coast, I'd say being one of the 870 is more impressive - you stayed up for almost two hours of failure to suddenly make it in. (I did too, but it wasn't even midnight yet)
@davidgro I did that last time. Turned out after wasting an hour and a half my purchase had actually gone through right after midnight. This time I was lucky enough to have gotten it in about 20 seconds.
It wasn't the bottleneck that screwed me up. It was that damn Captcha. You see I actually signed on at the exact time of 12 EST, saw what was being offered and went right for the "Buy One Link", and without reading or seeing the Captcha keep pushing the "Buy" link.
It was a good 15 sec before I even saw the "I am Not a bot", then kept checking that box only to be rejected over and over......FINALLY realizing I needed to fill in the Captcha number BEFORE checking the "Box" and then pushing the Buy Box. By that time it was all over. Guess I learned to be more observant in the future.
i was wondering before the fuku, and then during the fuku what kind of hours you guys work. do you take turns working each sale? do you just start your work day late? one of the things i really like is that there does usually seem to be someone around commenting on posts. I know you have experienced a bunch of bitching the past few days, but i really quite enjoy this website and the community it has inspired.
@vampje I used to work at an ecommerce site with sales that launched outside the workday. Basically everyone but IT worked regular hours, IT worked regular hours + being on call at all times and made oodles of money because of it. Hopefully since Meh is so small they're raking in all the money themselves anyway.
@vampje Our business is somewhat unique in that it sometimes requires the development team to work at midnight when new deals launch. We have core office hours between 10am - 4pm to give us a consistent window to collaborate and schedule meetings. Some folks work earlier or later depending on what projects are going on.
@vampje along with the devs, myself and usually a few other Mediocre people are around every night at midnight after working from somewhere between 7am-6pm.
@JonT @shawn i thought it might be something like that. I hope things go a bit smoother so you can get some sleep :)
@JonT I assumed you did the late night forum crawling at home in bed. Do you actually do that in the office?
@SSteve I am at home when I work at night, though not always in bed.
@JonT Ok, but if you're not in bed I hope you're drunk at least.
@SSteve only about 89% of the time.
Now, @JonT, that's only a B+. I think we both know you can do better than that if you apply yourself a little more. I want to see a solid A for drinking on your next report card.
This makes me feel better.
@cercopithecoid test
Thanks for this report, really interesting read.
TL:DR, I gave up about 3 minutes before you got it all running. Timing, I have it!
It is really cool to see the behind the scenes of what went wrong, how you fixed it, and how none of that will really affect my odds of scoring a fuku next time.
@Thumperchick Never Give Up.... AND NEVER SURRENDER!
@Thumperchick I think the same thing happened to me. Finally gave up and got ready for bed around 1:30. Before falling asleep, checked my phone one last time right before 2:00, and it was sold out. D'oh. If I'd only tried for a few minutes longer...
So I was right on Twitter last night. It was the authentication service that went down. (I've been in the SSO industry for 2 years now)
@jont Since I figured out the problem, do I get the chance to Buy a fuku now? :)
@j8048188 I feel like figuring it out before they posted what the problem was would've been more impressive, yeah?
@HELLOALICE
I did.
https://mobile.twitter.com/j8048188/status/555255334990671872
So was the fact that I have a meh.com account and not a linked account from somewhere else, was that beneficial?
@HELLOALICE nope, the Oauth from amazon and google actually broke the log in system
Worst fuk attempt ever.
I've only been in front of my computer to see this one live, and I was NOT impressed. And to have a round of Irk singing - nay TAUNTING me, well, Let's just say I gave up within 30 minutes.
@lowerone quitter
@lowerone should have became friends with Irk
@lowerone I turned off the speakers.
@lisaviolet @lowerone My speakers are always off/turned very low. Stealth mode FTW.
Yeah, if you are going to put in captcha to stop bots, fine. Just don't put in Facebook links or something similar. Believe it or not, there are still many people out there that do not have Facebook accounts, often over privacy issues, and some that don't even have cell phones (for real!, I know a bunch of them).
@Steve7654 I have only had a cell phone ($10 clamshell) for a bit under 4 years. I keep it in the 1990 ghetto van along with a phone book. And some towing business cards. Don't use it for much else (mostly because if I didn't keep it in the ghetto van I'd forget to take it with me when I went somewhere).
The reCaptcha isn't fair for people who Google selects to have to enter the Captcha manually (their mouse tracking algorithms rarely work if you're on a laptop).
@on I'm on a cell phone and it always has me enter in the word after clicking the check box, and it did so last night but I was still able to grab a fukubukuro.
@on I was on a laptop and just clicked the "I'm not a robot" button.
@editorkid I clicked that on my mac and then had to enter 15561 except that it was barely legible.
@editorkid Me too, I wish it was always that easy!
That error allowed me to sneak into the market and squish my way up in line screaming at the butcher that I want my meat and somehow got my meat before people who were waiting for 40 or so minutes longer then me when I strolled in at 12:40 i did wait that hour or so when you guys did that deploy thang. I did appreciate you guys learning something new with this fuku and appreciate the transparency of the situation
MEH being open is why I love your company. Real problems happen, putting a towel over a pile of crap in the corner still stinks. Putting it out in the open lets nature do its thing. Your openness lets everyone learn just how fucking hard it is to do something that looks so easy. Very cool, and my hats off to the rad programmers that had to try and keep fuku 4 from being a multiple year bag of fun. Selling one item every time Pluto orbits the sun. GREAT JOB TO ALL OF YOU!!!!!!!!!!
I tried for an 1 hour and 1/2 because I kept track on Twitter that you were working on it...fell asleep..woke up from said sleep at 230a JUST to try again..sold out. The Report was great but all the comments on it were so entertaining that I don't mind the missed sleep. One question, do they sell a Rosetta Stone for the language being spoken in the reports and comments? I wouldn't meh that!
OK so this explains why when I F5’d at midnight like a diligent and devoted meh East Coaster, immediately got to the order screen, successfully proved I was not a robot, got stuck in green bar hell until my order timed-out and didn’t go through? And why I spent the next 2+ hours dealing with my alleged invalid password, failing Google and Amazon logins, “something went terribly wrong” messages, multiple devices and browser sessions, only to foolishly rejoice when I once again got to the order page to be confronted with unintelligible Captcha gibberish and then find out the fuku had sold out without procuring one? Maybe the site should be renamed Oy!?
I love purple.
Very interesting. Glad to get that perspective. I was there at 12:00 the second after it turned over but decided to read the thing before clicking buy - my first Fukubukuro, I didn't know they went quickly. By the time I tried to buy it hung up. Refreshed several times in the next hour or so to try again but fell asleep. Woke up around 2 but it was too late.
Next time I'll know to click buy the second the page loads :)
@daveJay The question will be can you buy the second it loads enough to get at the buy button or do you have to wait until the page finishes loading. Click the meh button before it is done loading and if you want it to stay clicked you have to come back later to do it again (learned that the hard way several times).
@daveJay Clicking the buy button the second the page loads sadly didn't work for me (see above).
@Kidsandliz I hit F5 for 2 hours & finally got one (not one of my most productive evenings), but I forgot to hit the meh button later (I think, who knows after that) and another streak lost... Oh well.
@dfunk29 I didn't know clicking the meh button even if you buy was a thing. I only click meh if I don't want it
@heartny bummer. I guess we have to go through the rest of the steps lightning fast as well
I tried and tried to get it, to no avail.. Boohoo. Kept getting the something has gone terribly wrong...
Thanks for the report, I'm slowly getting over hitting F5 that many times & having to sign in each time for checkout. I thought I was going crazy for a while... But then it finally connected, after 2 long hours. I'm eagerly awaiting my first fukubukuro & I know it's gotta be better than that last box full of poo I got from Whotever.
Thanks for the update! It's nice to see that when there's a problem, some companies don't just try to hide it. They take accountability.
Interesting read, but I'm still sad. Thanks for the info!
@sugarpike cheer up sucker
@shawn Show us the logistics behind the lucky-bag packaging!
This was fascinating. Thank you. Keep up the tech reports.
You guys are, once again, awesome. Glad you had the right TP to wipe out this OhShit.
What can I say I only stayed up for 2 hrs trying to buy a fuku. Fuck it!
Is there a support group for all of us who become frantic to buy a product that is Meh. Saying fuk u? Masochistic much? And yet I'll squeal like a little girl when I get one. It will be mine. Oh yes, it will be mine!
Great explanation. It's much appreciated. Sorry about what I said last night...and a few minutes ago. Fuku!
Yes... look forward to Fukubukuro 5: The Kobayashi Maru Experience. Live long and suck it!
Please add a 1-click buy button like Amazon. This will make checkout easy and fast. It will simplify your event loop and other code. It will encourage new customers to sign up for an account before the next Fukubukuro.
Another idea is to try a Fukubukuro lottery and email special coupon codes or custom web links to lucky winners. This will match the spirit of a bag of blessings or treasures. It will reduce bot-buying without user-unfriendly Captcha. It will also encourage new customers to sign up for an account and join the email list before the next lottery.
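A minimal sketch of the lottery idea above, assuming entrants are account usernames and winners are drawn uniformly at random without replacement. The function name `draw_winners` and the sample usernames are hypothetical and not part of any Mediocre system.

```python
import random

def draw_winners(entrants, num_winners, seed=None):
    """Pick distinct winners uniformly at random from the entrant list."""
    rng = random.Random(seed)  # pass a seed only for repeatable testing
    unique_entrants = sorted(set(entrants))  # one entry per account
    count = min(num_winners, len(unique_entrants))  # never draw more than exist
    return rng.sample(unique_entrants, count)

# Hypothetical usage: "bob" entered twice but only counts once.
winners = draw_winners(["alice", "bob", "carol", "dave", "bob"], 2)
print(winners)
```

Deduplicating by account is one possible way to keep a single customer from stacking entries; it does nothing about one person holding several accounts.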
Thanks for posting this! As a programmer who is doing trial by fire (every day starts by thinking "I don't even know what I'm doing"), this is all very interesting and might help me with future problems (that I likely created in the first place).
And it's kind of nice to know that I somehow slipped into that sub-3 minute window, after an hour of trying and right as I was about to give up. :)
Adding an index to speed up searches alleviated the symptoms, but it did not fix the root cause. The real problem is user requests arriving rapidly at random times and interrupting each other, so none of the requests ran to completion. Then, the requests timed out, started over, and repeated the vicious cycle. Eventually, all CPU and memory bandwidth became consumed by event switching without getting any real work done.
@mediocrelab Sounds exactly like a mom with five kids.
To turn this chaos into order, please change your server code to process one transaction at a time. Finish one request before it services the next request. Switch tasks only when it is waiting for the next user input.
For example: Let's assume you have 5 servers processing credit card payments and obtaining each credit card approval takes 1 second on average. Then, 5 servers can process 5 requests per second, or 300 requests per minute.
This will simplify your code and make it more scalable. Instead of 5 expensive high-end servers, you may buy 20 cheap low-end servers to process 4 times more requests, or 1200 requests per minute.
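The arithmetic in the example above can be sketched as follows. This assumes each server handles exactly one request at a time and that the 1-second approval time is the same on cheap and expensive servers (plausible if the wait is dominated by the card processor, but that is an assumption). The function name `requests_per_minute` is made up for illustration.

```python
def requests_per_minute(servers, seconds_per_request):
    """Throughput when every server processes one request at a time."""
    per_server = 60.0 / seconds_per_request  # requests one server finishes per minute
    return servers * per_server

print(requests_per_minute(5, 1.0))   # 300.0
print(requests_per_minute(20, 1.0))  # 1200.0
```

Under this model throughput scales only with server count, so any time a server spends idle waiting on the approval is capacity left unused.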
@mediocrelab I am also not a developer, but I think you need to stare at the charts more. this kind of volume requires asynchronous messaging (e.g. not sequential process)
@mediocrelab sounds like you totally know how to fix everything. You should apply to work here.
@mediocrelab Designing robust software for bursty high-Internet-demand financial transactions is much harder than you think it is.
@mediocrelab Thanks for the tech breakdown in the OhShit report! It is refreshing to do business with a business that is welcomingly transparent. For those of us (myself included) that were frustrated at getting shut out of the Fuku4 despite attempting to buy it since shortly after 12am EST when it came out, I think there were 2 main sources of frustration:
1) The inefficient code that caused many of us who tried to order shortly after the event started to be unable to complete our orders; and the subsequent fix that wasn’t completed until almost 2 hours later. This is the main issue that you've addressed in your ohshit report. HOWEVER, there is also a secondary issue that frustrated those of us who were "in queue" early and yet got shut out of being able to buy the Fuku and that's the second source of frustration, and something I have not yet seen addressed substantively in any comments from Meh:
2) That those of us who were in the queue early enough to have been able to buy it before it sold out were later shut out of being able to buy because your fixes to the inefficient code enabled customers who came along LATER on in the queue to be able to buy it BEFORE those of us who had already figuratively "been standing in line for almost 2 hours".
That part of the customer frustration has yet to be addressed (aside from a few snarky comments here and there about not complaining about such a Meh product, etc.) But you can see how the "fixes" you guys came up with to free up the various bottlenecks that night essentially left people who had been standing in line waiting (trying) to check out at the "store" left behind while people who came into the store just a couple seconds prior to the fix were able to check out in front of us. The brick and mortar store equivalent of line jumping.
And if that had happened in an actual brick and mortar store people probably would’ve resorted to downright fist fights and violence (especially if it was an Apple product they’d been standing in line for 2 hours waiting to buy). My comments here are not meant to be a complaint, but rather more about trying to raise awareness of a possibly overlooked source of customer disappointment and suggest a reasonable alternative in the future that may help alleviate the "line-jumping" that has often taken place at these computer-related snafus. So should any inefficient code or unforeseen product-purchasing bottlenecks take place in the future, surely you all can find a way to honor the order attempts IN THE ORDER THEY CAME IN, rather than allowing a bunch of late-comers to snag up product that other people were already in line waiting to buy while the store fixed its various problems. And yes, I understand how time-consuming an endeavor like that can possibly be to undertake – especially in a system that’s set up to handle things in batches – but considering this (hopefully) won’t happen all that often (much like an Apple product launch event) – then surely you guys can put forth the effort to work out a system to make sure that purchase attempts get honored in at least approximately the order they came in – even if that means having to do a few things manually, and even if that means having to set the website to show that the product is sold out to make sure that it doesn’t accept any new orders until you’re sure there is stock left from fulfilling all the purchase attempts that were sidelined by your technical difficulties. BUT, (fingers crossed), hopefully this is a moot issue and you guys will never again have server problems, code inefficiencies, purchase bottlenecks, etc. But if you do, please consider a more chronological approach to fulfilling orders once the problem is fixed rather than a line-rush.
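One way to picture the "honor attempts in the order they came in" suggestion is a first-come, first-served replay of the sidelined purchase attempts once the fix is deployed. This is only a sketch: it assumes each failed attempt was logged with a timestamp and a customer name, which the report does not say was the case, and `fulfill_in_arrival_order` is a hypothetical name.

```python
from collections import deque

def fulfill_in_arrival_order(attempts, stock):
    """Replay logged attempts earliest-first until stock runs out.

    attempts: iterable of (timestamp_seconds, customer) pairs.
    Returns (fulfilled_customers, turned_away_customers).
    """
    queue = deque(sorted(attempts))  # earliest timestamp at the front
    fulfilled = []
    while queue and stock > 0:
        _, customer = queue.popleft()
        fulfilled.append(customer)
        stock -= 1
    turned_away = [customer for _, customer in queue]
    return fulfilled, turned_away

# Hypothetical log: seconds after midnight, two bags left.
log = [(7200, "latecomer"), (3, "early_bird"), (60, "patient")]
print(fulfill_in_arrival_order(log, stock=2))
# (['early_bird', 'patient'], ['latecomer'])
```

The site would still need to stop accepting new orders while the backlog is replayed, as the comment itself points out.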
@pepsiwine i think you meant to reply to @shawn. @mediocrelab, despite the name, is not a flask.
So there's nothing wrong with my password that took 3 weeks to memorize and I still usually fuckup and can't remember w/o 4 tries?
He forgot to mention when Fuku 5 was scheduled...
I WANT MY FUKIBAZUKITHINGY, DAMNIT!!
Or at least some sushi.. you owe me that much.
And what's this about buying meat? I missed the meat sale. Is meat.meh.com coming soon?
@shawn can you elaborate at all on your troubleshooting process - how did you track down each of those bottlenecks? Looking at the graph gives you a good place to start, but after that how did you drill down?
this was a fascinating situation - being a dev myself, we're usually only on the backend of these things. to see the user result on the front (i got one finally!) and then read this follow up report on the backend was great!
any further insight would be great - and thanks for being so meh.
I really appreciate the ohshit reports. Thanks for the hard work.
I just wanted to say:
1) Thanks. This proves your awesomeness. Ok, really, not all by itself, but it's consistent with all of the other awesome-proving evidence.
2) Thanks for turning off notifications for a while. I'm just going to take that as a personal favor to the goat. So, yeah, thanks.
3) If I said anything that hurt anyone's feelings, well, fuck, I'm sorry they're so goddamned sensitive. Ha. ;-) Oh, I don't even know. No, really, I'm essentially clueless. I am probably sorry though. Let me know what it was. I hear rumors. Rumors tell me nothing.
Awesome details, thanks for the report. Did I understand it at all? Nope.com However, I now do understand one thing very much. Know what it is? It was truly amazing that I got in during that less than 3 minutes before the sell out in order to purchase my very first Fuku after spending almost 2 hours trying. So, I'm awesome. Thanks for the validation of my awesomeness. I breathlessly await my new bag.
Bath.
I love Purple.
@debraae I love purple!
@Barney YOU love purple too?!? #twinsie
As others have said, I have no idea what the details of all that meant but I really appreciate the transparency. @shawn I think you're doing a great job.
The Internet reminds me of the automotive industry. 100 years ago, to have a car, you had to be a mechanic. As cars became more advanced, the number of things the user could fix went steadily downhill. Now, nobody fixes anything… The blackbox tells you what part needs to be replaced. As the world becomes more connected, more and more of us are more and more dependent on a smaller and smaller percentage of really smart guys who keep everything running.
Today it was a lucky grocery sack that made a couple of people stay up late trying to fix things… but at some point, it will be Joshua trying to play global thermonuclear warfare.
@saodell Not guys. Folks. People. Coders can be girls, remember. Teach your daughters and sons!
@saodell Just show Joshua how to play Tic Tac Toe and things will be fine.
@shawn Most excellent.
I like the explain and the quick fixing of problem n is clearly a result of really good people, even if it only exposes problem n+1.
One request tho: please have someone sing the next OHSHIT REPORT, perhaps to the terribly wrong song melody.
@baqui63 i nominate @Moose for singing the next report
@katylava I second @Moose 's nomination!
@baqui63 I third @Moose's nomination.
@katylava @baqui63 @Thumperchick I don't like this. I can give a reading but no singing.
@Moose I'll settle for a dramatic reading. Would that be acceptable @katylava @baqui63 @Thumperchick?
Works for me, @dashbutt ... @Moose ? @katylava ? @Thumperchick ?
@baqui63 @dashcloud @Thumperchick i can vouch for @Moose's 5-star dramatic readings
@katylava @baqui63 @dashcloud @moose sounds good to me.
@baqui63 How about a rap? That could potentially break the interwebs.
Even though I'm still depressed that I kept on refreshing for more than an hour and still didn't get a Fuku, this real-time bug patching is really amazing.
I worked a bit for PayPal in its earlier years... and it took PayPal up to a week to fix major loopholes and such even before it got purchased by eBay. (especially considering the sensitivity of money-related services)
Good job @shawn
@on
just fix it so it works, said the guy that thought he had a fuku bag, but it turns out the order didn't go through.
thank you
@somf69 "What he said," said the guy whose order was processing, processing, processing, processing, processing, processing, processing, processing and then timed out and everything went to hell.
Why not just limit the number of orders in intervals throughout the day? You could set it up so only 25% of stock becomes available for purchase in the first 8 hours, 50% in the next 8 hours, and 25% in the last 8.
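A rough sketch of the staged release described above, assuming three 8-hour windows measured from the moment the deal goes live and using the 25% / 50% / 25% split from the comment. The names `RELEASE_SCHEDULE` and `released_units` are invented for illustration.

```python
# (window start in hours, percent of total stock unlocked at that time)
RELEASE_SCHEDULE = [(0, 25), (8, 50), (16, 25)]

def released_units(total_stock, hours_elapsed):
    """Cumulative units made available for purchase so far."""
    percent = sum(share for start, share in RELEASE_SCHEDULE
                  if hours_elapsed >= start)
    return total_stock * percent // 100  # integer math avoids float rounding

print(released_units(1000, 0))    # 250
print(released_units(1000, 8))    # 750
print(released_units(1000, 16))   # 1000
```

Unsold units from an earlier window simply stay available here; whether they should roll over is a policy choice the comment leaves open.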
@Kishin That would be unfair to those of us on the east coast.
@OldCatLady I was thinking it would be more fair. People feel it's unfair when they wake up to see something they wanted to purchase sold out. Setting a cap on orders throughout various times in the day gives people a second chance to get something they want.
@OldCatLady it's strange to me how only the east coast complains about how "unfair" everything is, considering deals are posted based on your timezone. I vote deals go up Midnight Central time. That will definitely be more fair.
@giroro Other time zones complain about "fairness" - they just don't complain about rollover time.
@giroro That's me told, then.
Reading this gave me a nerd-boner. Troubleshooting is what I live for...the details were great.
Awesome explanation!
This is like basic econ: If you've got a good paying job that you like, then the economy is fantastic. If you lost your job then the economy sucks.
Or, more specifically, if you got in on the Fuku, it went well. If you didn't, then, well, you're pissed.
(And I'm not pissed.)
Personally, as someone who runs a few "cloud" datacenters with some pretty complex websites, I get it completely.
And I appreciate the explanation - (especially?) including the level of detail. That would never happen at, say, woot!
@shawn Thanks for the write-up; this was excellent! I've been coding .NET for years, but have yet to jump into Azure very much. Any qualms with it? The company I work for is just starting to take a hard look at it.
I like purple.
@Barney fuck yeah yah do.
@shawn Do you guys stay at the office on nights you sell the fuku or do you telecommute?
You guys did get it fixed quick for what problems you had. I commend you on it. No code is perfect code but at least you got a fix out asap.
@Skylord123 google hangout
As someone who works in IT in a large company, I am damn jealous at how fast you were able to deploy this code fix. We'd be stuck adding the fix to a sprint, coding it, getting it verified by QA, then UAT, and THEN a month later, production
@yankeesrule And heaven forbid you have to audit the process for SOX purposes...
@yankeesrule
@yankeesrule When you have a small shop, the UAT is by your customers. ;-) Thankfully they're both patient (mostly) and forgiving (mostly).
@yankeesrule why the month between qa/uat and prod? i work for a large behemoth, and we put fixes in sprints, but as product i do my uat during the sprint. things go into prod days later, mostly because we tend to release on the same day each week.
@vampje I meant a month of UAT. Didn't write that really clearly.
I slept through it, so it's my own damn fault I didn't get Fuku #4. But, thanks for the ohshit report anyway. It was an interesting, but totally over my head, read.
I always forget I have this funny little extension in Chrome.
@chiphead cloud to butt is the best!
@chiphead Hopefully it doesn't do it with substrings because, well, I would find it unflattering to say the least.
@chiphead @JonT That sounds like fun. But I would probably do another one of these. And once was enough. https://meh.com/forum/topics/wtf-per-se
@Buttscout
@joelmw That's some awesome goat work there, putting one of your greatest hits out there like that.
@JonT @joelmw @Cloudscout @dashcloud My post was about this, and I still seriously thought for a few minutes that two of you had the usernames of @Buttscout and @dashbutt
@chiphead That's pretty amusing.
@dashcloud "No, I don't know Your Monkey Has His Balls In My Beer, but if you hum a few bars, I can fake it."
@joelmw That's awesome!
@JonT Just looking at the screenshot do you have the power to edit everyone's comments?
JonT edit: I totally do
I'm not sure whether to be amused or afraid...
@dashcloud I was pleased to find it. :-)