OHSHIT Report — InfluxDB crashed and took three days of visitor stats with it
When something goes terribly wrong here at Mediocre Laboratories we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to help share details across the company about what went wrong, coordinate work across teams for how we're going to fix it, and dive deep into root causes.
Part 1: What Happened...
The bad news comes first. All of the visitor stats for April 30, 2015, as well as some of the stats for April 29 and most of the stats for May 1, are gone. Vanished. Lost to time and space, known only to the alternate universes in which our statistics database didn't crash.
Background
As previously mentioned here, we use InfluxDB to store and add up the counts of each and every visit we get here. After upgrading to version 0.8.3¹ on April 29, InfluxDB crashed at 11 PM CDT and stayed down through the night, causing loss of visitor data. After the site was brought back up and the original issue fixed on April 30, we upgraded InfluxDB to version 0.8.8. On May 1 at 5:30 PM, InfluxDB crashed again, taking all the visitor records we had stored with it.
We were, fortunately, able to restore to a backup we made while version 0.7.3 was installed, at the cost of losing all the data we had successfully collected between the backup's creation and the restore.
Visible effects
For those of you who do go back and look through our deal history, you'll notice that some things are off. For April 30, visitor-related stats are completely gone. As for April 29 and May 1, there's stats, but they're not quite right. There's a relatively small piece missing on the 29th, but the 1st only has about two hours worth of data, which means the "didn't bite" percentage and visitor counts are both much too low, and the referral statistics are only valid for referrals during that period.
How does something like this even happen?
We'll go into more detail in part 2, but the problem boils down to resource usage. InfluxDB 0.8 retrieves data much more quickly than InfluxDB 0.7. However, we did not anticipate the level at which the new version of Influx consumed resources and were caught unprepared when, much to our chagrin, it ran out of memory. The third crash corrupted our data, effectively destroying all of it. The third crash corrupted our data, effectively destroying all of it.
We wanted to get data quickly. We quickly wound up with no data. Such is the cost of hubris.
Part 2: ...And Why
If you're not interested in the intense technical details, you should stop reading here and scroll down to the comments. However, if things like "ulimit" and "lock file" mean anything to you, trudge on, dear reader.
Failure 1: Too many open files
The installation guide for InfluxDB 0.8 recommends either removing the open files limit in Linux or setting it to 65536. In fact, Influx 0.8.8 does this on every single run of the init script, no matter what you're doing:
$ sudo service influxdb stop
Setting ulimit -n 65536
influxdb process was stopped [ OK ]
Influx 0.8.3, however, does not, and the migration guide from 0.7 to 0.8, which we were working from, omits this recommendation. We use InfluxDB's UDP interface along with our home-grown node module to send visitor data rapid-fire to InfluxDB with minimal overhead for the site. Normally this is great. However, the new, sharded method InfluxDB uses for writes in 0.8 opened files past our limit, which at the time was 1024, during the surge of visits at the start of the April 30 event. This produced hundreds of thousands of "Too many open files" errors from ShardDatastore, ClusterConfiguration, RaftServer, and CoordinatorImpl objects until InfluxDB was restarted.
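In hindsight, the fix is cheap. Here's a rough sketch of what we'd check and change ahead of time, assuming a sysvinit-style setup like ours; paths, usernames, and the exact value are yours to adjust, and note that limits.conf only covers PAM login sessions, so a service started from an init script generally needs its own ulimit call (which is exactly what the 0.8.8 script adds):

# See what limit the running influxdb process actually has:
cat /proc/$(pgrep -o influxdb)/limits | grep "open files"

# Raise it for the service by putting this near the top of the init script,
# before the daemon is started (this is what InfluxDB 0.8.8's own script does):
ulimit -n 65536

# And/or raise the per-user limit for login sessions in /etc/security/limits.conf
# ("influxdb" here is whatever user the daemon runs as):
#   influxdb  soft  nofile  65536
#   influxdb  hard  nofile  65536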
Failure 2: Continuous Query Corruption
The pageview graph that appears behind our stats is powered by a continuous query that sums visits every hour and stores the result so that we can render a graph for every deal we've ever posted. After the failure mentioned in the previous section was resolved, we took a snapshot of Influx and upgraded from 0.8.3 to 0.8.8. When we restarted, however, all of our continuous query data since the time of the crash read "0," giving us an empty pageview graph. The underlying visitor data, though, was still there.
Since continuous queries are never recalculated, we decided the best way to attempt to recover this data would be to drop the continuous query and recreate it, a solution we have used in the past. After doing this, however, we saw a massive spike in our disk access, which prevented us from logging into InfluxDB's virtual machine and caused another stats outage. After allowing the process to run for about half an hour with no apparent progress, we decided to reboot the virtual machine.
Once again, after this, all continuous query points after the time at which the first failure occurred read "0," even though a new one had been generated before the reboot. As new continuous query data was being saved every hour past that point, we allowed the instance to continue in this state to avoid destroying any more data.
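For the curious, the sanity check here is just comparing the rollup series against the raw series over the same window. Here's a rough sketch against the 0.8 HTTP query API as we remember it; the database name "mehstats" and the root credentials are placeholders, so adjust for your own setup:

curl -G 'http://localhost:8086/db/mehstats/series' \
  --data-urlencode 'u=root' --data-urlencode 'p=root' \
  --data-urlencode 'q=select * from visitors.count.1h where time > now() - 1d'

curl -G 'http://localhost:8086/db/mehstats/series' \
  --data-urlencode 'u=root' --data-urlencode 'p=root' \
  --data-urlencode 'q=select count(id) from visitors where time > now() - 1d group by time(1h)'

If the first query comes back all zeroes while the second doesn't, the rollup is what's broken, not the underlying data.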
The query itself did not cause the disk spike; the database shards generated by the query did. Creating a continuous query similar to this:
select count(id) from visitors group by time(1h) fill(0) into visitors.count.1h
results in InfluxDB running this query:
select count(id) from visitors where ((time < 1430424000000000000) AND (time > -6795364578871345152)) group by time(1h) fill(0)
...which proceeded, with our configuration, to create a shard for every single 1-week time span since October 7, 1909. The surge of disk writes brought down our server.
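If you want to watch a shard explosion like this happen (or confirm it after the fact), blunt shell tools are enough. The data directory is the one from our own logs; the layout underneath it differs between versions, so treat this as a sketch rather than gospel:

find /opt/influxdb/shared/data -maxdepth 3 -type d | wc -l   # directory count balloons as shards get created
du -sh /opt/influxdb/shared/data                             # total on-disk size
iostat -x 5                                                  # watch write throughput and %util climb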
Failure 3: Out of memory, complete data loss
From there, things proceeded smoothly for about a day. Then, this happened:
kernel: [91953.441012] influxdb invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
...ending in this verdict:
kernel: [91953.444396] Killed process 1170 (influxdb) total-vm:2205864kB, anon-rss:1465164kB, file-rss:0kB
There's no corresponding entry in the InfluxDB log, other than 8 queries all being started within 4 seconds of each other approximately 4 minutes before the kernel reported the out-of-memory error. When we restarted InfluxDB, none of our data was anywhere to be found. The shard files were still present, but the data was corrupt and rendered inaccessible.
New data, however, continued to write without issue. Even after restarts and reboots, we were able to access all data written after the out of memory crash, but we could never query the data that was written before.
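If you're playing along at home, those kernel lines live in the syslog, and a blunt memory cap in the init script would at least have made the failure mode ours to choose. A sketch, assuming an Ubuntu-style log layout (your paths may differ, and the ulimit value is a placeholder):

grep -iE "oom-killer|killed process" /var/log/kern.log   # find the kernel's verdict after the fact
dmesg | tail -n 50                                       # or check the ring buffer right after a crash

# In the init script, before starting the daemon: cap virtual memory so a
# runaway query fails its allocation instead of getting the whole process shot.
# Size the value (in KB) to your machine.
ulimit -v 1800000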
After several hours of attempting to recover it, we made the decision to restore to a backup we made before upgrading to 0.8.3, as we felt we could not trust the stability of InfluxDB 0.8.x.
Not a failure, but interesting
Starting with a machine boot performed immediately before the upgrade to 0.8.3, we began seeing this entry in the InfluxDB logs on every start of the service (including the final 0.7.3 startup):
[WARN] (wal.(*log).check:70) /opt/influxdb/shared/data/wal/log.49197653 was truncated to 3099900 since the file has a zero size request
This isn't the first time a similar error has appeared in our logs. However, previous occurrences did not show up on every startup, nor did they show the same large number consistently every time (in fact, it was frequently 0).
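If you're curious, the path in that warning comes straight out of our logs, so keeping an eye on it is trivial:

ls -l /opt/influxdb/shared/data/wal/              # is log.49197653 stuck at the same size?
stat /opt/influxdb/shared/data/wal/log.49197653   # compare size and timestamps between restarts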
Conclusion
Given the data observed above and the data we lost to get here, we will no longer be using InfluxDB 0.8.x. The problems and headache are simply not worth the performance improvements. Likewise, when InfluxDB 0.9.0 is eventually released, we will be spending a long time vetting it, making sure it can run stably for an extended period with our entire record history.
We like being fun, but we don't mess around when it comes to our data. The statistical void will remain, but know that it is not something we take pride in.
¹Since the 0.8 series uses a different format from the 0.7 series, this was a prerequisite to installing 0.8.8, which does not have migration code.
- 10 comments, 3 replies
So what time of day was all of this maintenance performed? Which poor sap got the call and had to go fix it? Who made the call to the poor sap? So many questions...
@Willijs3 @harrison caught it while still at work, but at the end of the day. So he and I had a late Friday night in the office, pork chops were cold, many tears.
Thanks for the update. But this was "meh" speak to me. Transparency is important for those who understand it. Which I am sure there are many here.
I'm glad to have read/scanned Part 2.
We would never make an upgrade like that without it being upgraded in sandbox, DEV, SIT, and UAT first. The amount of testing that is done before a production server is upgraded is sort of crazy so we never encounter these kinds of issues.
So, what happens due to all that testing is that we are always several versions behind on everything. For example, we just upgraded all our servers to use MSSQL Server 2008 R2, and both 2012 and 2014 have already been released since then.
We never trust the manufacturer's updates until we put it through integration testing ourselves.
I am sorry for your loss, but a very good read. Thank you.
Thanks for the report! Sorry for the data loss, and the time spent tracking down dead-ends.
I admit it. I'm a geek. A web programmer geek. I get unnaturally excited when I see an OHSHIT report post. Like, new Vacation Movie Trailer excited.
This is everything a failure postmortem report should be.
"The third crash corrupted our data, effectively destroying all of it. The third crash corrupted our data, effectively destroying all of it."
@harrison Meta joke? :D
@snapster for the sake of appearing clever and not just bad at proofreading, I'm going to say yes
So, if I read this right, and I read it until my brain froze up, this basically says you lost count for 2 days. Would it help if we posted that we visited the site? Do a voluntary count?
1.
@Teripie 2!