OHSHIT Report — InfluxDB crashed and took three days of visitor stats with it


When something goes terribly wrong here at Mediocre Laboratories, we draft up internal memos called "OHSHIT REPORTS". We use OHSHIT REPORTS to share details across the company about what went wrong, to coordinate how we're going to fix it across teams, and to dive deep into root causes.

Part 1: What Happened...

The bad news comes first. All of the visitor stats for April 30, 2015, as well as some of the stats for April 29 and most of the stats for May 1, are gone. Vanished. Lost to time and space, known only to the alternate universes in which our statistics database didn't crash.

Background

As previously mentioned here, we use InfluxDB to store and add up the counts of each and every visit we get. After upgrading to version 0.8.3¹ on April 29, InfluxDB crashed at 11 PM CDT and stayed down through the night, causing a loss of visitor data. After the site was brought back up and the original issue fixed on April 30, we upgraded InfluxDB to version 0.8.8. On May 1 at 5:30 PM, InfluxDB crashed again, taking all the visitor records we had stored with it.

Fortunately, we were able to restore a backup made while version 0.7.3 was still installed, at the cost of losing all the data we had successfully collected between the backup's creation and the restore.

There are some vital bits missing from this picture. Namely, all our visitor stats.

Visible effects

For those of you who do go back and look through our deal history, you'll notice that some things are off. For April 30, visitor-related stats are completely gone. As for April 29 and May 1, there are stats, but they're not quite right. There's a relatively small piece missing on the 29th, but the 1st only has about two hours' worth of data, which means the "didn't bite" percentage and visitor counts are both much too low, and the referral statistics are only valid for referrals during that period.

In case you weren't sure, this graph shouldn't look like this.

How does something like this even happen?

We'll go into more detail in part 2, but the problem boils down to resource usage. InfluxDB 0.8 retrieves data much more quickly than InfluxDB 0.7. However, we did not anticipate the level at which the new version of Influx consumed resources and were caught unprepared when, much to our chagrin, it ran out of memory. The third crash corrupted our data, effectively destroying all of it.

We wanted to get data quickly. We quickly wound up with no data. Such is the cost of hubris.

Part 2: ...And Why

If you're not interested in the intense technical details, you should stop reading here and scroll down to the comments. However, if things like "ulimit" and "lock file" mean anything to you, trudge on, dear reader.

Failure 1: Too many open files

The installation guide for InfluxDB 0.8 recommends either removing the open files limit in Linux or setting it to 65536. In fact, Influx 0.8.8 does this on every single run of the init script, no matter what you're doing:

$ sudo service influxdb stop
Setting ulimit -n 65536
influxdb process was stopped [ OK ]
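If you're wondering how to tell whether a running daemon actually picked up a higher limit, the kernel will tell you directly. Here's a rough sketch; the limits.conf entries are just one conventional way to make the change stick for the influxdb user (they aren't from the Influx docs, and since limits.conf only applies to PAM login sessions, having the init script call ulimit itself, like 0.8.8 does above, is the more reliable route):

# what limit did the running process actually get?
$ grep 'Max open files' /proc/$(pgrep -x influxdb)/limits

# one conventional way to raise it for the influxdb user
$ echo 'influxdb soft nofile 65536' | sudo tee -a /etc/security/limits.conf
$ echo 'influxdb hard nofile 65536' | sudo tee -a /etc/security/limits.conf
$ sudo service influxdb restart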

Influx 0.8.3's init script, however, does nothing of the sort, and the migration guide from 0.7 to 0.8, which we were working from, omits this recommendation. We use InfluxDB's UDP interface along with our home-grown node module to send visitor data rapid-fire to InfluxDB with minimal overhead for the site. Normally, this is great. However, the new sharded method InfluxDB uses for writes in 0.8 opened files past our limit (1024 at the time) during the surge of visits at the start of the April 30 event. This produced hundreds of thousands of "Too many open files" errors from ShardDatastore, ClusterConfiguration, RaftServer, and CoordinatorImpl objects, continuing until InfluxDB was restarted.
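For context, those rapid-fire writes are just small JSON payloads thrown at InfluxDB's UDP listener. Stripped of the node module, a single write boils down to something like this sketch, where the port is whatever you've set under input_plugins.udp in config.toml (4444 here is an assumption) and the columns are purely illustrative:

# one visitor record, in the same JSON shape as the 0.8 HTTP write body,
# fired at the UDP listener (port 4444 is assumed, not a guarantee)
$ echo '[{"name":"visitors","columns":["id","referrer"],"points":[[12345,"direct"]]}]' \
    | nc -u -w1 localhost 4444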

Failure 2: Continuous Query Corruption

The pageview graph that appears behind our stats is powered by a continuous query that sums visits every hour and stores the result so that we can render a graph for every deal we've ever posted. After the failure mentioned in the previous section was resolved, we took a snapshot of Influx and upgraded from 0.8.3 to 0.8.8. When we restarted, however, all of our continuous query data since the time of the crash read "0," giving us an empty pageview graph. The underlying visitor data, though, was still there.

A lone peak

Since continuous queries are never recalculated, we decided the best way to attempt to recover this data would be to drop the continuous query and recreate it, a solution we have used in the past. After doing this, however, we saw a massive spike in our disk access, which prevented us from logging into InfluxDB's virtual machine and caused another stats outage. After allowing the process to run for about half an hour with no apparent progress, we decided to reboot the virtual machine.

This kind of spike (1.12 MB/s) is what we in the industry refer to as "bananas."

Once again, after this, all continuous query points read "0" after the time at which the first failure occurred, even though a new one was generated before the reboot. As new continuous query data was being saved every hour past that, we allowed the instance to continue in this state to avoid the further destruction of data.
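For reference, the drop-and-recreate step goes through the same HTTP query endpoint as any other query in 0.8. A sketch of what that looks like, with the database name, credentials, and continuous query id all standing in as placeholders:

# find the id of the continuous query
$ curl -G 'http://localhost:8086/db/mydb/series?u=root&p=root' \
    --data-urlencode 'q=list continuous queries'

# drop it by id, then re-issue the select ... into ... to recreate it
$ curl -G 'http://localhost:8086/db/mydb/series?u=root&p=root' \
    --data-urlencode 'q=drop continuous query 1'
$ curl -G 'http://localhost:8086/db/mydb/series?u=root&p=root' \
    --data-urlencode 'q=select count(id) from visitors group by time(1h) fill(0) into visitors.count.1h'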

The query itself did not cause the disk spike; rather, the database shards generated by the query did. Creating a continuous query similar to this:

select count(id) from visitors group by time(1h) fill(0) into visitors.count.1h

results in InfluxDB running this query:

select count(id) from visitors where ((time < 1430424000000000000) AND (time > -6795364578871345152)) group by time(1h) fill(0)

...which proceeded, with our configuration, to create a shard for every single 1-week time span since October 7, 1909. The surge of disk writes brought down our server.
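To put a rough number on that, here's a quick back-of-the-envelope count of 1-week spans between that date and the day of the crash (GNU date, since it has to handle pre-1970 timestamps):

# how many 1-week shards a backfill from 1909-10-07 to 2015-05-01 implies
$ start=$(date -u -d '1909-10-07' +%s)
$ end=$(date -u -d '2015-05-01' +%s)
$ echo $(( (end - start) / (7 * 24 * 3600) ))   # => a bit over 5,500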

Failure 3: Out of memory, complete data loss

From there, things proceeded smoothly for about a day. Then, this happened:

kernel: [91953.441012] influxdb invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

...ending in this verdict:

kernel: [91953.444396] Killed process 1170 (influxdb) total-vm:2205864kB, anon-rss:1465164kB, file-rss:0kB
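If you ever need to dig the same verdict out of one of your own machines, the OOM killer leaves its tracks in the kernel ring buffer and in syslog (the path below assumes a Debian-style /var/log/syslog). Roughly:

# the kernel's side of the story
$ dmesg | grep -iE 'oom-killer|killed process'
$ grep -i 'invoked oom-killer' /var/log/syslog

# and how much memory the process is holding right now
$ grep -E 'VmPeak|VmRSS' /proc/$(pgrep -x influxdb)/status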

There's no corresponding entry in InfluxDB's own log, other than the fact that 8 queries were all started within 4 seconds of each other approximately 4 minutes before the kernel reported the out-of-memory error. When we restarted InfluxDB, none of our data was anywhere to be found. The shard files were still present, but the data was corrupt and inaccessible.

New data, however, continued to be written without issue. Even after restarts and reboots, we were able to access all data written after the out-of-memory crash, but we could never query the data written before it.

After several hours of attempting to recover it, we made the decision to restore to a backup we made before upgrading to 0.8.3, as we felt we could not trust the stability of InfluxDB 0.8.x.

Not a failure, but interesting

Ever since a machine reboot that was performed immediately before the upgrade to 0.8.3, we have seen this entry in the InfluxDB logs on every start of the service (including the final 0.7.3 startup):

[WARN] (wal.(*log).check:70) /opt/influxdb/shared/data/wal/log.49197653 was truncated to 3099900 since the file has a zero size request

This isn't the first time a similar error has appeared in our logs. However, previous occurrences did not show up on every startup, nor did they consistently report the same large number every time (in fact, it was frequently 0).
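If you're curious whether your own install is doing the same thing, the WAL directory named in the warning is easy to eyeball, and the warning itself is easy to count in whatever file your [logging] section points at (the log path below is just an example, not a guaranteed default):

# the WAL files the warning is complaining about
$ ls -l /opt/influxdb/shared/data/wal/

# how many times the truncation warning has fired
$ grep -c 'was truncated to' /opt/influxdb/shared/log.txt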

Conclusion

Given the data observed above and the data we lost to get here, we will no longer be using InfluxDB 0.8.x. The problems and headache are simply not worth the performance improvements. Likewise, when InfluxDB 0.9.0 is eventually released, we will be spending a long time vetting it, making sure it can run stably for an extended period with our entire record history.

We like being fun, but we don't mess around when it comes to our data. The statistical void will remain, but know that it is not something we take pride in.

¹Since the 0.8 series uses a different storage format from the 0.7 series, upgrading to 0.8.3 first was a prerequisite to installing 0.8.8, which does not include the migration code.