How I almost lost my backups

My backup strategy follows the 3-2-1 backup rule:

  • Three copies of your data
  • On two different media
  • One copy off-site

Some people argue that your production data (the copy stored on your laptop, for example) counts as one of those copies. I disagree. While it may be a copy of your current data, it's not a copy of your historical data. If I need to recover a file that I modified or deleted 2 years ago, my laptop won't have it.

Device backups

I use Restic to write encrypted backups to a 2.5" SSD inside an external USB enclosure. This disk also contains a few scripts to automate the process. Friday is backup day: connect the disk, run the script, done.
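
The Friday script itself is nothing fancy. A minimal sketch might look like this (the repository path, folders, and retention policy are placeholders, not my actual values):

```shell
#!/bin/sh
# Minimal sketch of a weekly restic backup to an external disk.
# All paths and retention values below are placeholders.

backup_to_usb() {
  repo=/mnt/backup/restic-repo                      # the external USB SSD
  export RESTIC_PASSWORD_FILE="$HOME/.config/restic/password"

  restic -r "$repo" backup "$HOME/Documents" "$HOME/Pictures" &&
    restic -r "$repo" forget --keep-weekly 8 --keep-monthly 24 --prune
}

# Usage, once the disk is mounted:
#   backup_to_usb
```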

Using rsync, I make copies of this master disk:

  • To another USB 2.5" hard drive located at the office. This is a physical copy I can easily access if my house burns down. Restic's encryption is vital here, as coworkers have access to the area where this disk is kept.

  • To a dedicated server at Hetzner. A few friends have SSH access1 to the account that stores this copy. This provides another "worst-case scenario" path to my data.
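
In script form, the replication step is little more than this (destination host and all paths are placeholders):

```shell
#!/bin/sh
# Sketch of replicating the master backup disk with rsync.
# Destination host and all paths are placeholders.

replicate_master() {
  src=/mnt/backup/restic-repo

  # Second USB drive, kept at the office (plugged in for the occasion).
  rsync -a "$src/" /mnt/office-disk/restic-repo/

  # Dedicated server, reached over SSH.
  rsync -a "$src/" backup@server.example.com:restic-repo/
}
```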

Every few months I run a validation script on my Hetzner server. It runs restic check --read-data on the copy stored there. A few days later, I know whether the data is still intact. While not exactly the same as a full restore test, it gives me some peace of mind.
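
The validation boils down to a single restic invocation; a sketch (repository path and password file location are placeholders):

```shell
#!/bin/sh
# Sketch of the periodic validation on the remote server.
# Repository path and password file location are placeholders.

validate_repo() {
  export RESTIC_PASSWORD_FILE="$HOME/.restic-password"

  # --read-data fetches and verifies every pack file, which is why this
  # takes days on a large repository; a plain "check" only verifies the
  # repository structure.
  restic -r /home/backup/restic-repo check --read-data
}
```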

Additional backup for documents

Important folders on my desktop and laptop are synced using Syncthing. In addition to my personal devices, these also replicate to one of my servers. From there, an hourly Restic snapshot is made to Backblaze B2. In case my 3-2-1 drives fail, I have one last chance to recover my important documents here.
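
The hourly job is a plain restic backup against the B2 backend; roughly like this (bucket name, paths, and credential handling are placeholders):

```shell
#!/bin/sh
# Sketch of the hourly snapshot to Backblaze B2.
# Bucket name, paths, and credential values are placeholders.

hourly_b2_snapshot() {
  export B2_ACCOUNT_ID="..."            # real values come from a secrets file
  export B2_ACCOUNT_KEY="..."
  export RESTIC_PASSWORD_FILE=/etc/restic/password

  restic -r b2:my-bucket:documents backup /srv/syncthing/Documents
}
```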

Post mortem

I had convinced myself my backup scheme was solid, until it wasn't.

Suddenly, the data validation on the Hetzner server failed.2 Restic saves backups as deduplicated chunks in what it calls "pack files." Several of these files no longer matched their checksums. Since pack files are named after their SHA-256 hash, I checked them again manually. Sadly, sha256sum confirmed the worst: the data in these files was not what it was supposed to be.
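
Because each pack file is named after the SHA-256 hash of its contents, a manual integrity check is just a small loop; something like this (the example path is a placeholder):

```shell
#!/bin/sh
# Sketch: verify that every pack file in a restic repository still matches
# the SHA-256 hash it is named after. The example path is a placeholder.

verify_packs() {
  repo=$1
  corrupt=0
  for f in $(find "$repo/data" -type f); do
    want=$(basename "$f")
    got=$(sha256sum "$f" | awk '{print $1}')
    if [ "$want" != "$got" ]; then
      echo "corrupt pack: $f"
      corrupt=$((corrupt + 1))
    fi
  done
  echo "$corrupt corrupt pack file(s)"
}

# Usage: verify_packs /mnt/backup/restic-repo
```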

This server has two 4TB disks in RAID 1. I assumed there was a hardware issue with the server, but everything looked fine: the other services ran normally, the disks reported no errors, and the RAID was healthy.

Then I realized: the Hetzner server was not the source of corruption. Checking the pack files on my USB disk quickly revealed the truth. These files were also corrupt, and rsync had dutifully replicated the corrupted data to my server.

It quickly became obvious what was happening. During additional checks on my USB disk, my laptop started reporting I/O errors. The SSD in this enclosure had been failing silently for a while, returning random data instead of my files. I hadn't noticed because I rarely read from these backups myself; the data is usually only read by rsync to push it to the server.

An hour later, the SSD died completely.

Two days later, when I was back at the office, I checked the second USB drive. I had little hope. I usually sync both the Hetzner server and this disk at the same time. As expected, the same errors were present.

I had managed to corrupt all three copies of my backups simultaneously.

Recovery

There is no magic recovery procedure here. If all copies of a piece of data are gone, they are gone forever. Regrettably, I did lose some data that day.

However, Restic was quite helpful during the recovery process:

  • A nice side-effect of the data deduplication is that you can "import" files again. By creating a temporary snapshot with files gathered from other locations (like old laptop folders or emails), Restic recreates the deduplicated chunks. This makes those files usable again in the older, previously "corrupt" snapshots!

  • Dry-running restic repair snapshots --forget listed exactly which files would be removed due to missing data. Use this as a to-do list to gather missing files.

  • Part of the missing data turned out to be logfiles I didn't care about. Purging these from snapshots made it a lot easier to identify which important files were actually missing.3
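
Putting those pieces together, the recovery flow looked roughly like this (repository path and the folder of re-gathered files are placeholders):

```shell
#!/bin/sh
# Sketch of the recovery steps. Repository path and the folder of
# re-gathered files are placeholders.

recover_missing_data() {
  repo=/mnt/backup/restic-repo

  # 1. See which files would be dropped from snapshots because their
  #    chunks are missing; this doubles as a to-do list.
  restic -r "$repo" repair snapshots --forget --dry-run

  # 2. Back up copies of those files gathered from elsewhere; restic
  #    recreates the missing chunks, healing the older snapshots.
  restic -r "$repo" backup /tmp/regathered-files

  # 3. Rewrite the snapshots that still reference unrecoverable data,
  #    forgetting the damaged originals.
  restic -r "$repo" repair snapshots --forget
}
```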

In the end, I lost only three potentially important files: pictures. Since the data is gone, I don't know exactly what was in them, but I know which albums they belonged to. My backup disk now contains a text file listing those three missing filenames as a permanent reminder.

Preventing this in the future

Clearly, I needed safeguards. I made two specific changes to the script I use to sync backups from my local USB SSD to the other destinations:

  1. Before executing rsync, the script now runs restic check. While a full --read-data takes too long for a weekly routine, a basic check catches structural errors. Before my failing SSD died completely, this basic check reported errors too. I'm not sure it would have caught the SSD returning small bits of random data, though.

  2. Except for the locks directory, restic never overwrites files. It only creates new ones. Old files stay until you run restic forget to clean up your repository. I now run rsync with the --ignore-existing flag, as there is no reason to overwrite remote files.
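
Sketched out, the hardened sync script now looks something like this (repository path and destination host are placeholders):

```shell
#!/bin/sh
# Sketch of the sync script with both safeguards added.
# Repository path and destination host are placeholders.

sync_with_safeguards() {
  repo=/mnt/backup/restic-repo

  # Safeguard 1: abort if the local repository has structural errors.
  restic -r "$repo" check || return 1

  # Safeguard 2: never overwrite files that already exist remotely;
  # apart from locks, restic only ever creates new files.
  rsync -a --ignore-existing "$repo/" backup@server.example.com:restic-repo/
}
```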

I'm confident that these changes would prevent data loss if a drive fails silently again.

Once hard drive prices stabilize, I'll likely buy another external drive for a yearly "cold" archive. The hardest files to replace were those removed from my devices years ago. Having an additional offline drive with an old copy would allow me to recover those files.

  1. Not all of them know about it though.

  2. I would never have discovered this if it weren't for the regular data check. Backups are not backups if you don't check them.

  3. Figuring out what data I had lost took a few days, and I didn't really take notes. Documenting the commands I used would have been the most interesting part of this article. Oh well…