Better, Faster, More Secure Backups With Restic
It’s been a while since I’ve written on this blog and I am resolving (pre New Year’s!) to write more. I’m starting with something easy: a backup setup I put in place while I’ve been on paternity leave. A few highlights of my current setup:
- encrypted at rest + nice threat model
- incremental snapshots
- deduplication of data
- written in Golang (++ style points)
- support for tons of backends
- ⛅️ multi-cloud ⛅️
Let’s just get it out of the way: if you’re not backing up your files to some cloud provider or at least a VCS like Git, stop reading and do that right now. This blog’s readership will likely skew towards the technically-inclined, for whom this likely isn’t a problem. But just in case: make sure your stuff is backed up somewhere besides your computer, and hopefully not just on an external drive.
Anyways. If you’re like me, you probably use Google Drive, Box, Dropbox, Backblaze, or even just an S3 bucket to keep a copy of your important digital files. And if you’re also like me, you have a system that works and rarely think about improving it or messing with it once you’ve established it and spent some time fine-tuning it. See, a few years ago I consolidated the files (papers from college, repos, etc.) that I had spread across different computers and different storage providers into one place to keep things simple and in view. This has worked well for me and I haven’t had any issues with my setup. I aimed for:
- something that runs in the background so my system isn’t vigilance-based
- durable outside my own hardware
- in at least two clouds
- low cost if possible
- safe at rest
I achieved this in part with a combination of Google Drive sync on my laptop and a monthly zip dump to S3. This has worked pretty well: I haven’t lost any files, and I feel pretty good about both my storage bill and the integrity of my files. There are some cons to this approach, but it worked well enough for my use case.
So why do anything differently? For one, I’ve had some more time late at night to tinker and so that’s what I do. But I’d also read some stories about people whose stuff had been removed from GDrive due to policies on what people were storing. To be fair, I don’t have anything that would qualify for this situation (at least that I know of), but the notion that my “safest place” for my files wasn’t necessarily safe bothered me and if a better way came along I’d happily embrace it. I also don’t love the idea of all my personal files getting indexed to squeeze ad dollars out of me.
I forget whether it was Twitter, Hacker News, a podcast (Go Time, maybe? they had the creator on to discuss and it was a great episode), or somewhere else where I first heard about Restic. Simply put, Restic is:
…a backup program which allows saving multiple revisions of files and directories in an encrypted repository stored on different backends.
A couple things about Restic stood out to me. For one, it was written in Go and I was really curious about Go at the time (I still am!). But what really caught my eye was the combination of a strong threat model/encryption story with support for many different backends. This meant I could more comfortably store my files in a storage provider that wasn’t Google and have my files encrypted in a way I was comfortable with. It was also at least more likely that I’d be insulated from a provider-initiated purge/removal etc.
After some reading and looking into Restic, I was able to determine that my previous criteria (run in the background, safe at rest, multi-cloud, etc.) would be satisfied and I’d likely see improvements in at least a few of them (cost, insulation from the provider, etc.). Restic’s docs mention ease of use, speed, verifiability, security, and efficiency as goals - all of which sounded great to me!
I was even more hooked after listening to the Go Time episode with Alexander Neumann (creator of Restic). The episode is really worth a listen. Alexander talks about some of his goals for Restic, why he created it, and some interesting aspects of Restic’s design. In fact, the design document is worth reading from a technical perspective since it’s full of good tidbits and more detailed info about Restic’s design if you’re interested. I won’t go into much detail on that front since the docs are so good, but a few quick things:
Restic repos have the following structure:
```
├── config
├── data
│   ├── 00
│   ├── 01
│   ├── 0a
│   ├── 0b
│   ├── 0c
│   ├── 0d
│   ├── 0e
│   ├── 0f
│   ├── 1a
│   └── ...
├── index
├── keys
│   └── 12345678103958105980198
├── locks
└── snapshots
```
- blobs are stored in packs (in the `data` directory)
- snapshots are what you’d think they’d be
- the `locks` directory is used for creating locks on the repo when updates are occurring

One thing to note: the `data` dir contains deduplicated, encrypted data and not all your raw files. This means that if an attacker got access to my S3 bucket, they’d have a bunch of garbage data that they couldn’t do anything with (except maybe get the size of all my backups). This also means that a cloud provider would have less ability (in theory) to infer what you’re storing. Cool!
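You can see this for yourself by listing the repo’s objects directly. A quick sketch, assuming the AWS CLI is configured and using my bucket name (`my-backup-bucket`):

```shell
# List a slice of the repo's data/ prefix -- object names are
# content hashes and the contents are encrypted pack files,
# so there's nothing readable here even with bucket access
aws s3 ls s3://my-backup-bucket/data/00/
```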
Restic & ⛅️ setup
Check out the installation page and get the binary for your particular platform. Once you’ve got that, you’ll need to pick a backend. I use S3 since I’m really familiar with AWS and S3 costs can be kept reallllly low (more on that below). You can choose from a ton of backends though and don’t need to pick S3. Most blob-storage-oriented providers (like Google Cloud Platform or Azure) have a similar setup. You can choose from (some are via rclone):
- local drive
- Amazon Drive
- Amazon S3
- Backblaze B2
- DigitalOcean Spaces
- Google Cloud Storage
- Google Drive
- IBM COS S3
- Memset Memstore
- Microsoft Azure Blob Storage
- Microsoft OneDrive
- Openstack Swift
- Oracle Cloud Storage
- Rackspace Cloud Files
- Yandex Disk
- The local filesystem
If you can’t find an option there…I can’t help you 🤷
I’m using S3, so I created a bucket (`my-backup-bucket`) with object versioning and default encryption turned on (because why not), and made sure it wasn’t public. I also created an IAM user with access to the bucket and grabbed the keys (if you’re familiar with AWS, you’ve probably done this many times).
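If you’d rather script the bucket setup than click through the console, a rough AWS CLI equivalent might look like this (a sketch, not the exact commands I ran):

```shell
# assumes us-east-1; other regions also need --create-bucket-configuration
aws s3api create-bucket --bucket my-backup-bucket

# turn on object versioning
aws s3api put-bucket-versioning --bucket my-backup-bucket \
  --versioning-configuration Status=Enabled

# default server-side encryption
aws s3api put-bucket-encryption --bucket my-backup-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# make sure nothing in the bucket can be made public
aws s3api put-public-access-block --bucket my-backup-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```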
I initialized my repo by running:
```shell
$ export AWS_ACCESS_KEY_ID=<MY_ACCESS_KEY>
$ export AWS_SECRET_ACCESS_KEY=<MY_SECRET_ACCESS_KEY>
$ restic -r s3:s3.amazonaws.com/my-backup-bucket init
```
Because I wanted something highly redundant, I set up bucket replication so that my primary repo is replicated to another bucket, which in turn has its own replication to a third bucket. Each replica is set to transition the storage class to Infrequent Access, and then to Glacier in the final replica. The primary bucket is configured to use S3 Intelligent-Tiering to help keep costs down intra-bucket. This way, I have my data in multiple regions (redundancy) and keep costs reasonable, all without active management on my part.
local -> S3 bucket -> (bucket replication) -> other S3 bucket in a different region, lower storage cost -> (bucket replication) -> other S3 bucket with Glacier storage class
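The storage-class transitions are just S3 lifecycle configuration. A sketch for one of the replica buckets (the bucket name, rule ID, and 30-day window are illustrative):

```shell
# move replica objects to Infrequent Access after 30 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backup-bucket-replica \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "transition-to-ia",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}]
    }]
  }'
```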
But that’s not enough (it is though), so I also have a monthly cron that mirrors the repo to another cloud provider. However, I noticed that this was costing a decent amount in AWS because of the bandwidth cost of ~150GB going out from S3. I avoided this by setting up a separate repo in a different cloud provider, aliasing it locally, and backing up to it directly, less often.
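For a second-provider repo, restic just needs the right repository string and credentials. A sketch assuming Backblaze B2 as the second provider (the bucket name and credential placeholders are hypothetical):

```shell
# restic reads B2 credentials from these env vars
export B2_ACCOUNT_ID=<MY_B2_ACCOUNT_ID>
export B2_ACCOUNT_KEY=<MY_B2_ACCOUNT_KEY>

# one-time init, then back up directly -- no S3 egress involved
restic -r b2:my-b2-backup-bucket:restic-repo init
restic -r b2:my-b2-backup-bucket:restic-repo backup ./
```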
As a last piece, I initialized a repo on a physical drive that I back up to every once in a while, keeping the drive in a secure location. Paranoid, probably, but at least I’m likely to have my data in the event of an internet outage.
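The local-drive repo works the same way, just with a filesystem path as the repository. A sketch assuming the drive mounts at `/Volumes/backup-drive` (a hypothetical mount point):

```shell
restic -r /Volumes/backup-drive/restic-repo init
restic -r /Volumes/backup-drive/restic-repo backup ./
```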
We’re missing something though: backups without having to remember to run a command. No problem thanks to `crontab`. On my Mac I ran `crontab -e` to edit the crontab file and added:
```shell
0 * * * * /bin/sh scripts/backup.sh
```
…which runs my backup script…
```shell
#!/bin/zsh
# cron runs with a minimal environment, so AWS credentials and the
# repo password (RESTIC_PASSWORD) need to be exported here or sourced from a file
restic -r s3:s3.amazonaws.com/my-backup-bucket --verbose backup ./
```
note: I exclude some things like `node_modules`, `Library`, and `.docker` directories since those are all HUGE and not something I care to back up
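Those excludes are just flags on the backup command. One way to do it, using the directory names from my setup:

```shell
restic -r s3:s3.amazonaws.com/my-backup-bucket backup ./ \
  --exclude node_modules \
  --exclude Library \
  --exclude .docker
```

restic also supports `--exclude-file` if the list gets long.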
And so we get hourly backups! Restic will deduplicate data thanks to its use of content-defined chunking, so new backups of similar or mostly unchanged data don’t take up nearly as much space as they might with other solutions. This blog post and this repo have more info on how content-defined chunking is implemented in Restic if you’re curious.
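You can watch the deduplication work: each backup run reports how much new data was actually added, and a couple of read-only commands summarize the repo. For example:

```shell
# list the snapshots in the repo
restic -r s3:s3.amazonaws.com/my-backup-bucket snapshots

# actual deduplicated size of the blobs stored in the repo
restic -r s3:s3.amazonaws.com/my-backup-bucket stats --mode raw-data
```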
So, with Restic in place I’ve achieved:
- fast backups (thanks to Go & restic’s design!)
- multi-AZ and multi-cloud
- deduplicated data
- same lack of active maintenance
- encrypted at rest
- able to keep a physical copy in a secure location as a last resort
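One thing worth adding to that list: periodically proving the backups are actually restorable. A quick sketch (the `/tmp/restore-test` target is arbitrary):

```shell
# verify repo integrity
restic -r s3:s3.amazonaws.com/my-backup-bucket check

# restore the most recent snapshot somewhere disposable and spot-check it
restic -r s3:s3.amazonaws.com/my-backup-bucket restore latest --target /tmp/restore-test
```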
I’m pretty happy with this so far and hope you try it out, too! Get started at https://restic.net