Borgbase backups have been unavailable for 3 days

smcleod · on Aug 13, 2023

Their current ETA for service restoration is another 2-4 days.

Despite the warning banner the uptime for customers on us10 instance which has been unavailable for almost a week now is showing at 100%.

They are compensating affected customers by crediting 4 weeks subscription but I must say this does make me wonder about their redundancy and recovery architecture.

Opshugs to the folks at Borgbase.

--

> box-us10 offline for storage expansion

> We are currently experiencing a temporary outage on our box-us10 server due to an unplanned expansion process. We added two new hard drives to the server in an effort to enhance capacity and accommodate future growth. However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

> To prioritize the safety and integrity of your existing data, we have decided to keep the box-us10 server offline until the expansion is successfully completed. It's estimated that about a week of downtime will be needed for the expansion. We understand the inconvenience this may cause and sincerely apologize for any disruption to your services.

> Current progress: 26% (Sunday morning UTC) > Date Created: 2023-08-11 16:17:36 (3 days ago) > Last Updated: 2023-08-13 18:15:53 (15 hours ago)

x0x0 · on Aug 14, 2023

> me wonder about their redundancy and recovery architecture

You can pretty much guarantee it's untested and unlikely to work, when someone doesn't understand how drive arrays work, or test expanding drive arrays before doing it in prod...

jm4 · on Aug 14, 2023

They messed up and either ran out of disk space or had an array fail. Everything in that statement reads like they are trying to frame a critical production failure as a glitch that occurred during maintenance. This wasn’t about a routine storage expansion. It was an emergency caused by some operational screwup and the only way to fix it is to rebuild the array with new disks.

jskrablin · on Aug 14, 2023

I had a raid6 array going almost bad when 2 HDDs called it a day at the same time years ago. Had to send the entire office of 30 people to unscheduled holiday for two days before the array was rebuilt with new drives. Third drive failing during the rebuild would likely result in a week off as the storage would need a full rebuild.

I can imagine something similar happening here, however there should be a fail over storage setup in case they're selling a service.

boring_twenties · on Aug 14, 2023

In a storage business no less. They literally had one job: to know how to resize arrays, so the rest of us don't have to.

viraptor · on Aug 14, 2023

They messed up here in some way, but assuming you know what events and complaining about their skills is just making stuff up at this point. Unless you can tell us exactly what happened, maybe hold off with that kind of criticism.

colechristensen · on Aug 14, 2023

> However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

One, it is obviously some flavor of raid rebalancing, there’s not a different kind of thing it could be.

Two, they’re admitting they didn’t expect this, that’s not possible if it’s a tested procedure.

viraptor · on Aug 14, 2023

Possible, but that's still guessing. Also possible: extra load while adding storage caused multiple drive failures, so suddenly they're without spares and want to finish the migration as soon as possible, so they stopped the customer traffic.

"did not anticipate" can be interpreted in different ways too. It may be "didn't know it could happen", or just "estimated it to be so unlikely that emergency downtime was acceptable as a response".

I've seen outside speculations as an inside person before and they're often very likely to be BS. Let's stick to facts and skip the "obviously" and "not possible" until we learn more. What's happening right now may be their prepared and tested procedure. (Or it may not)

x0x0 · on Aug 14, 2023

And three, they're admitting they don't back anything up (backup defined as not just slinging bytes somewhere but the availability to restore those bytes and bring a service back up on a timeline that is certainly well lower than 3 days. With regular testing to understand the timeline and ensure everything actually works.)

ps -- it's not about being bashing these folks. But these type of multiple operational incompetencies strongly call into question their ability to safeguard your bytes and produce them when you need them. It's kinda the whole point of the thing.

wmf · on Aug 14, 2023

It's a backup company. I can understand that they don't back up the backups.

justinclift · on Aug 14, 2023

If there's one thing you'd expect from "a backup company", it'd be that they're really good at doing backups. :)

That does include their own of course.

colechristensen · on Aug 14, 2023

It’s a backup company so I would expect the store 1 copy plus the minimum redundancy to make drive failures a predictable amount of unlikely to cause data loss. More than that would be additional features.

mananaysiempre · on Aug 14, 2023

By the way, is this a plausible scenario?

(1) Add more drives, start online rebuild

(2) Have one drive fail/URE mid-rebuild

(3) Be forced to switch to offline rebuild because the array was in a strange not-entirely-consistent state even before the failure, even though without the new drives you’d be in a routine online-rebuild situation

If so, possibility (2) could in principle not be detectable in staging, as it could depend on the drive’s age, and having to reread everything—as a rebuild does—is a rather abnormal load. The failure could even predate (1), if it happened to data nobody’d looked at for a long time.

This is less a question about these guys and more in general—is a RAID/ZFS/etc. array/pool/etc. in a vulnerable state after you’ve expanded it?

wtallis · on Aug 15, 2023

> and having to reread everything—as a rebuild does—is a rather abnormal load.

Far from being an abnormal load, it should be a cron job. Backups need to be verified. Data at rest needs to be periodically checked for bitrot. Reading everything on the array and verifying parity and any checksums is important preventive maintenance and an early warning system.

mananaysiempre · on Aug 14, 2023

That a decision needs to be made under uncertainty doesn’t relieve one from the necessity of making the decision. That implies estimating an uncertain state of the world and drawing inferences from that, and whether those are framed as “criticism” is irrelevant.

In situations where you could plausibly get more data, it’s usually OK to admonish people to just go and get some, because whatever the difficulty is doing that, judging the current degree of uncertaintly is probably even harder. In situations where you couldn’t, though (trade or state secrets), there’s nothing to do but to use what data you do have and point out your uncertainty carefully.

(This is a very general argument because what you’ve said is a fully general counterargument to any criticism of anyone from anybody but an insider. As for this particular case, GP’s allegation of lacking ops knowledge seems to be on the charitable side to me: an uncharitable one, as stated elsethread, would be that they had one drive too many fail on them, whether due to incompetence or misfortune, and are forced to do an offline rebuild while choosing to lie about it. It’s the lie part that I’d have the greatest problem with, if it’s actually there.)

TylerE · on Aug 14, 2023

This is why I ultimately trust my personal data to Apple (and to a lesser and smaller extent, Google). I don't expect either to exactly have my best interest at heart - but I do expect both of them to have basic ops competency.

adobrawy · on Aug 14, 2023

Large corporations have the technical skills in place, but organizationally they can lock your account away leaving you with no access and no real avenue to evict your data. There are many posts on HN and on the Internet where corporate support had no procedure to give back access to data when they blocked the account. Account bans are sometimes for unclear or out of control reasons (e.g. someone stole your credit card and then used that card with a another account, so the corporate banned all accounts indefinitely). In the case of a smaller organization, you have a better chance that if they have to say goodbye to you, they will do it without unnecessary harm to you.

aborsy · on Aug 14, 2023

That would also be a major security concern: people gaining access to accounts by social engineering .

tjoff · on Aug 14, 2023

Well, you do at least encrypt your backups yourself, right?

aborsy · on Aug 14, 2023

That doesn’t prevent the deletion of backups and snapshots, encrypting the encrypted backups for a ransom, and using your account for distribution of the illegal content or malware.

tjoff · on Aug 14, 2023

Sure, but lets try and be a little more realistic and weigh the scenarios accordingly.

crossroadsguy · on Aug 14, 2023

Actually if I must entrust my data to companies like these where reliability and competence is of utmost importance - I’d rather trust Google and Facebook (much much before Apple).

If we are talking about privacy (or rather privacy theatre) then yeah Apple, why not.

chillfox · on Aug 14, 2023

I pulled down my Google Drive data some time ago with rclone and a bunch of the older files were corrupted.

istjohn · on Aug 14, 2023

If you have multiple backups, you don't have to trust anyone.

greenie_beans · on Aug 14, 2023

except yourself !

ksec · on Aug 14, 2023

I wish Apple made a Time Capsule that requires iCloud 2TB Subscription to work.

They could have priced it at $299 to include the initial 12 months iCloud subscription.

Wowfunhappy · on Aug 14, 2023

Sorry, can you expand on that? Why require the sub?

ksec · on Aug 14, 2023

Old Time Capsule isn't really a backup. You either need two HDD with BTRFS or ZFS set up as minimum. And then You will need an offsite backup option. iCloud , Google storage etc.

If you only go via the iCloud route, you have zero local copy of your data. If you only only go via local, it isn't safe enough option. So ending up with a combination of the two.

And Apple really wanted services revenue, if it was only just a Time Capsule I doubt Apple would ever make it again.

dikei · on Aug 14, 2023

Yeah, I wouldn't trust their process at all, if they'd rather have the server down for 1 week instead of aborting and performing a rollback to the last known good state.

x0x0 · on Aug 14, 2023

I suspect they can't. Abort and rollback, that is. Evidence: it's not rational to eat a week of downtime if you have an alternative.

edit: Which tells you something about their backups, and it ain't great. Or they'd restore onto another box.

shrubble · on Aug 14, 2023

Are they using big disks and having to resilver or re-do their RAID setup across the added drives, requiring everything on the disks to be read and re-written?

colechristensen · on Aug 14, 2023

It would appear to be something like that.

> However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category. Like not an accident or fat finger or a bug, but engineers not understanding fundamentals of how a system worked and having none of the processes in place to to dry runs in non production or any of the other things that would protect against something like this. Really questionable to trust an organization that makes a mistake like this.

js4ever · on Aug 14, 2023

It's not an org but a one man show. I'm using borgbase myself, really happy with it until now and I have to admit I'm now reconsidering this choice...

jskrablin · on Aug 14, 2023

One man show has a bus factor of 1. Which is kinda bad for anything critical...

rsync · on Aug 14, 2023

"This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category."

You have no idea what happened.

Protecting against "something like this" often introduces complexity and failure cascades that are much more harmful than a simple system being offline.

In fact, I would consider it a feature if uptime was deliberately sacrificed for simplicity and data integrity.

I wish Manu the very best and will consider it not a failure but a success if this array/subsystem/whatever comes through without data loss.

CamperBob2 · on Aug 14, 2023

>unplanned expansion process

Sounds like an Elon Musk euphemism after a bad day at SpaceX.

bloopernova · on Aug 14, 2023

Someone added disk to an array without understanding what would happen next.

trallnag · on Aug 14, 2023

What happens next? Shouldn't the array continue working, albeit with significantly degraded performance?

bloopernova · on Aug 14, 2023

Sorry, I didn't quite answer your question!

Yes, the expected behaviour would be for the disks to be initialized. Beyond that, it'd be a configuration setting for what to do with new disks. It could be used to extend the existing LUNs, or added as a new pair in a RAID 10 style setup, or added as new members of a different level RAID.

Then it would be up to the sysadmin to extend the LUN, or divide the newly added space among multiple LUNs.

Different arrays have different behaviour with new disk. Some only add to offline LUNs. Some do everything online with lower performance, as you said. Even some super expensive enterprise kit will be very easy to break by adding new disk.

trallnag · on Aug 14, 2023

Thanks for the insightful answer!

bloopernova · on Aug 14, 2023

Well, in my sysadmin days we would have test/staging equipment that we'd make sure we knew what would happen when a change was made.

But I don't know the details of what really happened, so I can't make a judgement really. If I could edit my previous comment I would, to be less emphatic that someone did something "wrong".

It could have been that this particular disk array had different firmware or was faulty. All we know is that something unexpected happened.

muppetman · on Aug 14, 2023

I've loved BorgBase up until this point. I mean at least they emailed me quickly to tell me, but now I don't have backups. Well that's not true, I backup to a number of locations and BorgBase is just one of them. But it's a chink of my redundancy lost and a huge outage like this for a week makes me wonder what I'm paying for. What if BorgBase WAS my only backup solution and I needed to roll back?

Everyone makes mistakes and I appreciate they were very upfront about it, but yea. Questioning if I'll be an ongoing customer now.

countula · on Aug 14, 2023

TIL Borgbase. The UI looks great for snapshot-oriented backups, and it's good to see them fund borgbackup development.

rsync.net does not support append-only mode, despite advertising it: https://news.ycombinator.com/item?id=32756653

Unavailability is occasionally expected. Don't rely on a single provider to be your sole backup.

Any one have any experience with Hetzner Storage Boxes for this?

rsync · on Aug 14, 2023

"rsync.net does not support append-only mode, despite advertising it: https://news.ycombinator.com/item?id=32756653"

This is correct - we believed we had a working recipe that properly sandboxed the 'rclone mount' and 'rclone serve' directives but we couldn't make it work in a way that allowed us to sleep at night.

But now that is all changing...

It appears that we can run rclone serve restic --stdio in a way that doesn't actually serve anything or create sockets, etc., and as soon as we finish our tests we will have it in place and people can lock their accounts with something like this:

restrict,command="rclone serve restic --stdio --append-only backups/my-restic-repo" ssh-rsa ...

... in authorized_keys and achieve proper append-only mode at long last.

Stand by ...

btschaegg · on Aug 14, 2023

As a customer, that is great to hear (well, read). I was reading about how a restore scenario with Borg would look like if "append only" really matters (so in a ransomware scenario), and what I've seen makes me rather uncomfortable [1]. I don't want to have to do that in a stressful situation.

Is there a particular channel where you intend to post your findings? :)

[1]: https://borgbackup.readthedocs.io/en/stable/usage/notes.html...

ryan29 · on Aug 14, 2023

Borgbase is pretty good. I’ve used it for 6 years.

The usage graph on each repo is great. I redid my server this year and was accidentally backing up a stale snapshot every night for one of my repos. It would succeed, but without changes. The flat line for usage tipped me off. I’m not sure I would have noticed otherwise since it was a VM image that I could restore and boot, but it wasn’t super obvious it was full of stale data.

This downtime isn’t ideal, but, if downtime is needed to preserve data integrity, I’ll take it over the risk of trying to maintain availability.

exitheone · on Aug 14, 2023

Been using hetzner storage boxes for borg backup for >5 years. Works without issue for multiple TB of backups. Every maintenance downtime so far has been announced in advance.

AnonHP · on Aug 14, 2023

The main reason I’ve avoided Hetzner storage box is because they don’t accept loading money into the account as a prepaid credit for a longer period. It’s always a month-by-month payment. Ok, so they do allow it, but only through a bank transfer, which is either not easy or possible for those outside the EU. If they allowed this through credit/debit cards or PayPal or some other mechanism, I’d surely try it. Hetzner in general seems to be very conservative in handling (and avoiding) risks.

I’d like to prepay for a year or longer for critical services so that in case something goes wrong (card expired, didn’t notice reminder emails, temporarily off the grid, temporarily incapacitated), things just don’t go poof.

wiredfool · on Aug 14, 2023

I’ve got an auto credit card setup on one of my hetzner accounts.

rvieira · on Aug 14, 2023

I've tried Borg but I settled with Kopia and a[0] cheap S3-like storage.

Everything is automated and runs on a schedule, but when I need some ad-hoc restore, say, of a single snapshot or folder I can always use KopiaUI.

Been quite happy with it.

[0] - actually two providers, for redundancy, plus local physical storage

wiredfool · on Aug 14, 2023

I’ve had a couple of hetzner storage boxes- generally they’re ok.

Though just this weekend there was some screwiness with sub accounts and ssh keys not working that did resolve itself after 24-48 hours. I didn’t notice anything communicated about it.

lifeofguenter · on Aug 14, 2023

Hetzner Storage just lost a couple of snapshots a while ago…

If you want “perfect architecture” and reliability/availability you just need to pay for it (aws/s3). Or make use of multiple providers.

noAnswer · on Aug 14, 2023

> If you want “perfect architecture” and reliability/availability you just need to pay for it (aws/s3).

I find that really funny. AWS has outages regularly. Whole data centers who go dark. Remember the S3 outage because an admin fat-fingerly deleted a large number of servers? ...no, people forget very quickly because of the "No one was ever fired for using AWS" mantra.

kiririn · on Aug 14, 2023

A vague description or white lie might have been better here - a storage company with a fleet of servers shouldn’t ever be in this situation for the reason given

smcleod · on Aug 14, 2023

Personally I appreciate the honesty and look for it in vendors, there are too many that cover up the truth (MS/Github etc...).

However I agree that it doesn't look good for a storage company.

TheNewsIsHere · on Aug 14, 2023

I agree with you.

I’d like further transparency on this: why didn’t they anticipate that the array would be offline for this expansion? Expanding a storage environment can certainly be a mine-filled exercise for even the most experienced. But why was this expansion unexpectedly an offline operation?

I’m not throwing shade here. I’m genuinely curious because I’ve been considering Borg and Borgbase for an upcoming project.

SteveNuts · on Aug 14, 2023

I've been in this exact situation before in a storage expansion.

With enough 22TB spinning drives nowadays you can get in to a scenario where, with the new data being written into the array and the expansion process going on at the same time the rebuild essentially will never complete. This is especially true with lower end CPU servers and without dedicated RAID cards.

It is dangerous because the rebuild stresses the drive which makes a failure more likely, and a failure during the rebuild process is not great, on top of that you've got a months long ETA...

No fun

noAnswer · on Aug 14, 2023

> I've been in this exact situation before in a storage expansion.

Can you tell us what solution that was? Something with ZFS or BTRFS? My experience with classic RAID system is, you can't expand them without reinitializing them. (But that comes with obvious warnings about data loss.)

jnsaff2 · on Aug 14, 2023

This may not be the answer you are looking for but one option is Ceph.

Several cheap, low-power NAS boxes running Linux/Ceph and throw disk at them.

Depending on the data you store you can have 3-way replication of Erasure Coding at your preferred risk/performance level.

Disk failures are painless, box failures don't lose data, expansion and re-balancing scales with the number of disks and pretty quickly the limiting factor is your network (4-5 modern spinning disks can easily saturate a 10G network link).

wtallis · on Aug 14, 2023

Traditional hardware RAID requires you to start over from scratch to expand an array. ZFS has some limited ability to expand an array, but there are still a lot of scenarios where the recommended solution is to start over from scratch. BTRFS has full flexibility for online resizing of arrays and changing between RAID modes and rebalancing the array, but its parity RAID modes aren't really trustworthy enough yet for a backup provider to be relying on.

SteveNuts · on Aug 14, 2023

Not always true, Dell PERC allows online expansion (you can even change raid levels on the fly in certain circumstances) but the controller switches from buffering writes to write-through cache policy during the operation which further slows things down.

kiwijamo · on Aug 14, 2023

I've used borg with rsync.net for several years for my primary off-site backup and have had fantastic reliability and support from them. They cover the usage of borg on their website[1] and support a range of other backup tools as well (including even just simply copying data over SFTP and relying on automatic ZFS snapshots on their end). I pay something in the region of $100 USD a year for this service.

My secondary backup is borg as well, this one hosted on a Hetzner Storage Box[2]. Having seen some bad practice in this thread, I want to make it clear that I am not duplicating the rsync instance to Hetzner but have created a separate instance entirely to avoid issues with the primary borg instance being replicated. From what I understand this is best practice (although it does mean I have to run the backup process twice--once to rsync.net and then again to Hetzner). Hetzner also provides info[3] on using their service with borg. I pay around 3.50 EUR a month for this service.

I chose rsync.net due to their CEO being active on HN and also because they're based in North America (away from me in the Pacific). Hetzner , I chose because they are based in Europe giving me both operational diversity (operated by two different unrelated companies) and locational diversity (Western US for rsync.net and Northern Europe for Hetzner. Hetzner sharp pricing also helped as well.

I'm sure I looked at BorgBase at one point for my EU-based backup location but something must have put me in Hetzner's direction instead.

I also store photos (not my general backup, for now) in AWS S3 (largely because that makes it easier to deploy my static photo album site which it itself deployed via S3+CloudFront). I use S3 standard storage (for the time being) for a bucket in Sydney (nearest to my location) and then a glacier storage in Ireland for a bucket which is automatically replicated from my Sydney bucket.

Hopefully all this is enough redundancy! But all up I only spend ~ USD20 or so a month for this peace of mind.

[1]: https://www.rsync.net/products/borg.html [2]: https://www.hetzner.com/storage/storage-box [3]: https://community.hetzner.com/tutorials/install-and-configur...

nieve · on Aug 14, 2023

For me everything about rsync.net was great except for the throughput. It didn't matter which continent, isp, or operating system I tested, I couldn't get past single digit Mbps and sometimes had trouble reaching that. Support tried moving me to another server, but the problem persisted. Other than that I was pretty happy, but it was completely infeasible to store PostgreSQL backups there, much less server & laptop backups. Every now and then I consider going back in case they've fixed the root problem.

HumanOstrich · on Aug 14, 2023

I'm confused because none of this seems relevant to the comment thread. Thank you for the information anyway.

trallnag · on Aug 14, 2023

So a few dozen Gigabytes?

seba_dos1 · on Aug 14, 2023

> I’d like further transparency on this: why didn’t they anticipate that the array would be offline for this expansion?

People who can give solid explanations are probably busy right now, and those who aren't can only write what you already seen.

sparrish · on Aug 14, 2023

And it's a practical guarantee that something like this will never happen again. The wise man learns from his mistakes.

mkoryak · on Aug 14, 2023

That man was not very wise to begin with. I wonder what else he needs to learn about _in prod_.

bartvk · on Aug 14, 2023

A white lie? Like what? How would you word it?

You mean to downplay what actually happened?

ddtaylor · on Aug 14, 2023

I guess now is a fun time to describe my cheap backup process. All of this is related to Linux and the backups are fully encrypted client-side. Under the hood `borgbackup` is used, but I use the PikaBackup app (see Flathub) to make it easier to interface with.

First I have a local copy of my data living on my actual hard drive.

Next I use PikaBackup (via borg) to encrypt and sync that to a cloud server I run that has about 200GB of storage for these backups for about $1/mo added cost.

Next I use the Backblaze Cloud tool to synchronize those encrypted backups to B2 using `b2 sync --delete` command. It runs automatically via cron every night. The costs here are about $0.01 per month since I only get charged what I actually use.

Backups are pruned and I can easily control the schedule, copies, etc. If I need to recover a file it mounts as a file system that I can easily navigate using any tools I want, including the command line. I can also mount older or different snapshots.

willbush · on Aug 14, 2023

I do something similar, but with restic. I wrote about it here: https://willbush.dev/blog/synology-nas-backup/ if anyone is interested.

som · on Aug 14, 2023

I've a similar local / S3 (compat) strategy but use Kopia. I currently only backup remotely to B2 (risk taker!), but could easily and cheaply add redundancy in here .. e.g S3 / R2 / random hosting. It's very cheap ~US$1. UI and default strategies are perfect for what's required. Completely automatic unless I need to browse some files in which case I just open the Kopia UI. I haven't used it, but understand Kopia also supports rsync.

Edit: previous comment on same (not a shill, I swear!): https://news.ycombinator.com/item?id=34152369

rexreed · on Aug 14, 2023

Which cloud server do you use for your backups for the $1/mo in added costs?

dikei · on Aug 14, 2023

Similarly, for my Linux devices I use `borg` to backup to a local NAS, and use `kopia` to have another backup to Google Cloud Storage, as B2 too slow where I live.

I used to have back up on local external drives too, but stopped doing that, since the process was manual and I often forgot about it.

chanandler_bong · on Aug 14, 2023

The fact that as a customer that is impacted by this, I only found out about it when my backups and automated test restore failed is worrying.

I get that things happen, but not realising that this was going to be a service impacting event does not inspire continued confidence. Sure, this situation probably won't happen again, but what else don't they understand about their infrastructure?

jnsaff2 · on Aug 14, 2023

It looks to me that their US10 is not like an AZ but an actual server with bunch of HBA and disks. So very much pets and not only in a single location but possibly in a single rack or even box.

You are (maybe) protected against a few disk failures but that's about it.

This FAQ entry seems to confirm this: https://docs.borgbase.com/faq/#which-storage-backend-are-you...

ryan29 · on Aug 14, 2023

I got an email about it right away and I’ve also been getting warnings (that I configured in the dashboard) for inactivity on a repo that’s affected.

kiwicopple · on Aug 25, 2023

(off topic) I promised to reply to you in another thread, but I can't because replies are locked. feel free to reach out to me if you want - my contact is in my profile

exhilaration · on Aug 14, 2023

> automated test restore failed

You automatically test restore? That makes sense but I've never heard of that before, can you describe the process?

semi-extrinsic · on Aug 14, 2023

Not OP, but I would guess it's something like this:

  1. Make a e.g. 30MB file of random data  
  2. Copy it to "_reference" file  
  3. Upload the file to backup service  
  4. Restore the file from backup service  
  5. Diff restored file against reference

chanandler_bong · on Aug 14, 2023

Pretty simple, really.

Pick a couple random files that should be in the repo, restore them from a random archive, check the md5sums against the source. If the md5sums don't match (or the file can't be found), something is wrong. I am mainly backing up RAW image files, so they should never change.

Basically...

$TEST_FILE=$(ls -p /source_dir | grep -v / | shuf -n1)

$TEST_ARCHIVE=$(borgmatic -c config.file list | shuf -n1)

borgmatic extract yada yada yada

md5sum $TEST_FILE restored_file

adobrawy · on Aug 14, 2023

I don't use borg, but I used duplicity, which offers something like that. The verify operation simulates a backup restore to compare whether the restored file's checksum matches that expected from the metadata and optionally against the local file. I use this routinely, interesting to see that a local S3 provider can sometimes mess up your files silently.

alex3305 · on Aug 14, 2023

I used their trial for a bit to test it out with Vorta [1] in a container. Vorta (and Borg) seemed to work fine, until I wanted to restore an archive and I noticed that my recent snapshots were completely empty. Probably because of a misconfiguration on my end though. But it made me look elsewhere. For me backups should be a fire, test and forget solution.

Recently I made the switch to Kopia [2] which seems to have feature parity with Borg (and Restic [3]). It also has a web UI which is way easier to work with than Vorta. And I can easily view, extract and restore individual files or folders from there. This gave me way more confidence about this solution. The only thing I really miss is that I cannot chose different targets for different paths. For instance, with Borg I was able to backup a partial of my Docker appdata to an external source. And I haven't found a way to do this with Kopia. Besides that I'm pretty happy with this solution and I would recommend it.

1. https://vorta.borgbase.com/

2. https://kopia.io/

3. https://restic.net/

bloopernova · on Aug 14, 2023

My borgbase backups have been great, although according to my dashboard I'm on a different server instance.

I don't envy their sysadmin team right now. Good luck, folks.

bomewish · on Aug 14, 2023

isn't it one guy??