Borgbase backups have been unavailable for 3 days (borgbase.com)
105 points by smcleod on Aug 13, 2023 | 85 comments


Their current ETA for service restoration is another 2-4 days.

Despite the warning banner, the uptime for customers on the us10 instance (which has been unavailable for almost a week now) is showing at 100%.

They are compensating affected customers by crediting 4 weeks' subscription, but I must say this does make me wonder about their redundancy and recovery architecture.

Opshugs to the folks at Borgbase.

--

> box-us10 offline for storage expansion

> We are currently experiencing a temporary outage on our box-us10 server due to an unplanned expansion process. We added two new hard drives to the server in an effort to enhance capacity and accommodate future growth. However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

> To prioritize the safety and integrity of your existing data, we have decided to keep the box-us10 server offline until the expansion is successfully completed. It's estimated that about a week of downtime will be needed for the expansion. We understand the inconvenience this may cause and sincerely apologize for any disruption to your services.

> Current progress: 26% (Sunday morning UTC)

> Date Created: 2023-08-11 16:17:36 (3 days ago)

> Last Updated: 2023-08-13 18:15:53 (15 hours ago)


> me wonder about their redundancy and recovery architecture

You can pretty much guarantee it's untested and unlikely to work when someone doesn't understand how drive arrays work, or doesn't test expanding drive arrays before doing it in prod...


They messed up and either ran out of disk space or had an array fail. Everything in that statement reads like they are trying to frame a critical production failure as a glitch that occurred during maintenance. This wasn’t about a routine storage expansion. It was an emergency caused by some operational screwup and the only way to fix it is to rebuild the array with new disks.


I had a raid6 array nearly go bad when 2 HDDs called it a day at the same time, years ago. Had to send the entire office of 30 people on unscheduled holiday for two days while the array was rebuilt with new drives. A third drive failing during the rebuild would likely have meant a week off, as the storage would have needed a full rebuild.

I can imagine something similar happening here; however, there should be a failover storage setup if they're selling a service.


In a storage business no less. They literally had one job: to know how to resize arrays, so the rest of us don't have to.


They messed up here in some way, but assuming you know what happened and complaining about their skills is just making stuff up at this point. Unless you can tell us exactly what happened, maybe hold off with that kind of criticism.


> However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

One, it is obviously some flavor of RAID rebalancing; there's no other kind of thing it could be.

Two, they’re admitting they didn’t expect this, that’s not possible if it’s a tested procedure.


Possible, but that's still guessing. Also possible: extra load while adding storage caused multiple drive failures, so suddenly they're without spares and want to finish the migration as soon as possible, so they stopped the customer traffic.

"did not anticipate" can be interpreted in different ways too. It may be "didn't know it could happen", or just "estimated it to be so unlikely that emergency downtime was acceptable as a response".

I've seen outside speculations as an inside person before and they're often very likely to be BS. Let's stick to facts and skip the "obviously" and "not possible" until we learn more. What's happening right now may be their prepared and tested procedure. (Or it may not)


And three, they're admitting they don't back anything up (backup defined as not just slinging bytes somewhere but the ability to restore those bytes and bring a service back up on a timeline that is certainly well under 3 days, with regular testing to understand that timeline and ensure everything actually works).

ps -- it's not about bashing these folks. But this kind of multiple operational incompetence strongly calls into question their ability to safeguard your bytes and produce them when you need them. It's kinda the whole point of the thing.


It's a backup company. I can understand that they don't back up the backups.


If there's one thing you'd expect from "a backup company", it'd be that they're really good at doing backups. :)

That does include their own of course.


It's a backup company, so I would expect them to store 1 copy plus the minimum redundancy to make drive failures predictably unlikely to cause data loss. More than that would be additional features.


By the way, is this a plausible scenario?

(1) Add more drives, start online rebuild

(2) Have one drive fail/URE mid-rebuild

(3) Be forced to switch to offline rebuild because the array was in a strange not-entirely-consistent state even before the failure, even though without the new drives you’d be in a routine online-rebuild situation

If so, possibility (2) could in principle not be detectable in staging, as it could depend on the drive’s age, and having to reread everything—as a rebuild does—is a rather abnormal load. The failure could even predate (1), if it happened to data nobody’d looked at for a long time.

This is less a question about these guys and more in general—is a RAID/ZFS/etc. array/pool/etc. in a vulnerable state after you’ve expanded it?


> and having to reread everything—as a rebuild does—is a rather abnormal load.

Far from being an abnormal load, it should be a cron job. Backups need to be verified. Data at rest needs to be periodically checked for bitrot. Reading everything on the array and verifying parity and any checksums is important preventive maintenance and an early warning system.
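
A minimal sketch of what those jobs could look like, in /etc/cron.d format, assuming a ZFS pool named "tank" and a borg repo at /srv/repos/example (both names hypothetical):

  # Monthly scrub: re-reads every block in the pool and verifies checksums/parity
  0 3 1 * *  root  /sbin/zpool scrub tank
  # Weekly repository check; --verify-data additionally re-reads and verifies every data chunk
  0 4 * * 0  root  /usr/bin/borg check --verify-data /srv/repos/example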


That a decision needs to be made under uncertainty doesn’t relieve one from the necessity of making the decision. That implies estimating an uncertain state of the world and drawing inferences from that, and whether those are framed as “criticism” is irrelevant.

In situations where you could plausibly get more data, it's usually OK to admonish people to just go and get some, because whatever the difficulty is in doing that, judging the current degree of uncertainty is probably even harder. In situations where you couldn't, though (trade or state secrets), there's nothing to do but to use what data you do have and point out your uncertainty carefully.

(This is a very general argument because what you’ve said is a fully general counterargument to any criticism of anyone from anybody but an insider. As for this particular case, GP’s allegation of lacking ops knowledge seems to be on the charitable side to me: an uncharitable one, as stated elsethread, would be that they had one drive too many fail on them, whether due to incompetence or misfortune, and are forced to do an offline rebuild while choosing to lie about it. It’s the lie part that I’d have the greatest problem with, if it’s actually there.)


This is why I ultimately trust my personal data to Apple (and to a lesser and smaller extent, Google). I don't expect either to exactly have my best interest at heart - but I do expect both of them to have basic ops competency.


Large corporations have the technical skills in place, but organizationally they can lock your account away, leaving you with no access and no real avenue to retrieve your data. There are many posts on HN and around the Internet where corporate support had no procedure to give back access to data after they blocked the account. Account bans sometimes happen for unclear or out-of-control reasons (e.g. someone stole your credit card and then used that card with another account, so the company banned all accounts indefinitely). With a smaller organization, you have a better chance that if they have to say goodbye to you, they will do it without unnecessary harm to you.


That would also be a major security concern: people gaining access to accounts by social engineering.


Well, you do at least encrypt your backups yourself, right?


That doesn't prevent the deletion of backups and snapshots, encrypting the encrypted backups for ransom, or using your account for distribution of illegal content or malware.


Sure, but let's try and be a little more realistic and weigh the scenarios accordingly.


Actually, if I must entrust my data to companies like these, where reliability and competence are of utmost importance, I'd rather trust Google and Facebook (much, much before Apple).

If we are talking about privacy (or rather privacy theatre) then yeah Apple, why not.


I pulled down my Google Drive data some time ago with rclone and a bunch of the older files were corrupted.


If you have multiple backups, you don't have to trust anyone.


except yourself !


I wish Apple made a Time Capsule that required an iCloud 2TB subscription to work.

They could have priced it at $299 to include the initial 12 months iCloud subscription.


Sorry, can you expand on that? Why require the sub?


The old Time Capsule isn't really a backup. You need at least two HDDs with BTRFS or ZFS set up as a minimum, and then you will need an offsite backup option: iCloud, Google storage, etc.

If you only go via the iCloud route, you have zero local copies of your data. If you only go local, it isn't a safe enough option. So you end up with a combination of the two.

And Apple really wants services revenue; if it were just a Time Capsule, I doubt Apple would ever make it again.
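
A minimal sketch of the local half of that, assuming ZFS and hypothetical device names:

  # Two-disk mirror: a single drive failure doesn't lose data
  zpool create backup mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
  # Periodically re-read and verify everything at rest
  zpool scrub backup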


Yeah, I wouldn't trust their process at all, if they'd rather have the server down for 1 week instead of aborting and performing a rollback to the last known good state.


I suspect they can't. Abort and rollback, that is. Evidence: it's not rational to eat a week of downtime if you have an alternative.

edit: Which tells you something about their backups, and it ain't great. Or they'd restore onto another box.


Are they using big disks and having to resilver or re-do their RAID setup across the added drives, requiring everything on the disks to be read and re-written?


It would appear to be something like that.

> However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.

This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category. Like, not an accident or a fat finger or a bug, but engineers not understanding the fundamentals of how a system works and having none of the processes in place to do dry runs in non-production, or any of the other things that would protect against something like this. Really questionable to trust an organization that makes a mistake like this.


It's not an org but a one man show. I'm using borgbase myself, really happy with it until now and I have to admit I'm now reconsidering this choice...


One man show has a bus factor of 1. Which is kinda bad for anything critical...


"This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category."

You have no idea what happened.

Protecting against "something like this" often introduces complexity and failure cascades that are much more harmful than a simple system being offline.

In fact, I would consider it a feature if uptime was deliberately sacrificed for simplicity and data integrity.

I wish Manu the very best and will consider it not a failure but a success if this array/subsystem/whatever comes through without data loss.


>unplanned expansion process

Sounds like an Elon Musk euphemism after a bad day at SpaceX.


Someone added disk to an array without understanding what would happen next.


What happens next? Shouldn't the array continue working, albeit with significantly degraded performance?


Sorry, I didn't quite answer your question!

Yes, the expected behaviour would be for the disks to be initialized. Beyond that, it'd be a configuration setting for what to do with new disks. It could be used to extend the existing LUNs, or added as a new pair in a RAID 10 style setup, or added as new members of a different level RAID.

Then it would be up to the sysadmin to extend the LUN, or divide the newly added space among multiple LUNs.

Different arrays have different behaviour with new disks. Some only add to offline LUNs. Some do everything online with lower performance, as you said. Even some super expensive enterprise kit will be very easy to break by adding new disks.
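
To make the "extend the existing array" case concrete for software RAID (where there's no controller doing it for you), a rough sketch with Linux md; device names and counts are hypothetical:

  # Add the new disk, then reshape the array to use it as an active member
  mdadm --add /dev/md0 /dev/sdx
  mdadm --grow /dev/md0 --raid-devices=5
  # The reshape runs online but can take days on large drives; watch progress with
  cat /proc/mdstat
  # Only after the reshape completes do you grow the filesystem on top, e.g.
  resize2fs /dev/md0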


Thanks for the insightful answer!


Well, in my sysadmin days we had test/staging equipment that we'd use to make sure we knew what would happen when a change was made.

But I don't know the details of what really happened, so I can't make a judgement really. If I could edit my previous comment I would, to be less emphatic that someone did something "wrong".

It could have been that this particular disk array had different firmware or was faulty. All we know is that something unexpected happened.


I've loved BorgBase up until this point. I mean, at least they emailed me quickly to tell me, but now I don't have backups. Well, that's not true; I back up to a number of locations and BorgBase is just one of them. But it's a chunk of my redundancy lost, and a huge outage like this for a week makes me wonder what I'm paying for. What if BorgBase WAS my only backup solution and I needed to roll back?

Everyone makes mistakes and I appreciate they were very upfront about it, but yea. Questioning if I'll be an ongoing customer now.


TIL Borgbase. The UI looks great for snapshot-oriented backups, and it's good to see them fund borgbackup development.

rsync.net does not support append-only mode, despite advertising it: https://news.ycombinator.com/item?id=32756653

Unavailability is occasionally expected. Don't rely on a single provider to be your sole backup.

Anyone have any experience with Hetzner Storage Boxes for this?


"rsync.net does not support append-only mode, despite advertising it: https://news.ycombinator.com/item?id=32756653"

This is correct - we believed we had a working recipe that properly sandboxed the 'rclone mount' and 'rclone serve' directives but we couldn't make it work in a way that allowed us to sleep at night.

But now that is all changing...

It appears that we can run rclone serve restic --stdio in a way that doesn't actually serve anything or create sockets, etc., and as soon as we finish our tests we will have it in place and people can lock their accounts with something like this:

restrict,command="rclone serve restic --stdio --append-only backups/my-restic-repo" ssh-rsa ...

... in authorized_keys and achieve proper append-only mode at long last.

Stand by ...
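
(For reference, the client side of that forced command should look roughly like this, going by restic's rclone-backend docs; an unverified sketch, using the repo path from the example above:)

  # The forced command above already pins the repo and --append-only,
  # so the client just opens the SSH connection:
  restic -o rclone.program="ssh user@rsync.net" -r rclone: snapshots
  restic -o rclone.program="ssh user@rsync.net" -r rclone: backup ~/data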


As a customer, that is great to hear (well, read). I was reading about what a restore scenario with Borg would look like if "append only" really matters (so, in a ransomware scenario), and what I've seen makes me rather uncomfortable [1]. I don't want to have to do that in a stressful situation.

Is there a particular channel where you intend to post your findings? :)

[1]: https://borgbackup.readthedocs.io/en/stable/usage/notes.html...


Borgbase is pretty good. I’ve used it for 6 years.

The usage graph on each repo is great. I redid my server this year and was accidentally backing up a stale snapshot every night for one of my repos. It would succeed, but without changes. The flat line for usage tipped me off. I’m not sure I would have noticed otherwise since it was a VM image that I could restore and boot, but it wasn’t super obvious it was full of stale data.
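
(The same signal is visible from the CLI if you don't want to rely on the dashboard; repo URL hypothetical:)

  # A near-zero "deduplicated size" for archive after archive suggests
  # nothing new is actually being backed up
  borg info ssh://user@repo.example.borgbase.com/./repo --last 5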

This downtime isn’t ideal, but, if downtime is needed to preserve data integrity, I’ll take it over the risk of trying to maintain availability.


Been using hetzner storage boxes for borg backup for >5 years. Works without issue for multiple TB of backups. Every maintenance downtime so far has been announced in advance.


The main reason I've avoided Hetzner storage boxes is that they don't accept loading money into the account as prepaid credit for a longer period. It's always a month-by-month payment. Ok, so they do allow it, but only through a bank transfer, which is not easy (or even possible) for those outside the EU. If they allowed this through credit/debit cards or PayPal or some other mechanism, I'd surely try it. Hetzner in general seems to be very conservative in handling (and avoiding) risks.

I’d like to prepay for a year or longer for critical services so that in case something goes wrong (card expired, didn’t notice reminder emails, temporarily off the grid, temporarily incapacitated), things just don’t go poof.


I've got automatic credit card payment set up on one of my Hetzner accounts.


I've tried Borg, but I settled on Kopia and cheap S3-like storage[0].

Everything is automated and runs on a schedule, but when I need some ad-hoc restore, say, of a single snapshot or folder I can always use KopiaUI.

Been quite happy with it.

[0] - actually two providers, for redundancy, plus local physical storage
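
A minimal sketch of that kind of setup, with a hypothetical bucket and placeholder credentials:

  # One-time: create the (client-side encrypted) repository on S3-compatible storage
  kopia repository create s3 --bucket=my-backups --endpoint=s3.example.com --access-key=... --secret-access-key=...
  # Then snapshot on a schedule (cron / systemd timer) and browse via KopiaUI when needed
  kopia snapshot create /home/me/documents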


I’ve had a couple of hetzner storage boxes- generally they’re ok.

Though just this weekend there was some screwiness with sub accounts and ssh keys not working that did resolve itself after 24-48 hours. I didn’t notice anything communicated about it.


Hetzner Storage just lost a couple of snapshots a while ago…

If you want “perfect architecture” and reliability/availability you just need to pay for it (aws/s3). Or make use of multiple providers.


> If you want “perfect architecture” and reliability/availability you just need to pay for it (aws/s3).

I find that really funny. AWS has outages regularly. Whole data centers go dark. Remember the S3 outage because an admin fat-fingered a command and deleted a large number of servers? ...no, people forget very quickly because of the "No one was ever fired for using AWS" mantra.


A vague description or white lie might have been better here - a storage company with a fleet of servers shouldn’t ever be in this situation for the reason given


Personally I appreciate the honesty and look for it in vendors; there are too many that cover up the truth (MS/GitHub etc...).

However I agree that it doesn't look good for a storage company.


I agree with you.

I'd like further transparency on this: why didn't they anticipate that the array would be offline for this expansion? Expanding a storage environment can certainly be a minefield, even for the most experienced. But why was this expansion unexpectedly an offline operation?

I’m not throwing shade here. I’m genuinely curious because I’ve been considering Borg and Borgbase for an upcoming project.


I've been in this exact situation before in a storage expansion.

With enough 22TB spinning drives nowadays you can get into a scenario where, with new data being written into the array and the expansion process going on at the same time, the rebuild essentially will never complete. This is especially true with lower-end CPU servers and without dedicated RAID cards.

It is dangerous because the rebuild stresses the drives, which makes a failure more likely, and a failure during the rebuild process is not great. On top of that, you've got a months-long ETA...

No fun
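
(For the Linux software RAID case at least, you can see and influence why it crawls; a hedged sketch, noting that raising the floor steals I/O from production traffic:)

  # Reshape/resync progress and estimated finish time
  cat /proc/mdstat
  # md throttles rebuilds between these per-device KB/s bounds
  sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
  # Raising the minimum forces the rebuild along at the cost of foreground I/O
  sysctl -w dev.raid.speed_limit_min=50000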


> I've been in this exact situation before in a storage expansion.

Can you tell us what solution that was? Something with ZFS or BTRFS? My experience with classic RAID systems is that you can't expand them without reinitializing them. (But that comes with obvious warnings about data loss.)


This may not be the answer you are looking for but one option is Ceph.

Several cheap, low-power NAS boxes running Linux/Ceph, and throw disks at them.

Depending on the data you store, you can have 3-way replication or Erasure Coding at your preferred risk/performance level.

Disk failures are painless, box failures don't lose data, expansion and re-balancing scale with the number of disks, and pretty quickly the limiting factor is your network (4-5 modern spinning disks can easily saturate a 10G network link).
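
A rough sketch of the erasure-coding option, assuming a small cluster with one OSD per disk (pool and profile names hypothetical):

  # 4+2 profile: data split into 4 chunks plus 2 parity chunks, so any two
  # hosts can fail without data loss (~1.5x raw overhead vs 3x for replication)
  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create backups 64 64 erasure ec-4-2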


Traditional hardware RAID requires you to start over from scratch to expand an array. ZFS has some limited ability to expand an array, but there are still a lot of scenarios where the recommended solution is to start over from scratch. BTRFS has full flexibility for online resizing of arrays, changing between RAID modes, and rebalancing, but its parity RAID modes aren't really trustworthy enough yet for a backup provider to rely on.
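
For reference, the BTRFS online path is roughly (mount point and device hypothetical):

  # Add a device to a mounted filesystem, then spread existing data across all members
  btrfs device add /dev/sdx /mnt/pool
  btrfs balance start /mnt/pool
  # RAID profile conversions can also be done online, e.g.
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool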


Not always true: Dell PERC allows online expansion (you can even change RAID levels on the fly in certain circumstances), but the controller switches from buffering writes to a write-through cache policy during the operation, which further slows things down.


I've used borg with rsync.net for several years for my primary off-site backup and have had fantastic reliability and support from them. They cover the usage of borg on their website[1] and support a range of other backup tools as well (including even just simply copying data over SFTP and relying on automatic ZFS snapshots on their end). I pay something in the region of $100 USD a year for this service.

My secondary backup is borg as well, this one hosted on a Hetzner Storage Box[2]. Having seen some bad practice in this thread, I want to make it clear that I am not duplicating the rsync instance to Hetzner but have created a separate instance entirely to avoid issues with the primary borg instance being replicated. From what I understand this is best practice (although it does mean I have to run the backup process twice--once to rsync.net and then again to Hetzner). Hetzner also provides info[3] on using their service with borg. I pay around 3.50 EUR a month for this service.

I chose rsync.net due to their CEO being active on HN and also because they're based in North America (away from me in the Pacific). Hetzner I chose because they are based in Europe, giving me both operational diversity (operated by two different, unrelated companies) and locational diversity (Western US for rsync.net and Northern Europe for Hetzner). Hetzner's sharp pricing also helped.

I'm sure I looked at BorgBase at one point for my EU-based backup location but something must have put me in Hetzner's direction instead.

I also store photos (not my general backup, for now) in AWS S3 (largely because that makes it easier to deploy my static photo album site, which is itself deployed via S3+CloudFront). I use S3 standard storage (for the time being) for a bucket in Sydney (nearest to my location) and then Glacier storage in Ireland for a bucket which is automatically replicated from my Sydney bucket.

Hopefully all this is enough redundancy! But all up I only spend ~ USD20 or so a month for this peace of mind.

[1]: https://www.rsync.net/products/borg.html [2]: https://www.hetzner.com/storage/storage-box [3]: https://community.hetzner.com/tutorials/install-and-configur...
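
In practice the "run it twice" part is just two independent repositories and two create calls per run; a rough sketch with hypothetical hosts and paths:

  # Two separate repos, initialised and keyed independently (neither is a mirror of the other)
  borg create ssh://user@rsync.net/./backups/main::{hostname}-{now} /home /etc
  borg create ssh://uXXXXX@uXXXXX.your-storagebox.de/./backups/main::{hostname}-{now} /home /etc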


For me everything about rsync.net was great except for the throughput. It didn't matter which continent, isp, or operating system I tested, I couldn't get past single digit Mbps and sometimes had trouble reaching that. Support tried moving me to another server, but the problem persisted. Other than that I was pretty happy, but it was completely infeasible to store PostgreSQL backups there, much less server & laptop backups. Every now and then I consider going back in case they've fixed the root problem.


I'm confused because none of this seems relevant to the comment thread. Thank you for the information anyway.


So a few dozen Gigabytes?


> I’d like further transparency on this: why didn’t they anticipate that the array would be offline for this expansion?

People who can give solid explanations are probably busy right now, and those who aren't can only write what you've already seen.


And it's a practical guarantee that something like this will never happen again. The wise man learns from his mistakes.


That man was not very wise to begin with. I wonder what else he needs to learn about _in prod_.


A white lie? Like what? How would you word it?

You mean to downplay what actually happened?


I guess now is a fun time to describe my cheap backup process. All of this is related to Linux and the backups are fully encrypted client-side. Under the hood `borgbackup` is used, but I use the PikaBackup app (see Flathub) to make it easier to interface with.

First I have a local copy of my data living on my actual hard drive.

Next I use PikaBackup (via borg) to encrypt and sync that to a cloud server I run that has about 200GB of storage for these backups for about $1/mo added cost.

Next I use the Backblaze CLI to synchronize those encrypted backups to B2 using the `b2 sync --delete` command. It runs automatically via cron every night. The costs here are about $0.01 per month since I only get charged for what I actually use.

Backups are pruned and I can easily control the schedule, copies, etc. If I need to recover a file it mounts as a file system that I can easily navigate using any tools I want, including the command line. I can also mount older or different snapshots.
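
The nightly sync step is roughly this, as an /etc/cron.d entry (bucket name and paths hypothetical):

  # Mirror the encrypted borg repo directory to B2, pruning files deleted locally
  30 2 * * *  me  b2 sync --delete /srv/borg-repos b2://my-backup-bucket/borg-repos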


I do something similar, but with restic. I wrote about it here: https://willbush.dev/blog/synology-nas-backup/ if anyone is interested.


I've a similar local / S3 (compat) strategy but use Kopia. I currently only back up remotely to B2 (risk taker!), but could easily and cheaply add redundancy here, e.g. S3 / R2 / random hosting. It's very cheap, ~US$1. The UI and default strategies are perfect for what's required. Completely automatic, unless I need to browse some files, in which case I just open the Kopia UI. I haven't used it, but I understand Kopia also supports rsync.

Edit: previous comment on same (not a shill, I swear!): https://news.ycombinator.com/item?id=34152369


Which cloud server do you use for your backups for the $1/mo in added costs?


Similarly, for my Linux devices I use `borg` to back up to a local NAS, and use `kopia` to have another backup in Google Cloud Storage, as B2 is too slow where I live.

I used to back up to local external drives too, but stopped doing that, since the process was manual and I often forgot about it.


The fact that as a customer that is impacted by this, I only found out about it when my backups and automated test restore failed is worrying.

I get that things happen, but not realising that this was going to be a service impacting event does not inspire continued confidence. Sure, this situation probably won't happen again, but what else don't they understand about their infrastructure?


It looks to me like their US10 is not an AZ but an actual server with a bunch of HBAs and disks. So very much pets, and not only in a single location but possibly in a single rack or even a single box.

You are (maybe) protected against a few disk failures but that's about it.

This FAQ entry seems to confirm this: https://docs.borgbase.com/faq/#which-storage-backend-are-you...


I got an email about it right away and I’ve also been getting warnings (that I configured in the dashboard) for inactivity on a repo that’s affected.


(off topic) I promised to reply to you in another thread, but I can't because replies are locked. feel free to reach out to me if you want - my contact is in my profile


> automated test restore failed

You automatically test restore? That makes sense but I've never heard of that before, can you describe the process?


Not OP, but I would guess it's something like this:

  1. Make a file of random data (e.g. 30MB)
  2. Copy it to a "_reference" file
  3. Upload the file to the backup service
  4. Restore the file from the backup service
  5. Diff the restored file against the reference


Pretty simple, really.

Pick a couple random files that should be in the repo, restore them from a random archive, check the md5sums against the source. If the md5sums don't match (or the file can't be found), something is wrong. I am mainly backing up RAW image files, so they should never change.

Basically...

  TEST_FILE=$(ls -p /source_dir | grep -v / | shuf -n1)
  TEST_ARCHIVE=$(borgmatic -c config.file list | shuf -n1)
  borgmatic extract yada yada yada
  md5sum "/source_dir/$TEST_FILE" restored_file


I don't use borg, but I used duplicity, which offers something like that. The verify operation simulates a backup restore and checks whether the restored files' checksums match those expected from the metadata, and optionally compares against the local files. I use this routinely; it's interesting to see that a local S3 provider can sometimes silently mess up your files.
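
For reference, that check is roughly (target URL hypothetical; --compare-data also diffs against the local files rather than just the metadata):

  # Confirm that what's restorable from the remote matches the local source
  duplicity verify --compare-data s3://s3.example.com/my-bucket/backups /home/me/data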


I used their trial for a bit to test it out with Vorta [1] in a container. Vorta (and Borg) seemed to work fine, until I wanted to restore an archive and I noticed that my recent snapshots were completely empty. Probably because of a misconfiguration on my end though. But it made me look elsewhere. For me backups should be a fire, test and forget solution.

Recently I made the switch to Kopia [2], which seems to have feature parity with Borg (and Restic [3]). It also has a web UI, which is way easier to work with than Vorta. And I can easily view, extract and restore individual files or folders from there. This gave me way more confidence in this solution. The only thing I really miss is that I cannot choose different targets for different paths. For instance, with Borg I was able to back up part of my Docker appdata to an external source, and I haven't found a way to do this with Kopia. Besides that I'm pretty happy with this solution and I would recommend it.

1. https://vorta.borgbase.com/

2. https://kopia.io/

3. https://restic.net/


My borgbase backups have been great, although according to my dashboard I'm on a different server instance.

I don't envy their sysadmin team right now. Good luck, folks.


isn't it one guy??



