Their current ETA for service restoration is another 2-4 days.
Despite the warning banner, the uptime for customers on the us10 instance, which has been unavailable for almost a week now, is showing 100%.
They are compensating affected customers by crediting 4 weeks of subscription, but I must say this does make me wonder about their redundancy and recovery architecture.
Opshugs to the folks at Borgbase.
--
> box-us10 offline for storage expansion
> We are currently experiencing a temporary outage on our box-us10 server due to an unplanned expansion process. We added two new hard drives to the server in an effort to enhance capacity and accommodate future growth. However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.
> To prioritize the safety and integrity of your existing data, we have decided to keep the box-us10 server offline until the expansion is successfully completed. It's estimated that about a week of downtime will be needed for the expansion. We understand the inconvenience this may cause and sincerely apologize for any disruption to your services.
> Current progress: 26% (Sunday morning UTC)
> Date Created: 2023-08-11 16:17:36 (3 days ago)
> Last Updated: 2023-08-13 18:15:53 (15 hours ago)
> me wonder about their redundancy and recovery architecture
You can pretty much guarantee it's untested and unlikely to work when someone doesn't understand how drive arrays work, or doesn't test expanding drive arrays before doing it in prod...
They messed up and either ran out of disk space or had an array fail. Everything in that statement reads like they are trying to frame a critical production failure as a glitch that occurred during maintenance. This wasn’t a routine storage expansion. It was an emergency caused by some operational screwup, and the only way to fix it was to rebuild the array with new disks.
I had a RAID 6 array nearly go bad when two HDDs called it a day at the same time, years ago. I had to send the entire office of 30 people on unscheduled holiday for two days until the array was rebuilt with new drives. A third drive failing during the rebuild would likely have meant a week off, as the storage would have needed a full rebuild.
I can imagine something similar happening here; however, there should be a failover storage setup when you're selling a service.
They messed up here in some way, but assuming you know the sequence of events and complaining about their skills is just making stuff up at this point. Unless you can tell us exactly what happened, maybe hold off on that kind of criticism.
Possible, but that's still guessing. Also possible: extra load while adding storage caused multiple drive failures, so suddenly they're without spares and want to finish the migration as soon as possible, so they stopped the customer traffic.
"did not anticipate" can be interpreted in different ways too. It may be "didn't know it could happen", or just "estimated it to be so unlikely that emergency downtime was acceptable as a response".
I've seen outside speculation as an inside person before, and it's often very likely to be BS. Let's stick to facts and skip the "obviously" and "not possible" until we learn more. What's happening right now may be their prepared and tested procedure. (Or it may not.)
And three, they're admitting they don't back anything up (backup defined not as just slinging bytes somewhere, but as the ability to restore those bytes and bring a service back up on a timeline that is certainly well under 3 days, with regular testing to understand that timeline and ensure everything actually works).
ps -- it's not about bashing these folks. But this kind of repeated operational incompetence strongly calls into question their ability to safeguard your bytes and produce them when you need them. It's kinda the whole point of the thing.
It’s a backup company, so I would expect them to store one copy plus the minimum redundancy needed to make drive failures predictably unlikely to cause data loss. More than that would be additional features.
(3) Be forced to switch to offline rebuild because the array was in a strange not-entirely-consistent state even before the failure, even though without the new drives you’d be in a routine online-rebuild situation
If so, possibility (2) could in principle not be detectable in staging, as it could depend on the drive’s age, and having to reread everything—as a rebuild does—is a rather abnormal load. The failure could even predate (1), if it happened to data nobody’d looked at for a long time.
This is less a question about these guys and more in general—is a RAID/ZFS/etc. array/pool/etc. in a vulnerable state after you’ve expanded it?
> and having to reread everything—as a rebuild does—is a rather abnormal load.
Far from being an abnormal load, it should be a cron job. Backups need to be verified. Data at rest needs to be periodically checked for bitrot. Reading everything on the array and verifying parity and any checksums is important preventive maintenance and an early warning system.
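For example, on a plain Linux box the whole thing can be a couple of cron entries. This is only a sketch; the pool/array names and the schedule are placeholders:

```
# /etc/cron.d/array-scrub -- pool/array names and timing are examples.
# ZFS: read every block and verify it against its checksum, monthly.
0 3 1 * *  root  /usr/sbin/zpool scrub backuppool
# Linux md: kick off a full parity/consistency check, monthly.
0 3 15 * * root  echo check > /sys/block/md0/md/sync_action
```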
That a decision needs to be made under uncertainty doesn’t relieve one from the necessity of making the decision. That implies estimating an uncertain state of the world and drawing inferences from that, and whether those are framed as “criticism” is irrelevant.
In situations where you could plausibly get more data, it’s usually OK to admonish people to just go and get some, because whatever the difficulty of doing that, judging the current degree of uncertainty is probably even harder. In situations where you couldn’t, though (trade or state secrets), there’s nothing to do but use what data you do have and point out your uncertainty carefully.
(This is a very general argument because what you’ve said is a fully general counterargument to any criticism of anyone from anybody but an insider. As for this particular case, GP’s allegation of lacking ops knowledge seems to be on the charitable side to me: an uncharitable one, as stated elsethread, would be that they had one drive too many fail on them, whether due to incompetence or misfortune, and are forced to do an offline rebuild while choosing to lie about it. It’s the lie part that I’d have the greatest problem with, if it’s actually there.)
This is why I ultimately trust my personal data to Apple (and to a lesser and smaller extent, Google). I don't expect either to exactly have my best interest at heart - but I do expect both of them to have basic ops competency.
Large corporations have the technical skills in place, but organizationally they can lock your account away, leaving you with no access and no real avenue to get your data back. There are many posts on HN and elsewhere on the Internet where corporate support had no procedure for restoring access to data after blocking an account. Account bans sometimes happen for unclear reasons or reasons outside your control (e.g. someone stole your credit card and then used it with another account, so the corporation banned all associated accounts indefinitely). With a smaller organization, you have a better chance that, if they have to say goodbye to you, they will do it without unnecessary harm to you.
That doesn’t prevent the deletion of backups and snapshots, your already-encrypted backups being encrypted again for ransom, or your account being used to distribute illegal content or malware.
Actually, if I must entrust my data to companies like these, where reliability and competence are of utmost importance, I’d rather trust Google and Facebook (much, much sooner than Apple).
If we are talking about privacy (or rather privacy theatre) then yeah Apple, why not.
An old Time Capsule isn't really a backup. At a minimum you need two HDDs with BTRFS or ZFS set up, and then you will still need an offsite backup option: iCloud, Google storage, etc.
If you only go the iCloud route, you have zero local copies of your data. If you only go local, it isn't a safe enough option. So you end up with a combination of the two.
And Apple really wanted services revenue; if it were just a Time Capsule, I doubt Apple would ever make one again.
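For the local two-disk tier mentioned above, a minimal sketch (pool name and device IDs are placeholders) would be a ZFS mirror plus periodic scrubs:

```
# Two-disk mirror as the local tier; an offsite copy is still needed on top.
zpool create backuppool mirror \
  /dev/disk/by-id/ata-DISK_SERIAL_A /dev/disk/by-id/ata-DISK_SERIAL_B
zpool scrub backuppool      # run periodically, e.g. monthly via cron
zpool status backuppool     # check for checksum errors or a degraded mirror
```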
Yeah, I wouldn't trust their process at all, if they'd rather have the server down for 1 week instead of aborting and performing a rollback to the last known good state.
Are they using big disks and having to resilver or re-do their RAID setup across the added drives, requiring everything on the disks to be read and re-written?
> However, we did not anticipate that the expansion process would require the server storage to be temporarily offline.
This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category. Not an accident or a fat finger or a bug, but engineers not understanding the fundamentals of how a system worked and having none of the processes in place to do dry runs in non-production, or any of the other things that would protect against something like this. It's really questionable to trust an organization that makes a mistake like this.
"This is embarrassing. Somebody made a big mistake of the “mistakes like this shouldn’t be made” category."
You have no idea what happened.
Protecting against "something like this" often introduces complexity and failure cascades that are much more harmful than a simple system being offline.
In fact, I would consider it a feature if uptime was deliberately sacrificed for simplicity and data integrity.
I wish Manu the very best and will consider it not a failure but a success if this array/subsystem/whatever comes through without data loss.
Yes, the expected behaviour would be for the disks to be initialized. Beyond that, it'd be a configuration setting for what to do with new disks. It could be used to extend the existing LUNs, or added as a new pair in a RAID 10 style setup, or added as new members of a different level RAID.
Then it would be up to the sysadmin to extend the LUN, or divide the newly added space among multiple LUNs.
Different arrays have different behaviour with new disks. Some only add to offline LUNs. Some do everything online with lower performance, as you said. Even some super expensive enterprise kit will be very easy to break by adding new disks.
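With plain Linux md, for comparison, the online path is explicit (array and device names here are examples) and the reshape runs for a long time at reduced performance:

```
# Add a spare and grow the array across it; the reshape happens online.
mdadm --add /dev/md0 /dev/sde
mdadm --grow /dev/md0 --raid-devices=5
cat /proc/mdstat                  # watch reshape progress
resize2fs /dev/md0                # grow the filesystem afterwards (assumes ext4 directly on md0)
```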
Well, in my sysadmin days we would have test/staging equipment, so we could make sure we knew what would happen when a change was made.
But I don't know the details of what really happened, so I can't make a judgement really. If I could edit my previous comment I would, to be less emphatic that someone did something "wrong".
It could have been that this particular disk array had different firmware or was faulty. All we know is that something unexpected happened.
I've loved BorgBase up until this point. I mean at least they emailed me quickly to tell me, but now I don't have backups.
Well, that's not true: I back up to a number of locations and BorgBase is just one of them. But it's a chunk of my redundancy lost, and a huge outage like this, lasting a week, makes me wonder what I'm paying for. What if BorgBase WAS my only backup solution and I needed to roll back?
Everyone makes mistakes and I appreciate they were very upfront about it, but yea. Questioning if I'll be an ongoing customer now.
This is correct - we believed we had a working recipe that properly sandboxed the 'rclone mount' and 'rclone serve' directives but we couldn't make it work in a way that allowed us to sleep at night.
But now that is all changing...
It appears that we can run rclone serve restic --stdio in a way that doesn't actually serve anything or create sockets, etc., and as soon as we finish our tests we will have it in place and people can lock their accounts with something like this:
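Presumably something like the standard authorized_keys forced-command pattern; the key and repo path below are placeholders, and the exact flags are what we're still testing:

```
# ~/.ssh/authorized_keys on the storage side -- sketch only.
restrict,command="rclone serve restic --stdio --append-only ./restic-repo" ssh-ed25519 AAAA...placeholder user@host
```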
As a customer, that is great to hear (well, read). I was reading about what a restore scenario with Borg would look like if "append only" really matters (i.e. in a ransomware scenario), and what I've seen makes me rather uncomfortable [1]. I don't want to have to do that in a stressful situation.
Is there a particular channel where you intend to post your findings? :)
Borgbase is pretty good. I’ve used it for 6 years.
The usage graph on each repo is great. I redid my server this year and was accidentally backing up a stale snapshot every night for one of my repos. It would succeed, but without changes. The flat line for usage tipped me off. I’m not sure I would have noticed otherwise since it was a VM image that I could restore and boot, but it wasn’t super obvious it was full of stale data.
This downtime isn’t ideal, but, if downtime is needed to preserve data integrity, I’ll take it over the risk of trying to maintain availability.
Been using hetzner storage boxes for borg backup for >5 years. Works without issue for multiple TB of backups.
Every maintenance downtime so far has been announced in advance.
The main reason I’ve avoided Hetzner storage boxes is that they don’t accept loading money into the account as prepaid credit for a longer period; it’s always a month-by-month payment. OK, so they do allow it, but only through a bank transfer, which is either not easy or not possible for those outside the EU. If they allowed this through credit/debit cards or PayPal or some other mechanism, I’d surely try it. Hetzner in general seems to be very conservative in handling (and avoiding) risks.
I’d like to prepay for a year or longer for critical services so that in case something goes wrong (card expired, didn’t notice reminder emails, temporarily off the grid, temporarily incapacitated), things just don’t go poof.
I’ve had a couple of Hetzner storage boxes; generally they’re OK.
Though just this weekend there was some screwiness with sub accounts and ssh keys not working that did resolve itself after 24-48 hours. I didn’t notice anything communicated about it.
> If you want “perfect architecture” and reliability/availability you just need to pay for it (aws/s3).
I find that really funny. AWS has outages regularly. Whole data centers go dark. Remember the S3 outage caused by an admin fat-fingering a command and taking out a large number of servers? ...no, people forget very quickly because of the "No one was ever fired for using AWS" mantra.
A vague description or white lie might have been better here - a storage company with a fleet of servers shouldn’t ever be in this situation for the reason given
I’d like further transparency on this: why didn’t they anticipate that the array would be offline for this expansion? Expanding a storage environment can certainly be a mine-filled exercise for even the most experienced. But why was this expansion unexpectedly an offline operation?
I’m not throwing shade here. I’m genuinely curious because I’ve been considering Borg and Borgbase for an upcoming project.
I've been in this exact situation before in a storage expansion.
With enough 22TB spinning drives nowadays, you can get into a scenario where, with new data being written into the array while the expansion process is going on, the rebuild essentially never completes.
This is especially true with lower end CPU servers and without dedicated RAID cards.
It is dangerous because the rebuild stresses the drives, which makes a failure more likely, and a failure during the rebuild process is not great; on top of that you've got a months-long ETA...
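Back-of-the-envelope, with assumed numbers rather than anything from this incident: just reading one 22TB drive end to end at an optimistic ~200MB/s already takes more than a day.

```
# 22 TB sequential pass at ~200 MB/s sustained, with no competing I/O:
echo '22 * 10^12 / (200 * 10^6) / 3600' | bc -l    # ~30.6 hours per drive, best case
# Live customer writes competing for the same spindles can cut the effective
# rebuild rate to a small fraction of that, which is how ETAs stretch to months.
```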
> I've been in this exact situation before in a storage expansion.
Can you tell us what solution that was? Something with ZFS or BTRFS? My experience with classic RAID systems is that you can't expand them without reinitializing them. (But that comes with obvious warnings about data loss.)
This may not be the answer you are looking for but one option is Ceph.
Several cheap, low-power NAS boxes running Linux/Ceph and throw disk at them.
Depending on the data you store, you can have 3-way replication or erasure coding at your preferred risk/performance level.
Disk failures are painless, box failures don't lose data, expansion and re-balancing scale with the number of disks, and pretty quickly the limiting factor is your network (4-5 modern spinning disks can easily saturate a 10G network link).
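A minimal sketch of that (pool names and the erasure-code profile are examples, not a tuned config):

```
# 3-way replicated pool for important data.
ceph osd pool create backups-rep 128 128 replicated
ceph osd pool set backups-rep size 3
# Erasure-coded pool (4 data + 2 parity chunks) for bulk data with lower overhead.
ceph osd erasure-code-profile set k4m2 k=4 m=2
ceph osd pool create backups-ec 128 128 erasure k4m2
```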
Traditional hardware RAID requires you to start over from scratch to expand an array. ZFS has some limited ability to expand an array, but there are still a lot of scenarios where the recommended solution is to start over from scratch. BTRFS has full flexibility for online resizing of arrays and changing between RAID modes and rebalancing the array, but its parity RAID modes aren't really trustworthy enough yet for a backup provider to be relying on.
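For example, the btrfs path is all online (device and mountpoint are examples), though the parity caveat above still applies if you convert to a raid5/6 profile:

```
# Add a device to a mounted btrfs filesystem, then rebalance/convert online.
btrfs device add /dev/sdd /mnt/pool
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool
btrfs balance status /mnt/pool    # the filesystem stays mounted and usable
```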
Not always true: Dell PERC allows online expansion (you can even change RAID levels on the fly in certain circumstances), but the controller switches from buffering writes to a write-through cache policy during the operation, which slows things down further.
I've used borg with rsync.net for several years for my primary off-site backup and have had fantastic reliability and support from them. They cover the usage of borg on their website[1] and support a range of other backup tools as well (including even just simply copying data over SFTP and relying on automatic ZFS snapshots on their end). I pay something in the region of $100 USD a year for this service.
My secondary backup is borg as well, this one hosted on a Hetzner Storage Box[2]. Having seen some bad practice in this thread, I want to make it clear that I am not replicating the rsync.net repo to Hetzner, but have created an entirely separate repo, to avoid the issues that come with replicating the primary borg repo. From what I understand this is best practice (although it does mean I have to run the backup process twice: once to rsync.net and then again to Hetzner). Hetzner also provides info[3] on using their service with borg. I pay around 3.50 EUR a month for this service.
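Concretely, that just means two independently initialised repos and two independent `borg create` runs per night, never copying one repo onto the other provider. The hostnames, paths and passphrase handling below are placeholders, not my real config:

```
# Two separate repos, each created with its own 'borg init'; no replication between them.
export BORG_PASSCOMMAND='cat /root/.borg-passphrase'   # placeholder secret handling
borg create --stats 'ssh://user@primary.example.net/./repo::{hostname}-{now}' /home /etc
borg create --stats 'ssh://user@secondary.example.net/./repo::{hostname}-{now}' /home /etc
```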
I chose rsync.net due to their CEO being active on HN and also because they're based in North America (away from me in the Pacific). Hetzner I chose because they are based in Europe, giving me both operational diversity (two unrelated companies) and locational diversity (Western US for rsync.net, Northern Europe for Hetzner). Hetzner's sharp pricing also helped.
I'm sure I looked at BorgBase at one point for my EU-based backup location but something must have put me in Hetzner's direction instead.
I also store photos (not my general backup, for now) in AWS S3 (largely because that makes it easier to deploy my static photo album site, which is itself deployed via S3+CloudFront). I use S3 standard storage (for the time being) for a bucket in Sydney (nearest to my location) and then Glacier storage in Ireland for a bucket which is automatically replicated from my Sydney bucket.
Hopefully all this is enough redundancy! But all up I only spend ~ USD20 or so a month for this peace of mind.
For me everything about rsync.net was great except for the throughput. It didn't matter which continent, isp, or operating system I tested, I couldn't get past single digit Mbps and sometimes had trouble reaching that. Support tried moving me to another server, but the problem persisted. Other than that I was pretty happy, but it was completely infeasible to store PostgreSQL backups there, much less server & laptop backups. Every now and then I consider going back in case they've fixed the root problem.
I guess now is a fun time to describe my cheap backup process. All of this is related to Linux and the backups are fully encrypted client-side. Under the hood `borgbackup` is used, but I use the PikaBackup app (see Flathub) to make it easier to interface with.
First I have a local copy of my data living on my actual hard drive.
Next I use PikaBackup (via borg) to encrypt and sync that to a cloud server I run that has about 200GB of storage for these backups for about $1/mo added cost.
Next I use the Backblaze B2 CLI to synchronize those encrypted backups to B2 using the `b2 sync --delete` command. It runs automatically via cron every night. The costs here are about $0.01 per month, since I only get charged for what I actually use.
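The cron entry is roughly this (paths, bucket name and schedule are made up for illustration):

```
# /etc/cron.d/offsite-b2 -- mirror the encrypted borg repo to B2 every night at 03:30.
30 3 * * * backup  b2 sync --delete /srv/backups/borg-repo b2://my-backup-bucket/borg-repo >> /var/log/b2-sync.log 2>&1
```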
Backups are pruned and I can easily control the schedule, copies, etc. If I need to recover a file it mounts as a file system that I can easily navigate using any tools I want, including the command line. I can also mount older or different snapshots.
I've a similar local / S3 (compat) strategy but use Kopia. I currently only back up remotely to B2 (risk taker!), but could easily and cheaply add redundancy here, e.g. S3 / R2 / random hosting. It's very cheap, ~US$1. UI and default strategies are perfect for what's required. Completely automatic, unless I need to browse some files, in which case I just open the Kopia UI. I haven't used it, but understand Kopia also supports rsync.
Similarly, for my Linux devices I use `borg` to back up to a local NAS, and use `kopia` to keep another backup in Google Cloud Storage, as B2 is too slow where I live.
I used to have back up on local external drives too, but stopped doing that, since the process was manual and I often forgot about it.
The fact that, as a customer impacted by this, I only found out about it when my backups and automated test restore failed is worrying.
I get that things happen, but not realising that this was going to be a service impacting event does not inspire continued confidence. Sure, this situation probably won't happen again, but what else don't they understand about their infrastructure?
It looks to me like their US10 is not an AZ but an actual server with a bunch of HBAs and disks. So very much pets, and not only in a single location but possibly in a single rack or even a single box.
You are (maybe) protected against a few disk failures but that's about it.
(off topic) I promised to reply to you in another thread, but I can't because replies are locked. feel free to reach out to me if you want - my contact is in my profile
Not OP, but I would guess it's something like this:
1. Make an e.g. 30MB file of random data
2. Copy it to "_reference" file
3. Upload the file to backup service
4. Restore the file from backup service
5. Diff restored file against reference
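A rough script version of those steps, assuming borg and an already-initialised repo (the repo URL and paths are placeholders):

```
#!/bin/sh
# Hypothetical restore smoke test: random file -> backup -> restore -> diff.
set -eu

REPO='ssh://user@backup.example.net/./test-repo'   # placeholder; created earlier with 'borg init'
WORK=$(mktemp -d)
RESTORE=$(mktemp -d)

# 1+2: make a 30 MB random file and keep a reference copy.
dd if=/dev/urandom of="$WORK/probe.bin" bs=1M count=30 status=none
cp "$WORK/probe.bin" "$WORK/probe.reference"

# 3: upload it.
borg create "$REPO::probe-$(date +%Y%m%d%H%M%S)" "$WORK/probe.bin"

# 4: restore the newest archive into a scratch directory.
ARCHIVE=$(borg list --last 1 --format '{archive}' "$REPO")
( cd "$RESTORE" && borg extract "$REPO::$ARCHIVE" )

# 5: diff the restored file against the reference (borg strips the leading '/').
cmp "$WORK/probe.reference" "$RESTORE$WORK/probe.bin" && echo "restore OK"
```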
Pick a couple random files that should be in the repo, restore them from a random archive, check the md5sums against the source. If the md5sums don't match (or the file can't be found), something is wrong. I am mainly backing up RAW image files, so they should never change.
I don't use borg, but I used duplicity, which offers something like that. The verify operation simulates a backup restore, comparing whether the restored file's checksum matches the one expected from the metadata, and optionally against the local file. I use this routinely; it's interesting to see that a local S3 provider can sometimes silently mess up your files.
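Roughly like this (the remote URL and local path are placeholders):

```
# Verify the archive against its metadata, and with --compare-data also
# against the current local files.
duplicity verify --compare-data sftp://user@backup.example.net/backups /home/me/photos
```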
I used their trial for a bit to test it out with Vorta [1] in a container. Vorta (and Borg) seemed to work fine, until I wanted to restore an archive and I noticed that my recent snapshots were completely empty. Probably because of a misconfiguration on my end though. But it made me look elsewhere. For me backups should be a fire, test and forget solution.
Recently I made the switch to Kopia [2], which seems to have feature parity with Borg (and Restic [3]). It also has a web UI, which is way easier to work with than Vorta. And I can easily view, extract and restore individual files or folders from there. This gave me way more confidence in this solution. The only thing I really miss is that I cannot choose different targets for different paths. For instance, with Borg I was able to back up part of my Docker appdata to an external source, and I haven't found a way to do this with Kopia. Besides that I'm pretty happy with this solution and I would recommend it.