
The amount of data this thing will be putting out every night is insane. For years now the community has been building the infrastructure to consume it efficiently for useful science, but we still have work to do. Anyone interested in the problem of pipelining and distributing tens of TB of data a night should check out the LSST and related GitHub organizations.



I've followed this project for over a decade and the amount of data they are moving around is fairly routine, given their budget size and access to computing and networking resources. The total storage (~40-50 PB) is pretty large, but moving 10 TB around the world isn't special engineering at this point.


It's not just about the size of the data in bytes; it's also about the number of changes that need to be detected and the alerts that need to be sent out (estimated at millions a night). Keep in mind the downstream consumers of this data are mostly small scientific outfits with extremely limited software engineering budgets.
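
For a sense of what consuming that alert stream looks like on the small-outfit side, here's a minimal sketch of a Kafka consumer; streams like this are typically distributed as Avro packets over Kafka. The broker address, topic name, and field names below are illustrative placeholders, not the real endpoints or schema.

    # Minimal alert-stream consumer sketch (placeholder broker/topic/schema).
    import io
    import fastavro
    from confluent_kafka import Consumer

    def handle_interesting_alert(alert):
        """Placeholder for your own science logic (cross-matching, follow-up triggers, ...)."""
        print(alert.get("alertId"))

    consumer = Consumer({
        "bootstrap.servers": "alerts.example.org:9092",  # placeholder broker
        "group.id": "my-small-science-group",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["example-alert-topic"])  # placeholder topic

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            # Assumes the Avro schema is embedded in the payload (container format);
            # a schema-registry setup would resolve the schema separately.
            alert = next(fastavro.reader(io.BytesIO(msg.value())))
            # Filter down to what your science case actually needs,
            # e.g. only candidates brighter than some magnitude (field names assumed).
            if alert.get("candidate", {}).get("magpsf", 99.0) < 19.0:
                handle_interesting_alert(alert)
    finally:
        consumer.close()

The filtering step is exactly the problem: a small group can't store or react to millions of alerts a night, so the first thing everyone needs is a broker/filter layer in front of the raw stream.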


Again, nothing special. The small outfits aren't going to be doing the critical processing.


…they do the science


I've worked on quite a few large-scale scientific collaborations like this (and also worked with, and talked to, the lead scientists of LSST), and typically the end groups that do science aren't the ones handling the massive infrastructure. That work usually goes to well-funded sites with strong infrastructure, which then provide straightforward ways for the smaller science groups to operate on the bits of data they care about.

Here's the canonical example: https://home.cern/science/computing/grid and a lab that didn't have enough horsepower using a different grid: https://osg-htc.org/spotlights/new-frontiers-at-thyme-lab.ht...

Personally, I have pointed the grid folks (I used to work on grid) towards the cloud, and many projects like this have a Tier 1 in the cloud. The data lives in S3, the metadata in a database, and notifications go through the cloud provider's notification system. The scientists work in adjacent AWS accounts that have access to those systems and can move data pretty quickly.
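
To make that concrete, here's a rough boto3 sketch of the pattern; the bucket, queue URL, and event wiring (S3 event notifications delivered straight to SQS) are assumptions for illustration, not the project's actual setup.

    # Sketch: data lands in S3, S3 event notifications go to an SQS queue,
    # and a consumer in an adjacent account picks up new objects.
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-exposures"  # placeholder

    def process(data: bytes):
        """Placeholder for an actual pipeline step."""
        print(f"got {len(data)} bytes")

    def poll_for_new_data():
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                # Cross-account read, assuming the bucket policy grants s3:GetObject.
                obj = s3.get_object(Bucket=bucket, Key=key)
                process(obj["Body"].read())
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])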


The difference with this project is that the data from Rubin itself isn't where most of the scientific value comes from; it's from follow-up observations. Coordinating multiple observatories, all with varying degrees of programmatic access, in order to get timely observations is a challenge. But hey, if you insist on being an "everything is easy" Andy, I won't bother anymore.


If you’re dealing with a fairly constant amount of data every day for years, using the cloud will be way more expensive than necessary.
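
For a sense of scale, here's the back-of-envelope that usually drives that conclusion; the nightly volume, observing nights, and per-TB price are illustrative round numbers, not real quotes.

    # Rough storage-only estimate for a constant nightly data volume (illustrative numbers).
    TB_PER_NIGHT = 20           # order-of-magnitude nightly volume (assumption)
    NIGHTS_PER_YEAR = 300       # usable observing nights (assumption)
    USD_PER_TB_MONTH = 23.0     # ballpark standard object-storage rate (assumption)
    YEARS = 10

    total_tb = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS   # ~60 PB accumulated
    # Data landed in year y is billed for every remaining month of the survey.
    tb_months = sum(TB_PER_NIGHT * NIGHTS_PER_YEAR * (YEARS - y) * 12 for y in range(YEARS))
    print(f"~{total_tb / 1000:.0f} PB total, ~${tb_months * USD_PER_TB_MONTH / 1e6:.0f}M storage alone")

Archival tiers, egress, and compute change the exact numbers a lot, but the shape of the argument is why steady, predictable volumes tend to favor owned hardware.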


The whole thread comes off as an AWS sales pitch...


I've set up and built my own machines and clusters, as well as grids and industrial-scale infrastructure. I've seen many closet clusters, and clusters administered by grad students. Since then, I've gone nearly 100% cloud (with a strong preference for AWS).

In my experience, there are many tradeoffs to using the cloud, but when you consider the entire context (people, cost, time, productivity), AWS ends up being a very powerful way to implement scientific infrastructure. However, in consortia like this, it's usually architected so that people with local infrastructure (campus clusters, colo) can contribute, although they tend to be "leaf" nodes in the processing pipelines rather than central players.


Why move the data? Why not just enable permissions on cloud sharing a la Snowflake or Iceberg?


Sure, that also works, although it often leads to problems around cost, scalability, and environment customization.
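
For what it's worth, the "share in place" version of this on AWS can be as small as a bucket policy granting a collaborator's account read access; the account ID and bucket name below are placeholders.

    # Sketch: grant another AWS account read access to the data bucket instead of copying it.
    import json
    import boto3

    BUCKET = "survey-data-release"  # placeholder bucket name

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCollaboratorRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::210987654321:root"},  # collaborator account (placeholder)
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }],
    }

    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

The cost question is partly about who pays for the reads; S3's Requester Pays setting pushes request and transfer charges onto the account doing the reading.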


Is this not the same problem high-resolution spy satellites have? Seems like a fair bit of crossover at least?


Spy sats are more bandwidth- and power-constrained. For low Earth orbit, you also can't usually offload data over the target.


> For low Earth orbit, you also can't usually offload data over the target.

That capability is coming with Starlink laser modules. They've already tested this on a Dragon mission, and they have the links working between some satellite shells, so you'd be able to offload data from pretty much everywhere Starlink has a presence.


Vera Rubin is producing ~4 Gbps constantly. Just dealing with the heat to send that much data is highly nontrivial.
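
For context, that rate lines up with the nightly totals upthread (the observing window here is an assumption):

    # ~4 Gbps sustained over an assumed ~10-hour observing night
    gbps = 4
    hours = 10
    tb_per_night = gbps / 8 * 3600 * hours / 1000   # Gb/s -> GB/s, times seconds, GB -> TB
    print(f"~{tb_per_night:.0f} TB per night")      # ~18 TB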


Yep, the data engineering side of this is just as fascinating as the astronomy.



