
The amount of data this thing will be putting out every night is insane. For years now the community has been building the infrastructure to consume it efficiently for useful science, but we still have work to do. Anyone interested in the problem of pipelining and distributing tens of TB of data a night should check out the LSST and related GitHub organizations.



I've followed this project for over a decade and the amount of data they are moving around is fairly routine, given their budget size and access to computing and networking resources. The total storage (~40-50 PB) is pretty large, but moving 10 TB around the world isn't special engineering at this point.


It's not just about the size of the data in bytes; it's also about the number of changes that need to be detected and the alerts that need to be sent out (estimated at millions a night). Keep in mind the downstream consumers of this data are mostly small scientific outfits with extremely limited software engineering budgets.
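
For a sense of what consuming that alert stream looks like on the small-outfit side, here's a minimal sketch of a Kafka consumer; streams like this are typically distributed as Avro packets over Kafka. The broker address, topic name, and field names below are illustrative placeholders, not the real endpoints or schema.

    # Minimal alert-stream consumer sketch (placeholder broker/topic/schema).
    import io
    import fastavro
    from confluent_kafka import Consumer

    def handle_interesting_alert(alert):
        """Placeholder for your own science logic (cross-matching, follow-up triggers, ...)."""
        print(alert.get("alertId"))

    consumer = Consumer({
        "bootstrap.servers": "alerts.example.org:9092",  # placeholder broker
        "group.id": "my-small-science-group",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["example-alert-topic"])  # placeholder topic

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            # Assumes the Avro schema is embedded in the payload (container format);
            # a schema-registry setup would resolve the schema separately.
            alert = next(fastavro.reader(io.BytesIO(msg.value())))
            # Filter down to what your science case actually needs,
            # e.g. only candidates brighter than some magnitude (field names assumed).
            if alert.get("candidate", {}).get("magpsf", 99.0) < 19.0:
                handle_interesting_alert(alert)
    finally:
        consumer.close()

The filtering step is exactly the problem: a small group can't store or react to millions of alerts a night, so the first thing everyone needs is a broker/filter layer in front of the raw stream.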


Again, nothing special. The small outfits aren't going to be doing the critical processing.


…they do the science


I've worked on quite a few large-scale scientific collaborations like this (and also worked with, and talked to, the lead scientists of LSST), and typically the end groups that do science aren't the ones handling the massive infrastructure. That work usually goes to well-funded sites with strong infrastructure, which then provide straightforward ways for the smaller science groups to operate on the bits of data they care about.

Here's the canonical example: https://home.cern/science/computing/grid and a lab that didn't have enough horsepower using a different grid: https://osg-htc.org/spotlights/new-frontiers-at-thyme-lab.ht...

Personally, I have pointed the grid folks (I used to work on grid) towards the cloud, and many projects like this have a Tier 1 in the cloud. The data lives in S3, the metadata in a database, and notifications go through the cloud provider's notification system. The scientists work in adjacent AWS accounts that have access to those systems and can move data pretty quickly.
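
To make that concrete, here's a rough boto3 sketch of the pattern; the bucket, queue URL, and event wiring (S3 event notifications delivered straight to SQS) are assumptions for illustration, not the project's actual setup.

    # Sketch: data lands in S3, S3 event notifications go to an SQS queue,
    # and a consumer in an adjacent account picks up new objects.
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-exposures"  # placeholder

    def process(data: bytes):
        """Placeholder for an actual pipeline step."""
        print(f"got {len(data)} bytes")

    def poll_for_new_data():
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                # Cross-account read, assuming the bucket policy grants s3:GetObject.
                obj = s3.get_object(Bucket=bucket, Key=key)
                process(obj["Body"].read())
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])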


The difference with this project is that the data from Rubin itself isn't where most of the scientific value comes from; it's from follow-up observations. Coordinating multiple observatories, all with varying degrees of programmatic access, in order to get timely observations is a challenge. But hey, if you insist on being an "everything is easy" Andy, I won't bother anymore.


If you’re dealing with a fairly constant amount of data every day for years, using the cloud will be way more expensive than necessary.
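
For a sense of scale, here's the back-of-envelope that usually drives that conclusion; the nightly volume, observing nights, and per-TB price are illustrative round numbers, not real quotes.

    # Rough storage-only estimate for a constant nightly data volume (illustrative numbers).
    TB_PER_NIGHT = 20           # order-of-magnitude nightly volume (assumption)
    NIGHTS_PER_YEAR = 300       # usable observing nights (assumption)
    USD_PER_TB_MONTH = 23.0     # ballpark standard object-storage rate (assumption)
    YEARS = 10

    total_tb = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS   # ~60 PB accumulated
    # Data landed in year y is billed for every remaining month of the survey.
    tb_months = sum(TB_PER_NIGHT * NIGHTS_PER_YEAR * (YEARS - y) * 12 for y in range(YEARS))
    print(f"~{total_tb / 1000:.0f} PB total, ~${tb_months * USD_PER_TB_MONTH / 1e6:.0f}M storage alone")

Archival tiers, egress, and compute change the exact numbers a lot, but the shape of the argument is why steady, predictable volumes tend to favor owned hardware.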


The whole thread comes off as an AWS sales pitch...


I've set up and built my own machines and clusters, as well as grids and industrial-scale infrastructure. I've seen many closet clusters, and clusters administered by grad students. Since then, I've gone nearly 100% cloud (with a strong preference for AWS).

In my experience, there are many tradeoffs to using the cloud, but when you consider the entire context (people, cost, time, productivity), AWS ends up being a very powerful way to implement scientific infrastructure. However, in consortia like this, it's usually architected so that people with local infrastructure (campus clusters, colo) can contribute, although they tend to be "leaf" nodes in the processing pipelines rather than central players.


Why move the data? Why not just enable permissions on cloud sharing a la Snowflake or Iceberg?


Sure, that also works, although it often leads to problems around cost, scalability, and environment customization.
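
For what it's worth, the "share in place" version of this on AWS can be as small as a bucket policy granting a collaborator's account read access; the account ID and bucket name below are placeholders.

    # Sketch: grant another AWS account read access to the data bucket instead of copying it.
    import json
    import boto3

    BUCKET = "survey-data-release"  # placeholder bucket name

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCollaboratorRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::210987654321:root"},  # collaborator account (placeholder)
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }],
    }

    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

The cost question is partly about who pays for the reads; S3's Requester Pays setting pushes request and transfer charges onto the account doing the reading.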


Is this not the same problem high-resolution spy satellites have? Seems like a fair bit of crossover at least?


Spy sats are more bandwidth- and power-constrained. For low Earth orbit, you also can't usually offload data over the target.


> For low Earth orbit, you also can't usually offload data over the target.

That capability is coming with Starlink laser modules. They've already tested this on a Dragon mission, and they have the links working between some satellite shells, so you'd be able to offload data from pretty much everywhere Starlink has a presence.


Vera Rubin is producing ~4 Gbps constantly. Just dealing with the heat to send that much data is highly nontrivial.
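
For context, that rate lines up with the nightly totals upthread (the observing window here is an assumption):

    # ~4 Gbps sustained over an assumed ~10-hour observing night
    gbps = 4
    hours = 10
    tb_per_night = gbps / 8 * 3600 * hours / 1000   # Gb/s -> GB/s, times seconds, GB -> TB
    print(f"~{tb_per_night:.0f} TB per night")      # ~18 TB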


Yep, the data engineering side of this is just as fascinating as the astronomy.



