
A non-trivial portion of my consulting work over the past 10 years has been working on data pipelines at various big corporations that move absurdly small amounts of data around using big data tools like Spark. I would not worry about purchasing services from Databricks, but I would definitely try to poach their sales people if you can.





Just curious, what would you consider "absurdly small amounts of data around using big data tools like Spark," and what do you recommend instead?

I recently worked on some data pipelines with Databricks notebooks à la Azure Fabric. I'm currently using ~30% of our capacity and starting to get pushback to run things less frequently to reduce the load.

I'm not convinced I actually need Fabric here, but the value for me has been that it's the first time the company has been able to provision a platform that can handle the data at all. I also have a small portion of it feeding into a database, which has drawn constant complaints about volume.

At this point I can't tell if we just have unrealistic expectations about the cost of having this data that everyone wants, or if our data engineers are completely out of touch with the current state of the industry and Fabric is just the price we have to pay to keep up.


One financial services company has hundreds of Glue jobs that use PySpark to read and write less than 4 GB of data per run. These jobs run every day.

I'm aware of a government agency with a few hundred GB of data that was using Mongo and Databricks and being pushed towards Snowflake as well. Boggles the mind.
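
For jobs like the Glue example above, a single-node script is usually enough, since a few GB fits comfortably in memory on one machine. A minimal sketch of the idea with plain pandas, where the bucket, paths, column names, and the aggregation itself are hypothetical stand-ins for whatever the job actually does:

    # Single-node alternative to a small daily PySpark/Glue job.
    # Reading/writing s3:// paths with pandas requires the s3fs package.
    import pandas as pd

    # Load the day's partition (a few GB of Parquet).
    df = pd.read_parquet("s3://example-bucket/transactions/date=2024-01-01/")

    # Example transformation: total amount per account.
    daily_totals = df.groupby("account_id", as_index=False)["amount"].sum()

    # Write the result back out as a single Parquet file.
    daily_totals.to_parquet(
        "s3://example-bucket/daily_totals/date=2024-01-01.parquet"
    )

No cluster, no driver/executor overhead, and it can run on a small VM or even a scheduled container.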

I used to do similar work. Back in the day I used 25 TB as the cutoff point for a single-node design. It's certainly larger now.

Which is also a reason to not use Databricks, as they will cost your company money by selling gullible users things they don’t need.


