A non-trivial portion of my consulting work over the past 10 years has been on data pipelines at various big corporations that move absurdly small amounts of data around using big data tools like Spark. I would not worry about purchasing services from Databricks, but I would definitely try to poach their salespeople if you can.
Just curious: what would you consider "absurdly small amounts of data around using big data tools like Spark", and what do you recommend instead?
I recently worked on some data pipelines with Databricks notebooks, a la Azure Fabric. I'm currently using ~30% of our capacity and starting to get pushback to run things less frequently to reduce the load.
I'm not convinced I actually need Fabric here, but the value for me is that it's the first time the company has been able to provision a platform that can handle the data at all. I have a small portion of it flowing into a database as well, which has drawn constant complaints about volume.
At this point I can't tell if we just have unrealistic expectations about the costs of having this data that everyone wants, or if our data engineers are just completely out of touch with the current state of the industry, so Fabric is just the cost we have to pay to keep up.
One financial services company has hundreds of Glue jobs that use PySpark to read and write less than 4 GB of data per run. These jobs run every day.
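For scale, a few GB a day fits comfortably in memory on a single small VM, so a job like that doesn't need a cluster at all. A minimal sketch of the same kind of daily run in plain pandas (paths, column names, and the transform are hypothetical; reading directly from S3 would also need s3fs installed):

```python
import pandas as pd


def daily_run(input_path: str, output_path: str) -> None:
    # A few GB of Parquet loads fine in memory on one modest machine.
    df = pd.read_parquet(input_path)

    # Whatever per-row transformation the Glue job was doing (placeholder logic).
    df = df[df["status"] == "settled"]
    df["notional_usd"] = df["notional"] * df["fx_rate"]

    # Write one output file per daily run, same shape of output the Spark job produced.
    df.to_parquet(output_path, index=False)


if __name__ == "__main__":
    # Hypothetical local paths; swap in s3:// URIs if the data lives in S3.
    daily_run("data/trades/2024-05-01.parquet", "curated/trades/2024-05-01.parquet")
```

Run it from cron, a small container, or a single Lambda/ECS task and you get the same result without paying for a Spark cluster to spin up every day.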