Can someone ELI5 what Snowflake and Databricks are? I spent a few minutes on the...

mping · on Nov 13, 2021

A data lake is a system designed for ingesting, and possibly transforming, lots of data: a "lake" where you dump your data. This is different from e.g. a Postgres db (a single source of truth for a CRUD app, for example), because it captures more data (e.g. events) and it's normally not consistent with the single source of truth (the data may arrive in batches, be imported from other databases, etc.). Because the volume of data is normally huge, you need a cluster to store it, and some way of querying it.

Snowflake and Databricks are companies that operate in this space, providing ways to ingest, transform and analyze large volumes of data.

IanCal · on Nov 13, 2021

Snowflake is (amongst other things, but primarily to me) a SQL database as a service, designed for analytical queries over large datasets.

It separates compute and storage, so there's just a big ol' pile of data and tables, then it spins up large machines to crunch the data on demand.

Data storage is cheap, and the machines are expensive per hour but run for shorter times; with little to no ops work required, it can be a cheap overall system.
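To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. Every price in it is a made-up placeholder (not Snowflake's actual rates), and the function names are hypothetical; the point is only the shape of the arithmetic: storage is billed all month, big machines only for the hours they actually run.

```python
# All prices below are made-up placeholders, NOT any vendor's real rates.
STORAGE_USD_PER_TB_MONTH = 25.0    # hypothetical storage price
BIG_CLUSTER_USD_PER_HOUR = 32.0    # hypothetical: large machines, spun up on demand
SMALL_SERVER_USD_PER_HOUR = 2.0    # hypothetical: modest server that never shuts down
HOURS_PER_MONTH = 730


def on_demand_monthly_cost(stored_tb, busy_hours):
    """Storage billed all month, compute billed only while queries run."""
    return stored_tb * STORAGE_USD_PER_TB_MONTH + busy_hours * BIG_CLUSTER_USD_PER_HOUR


def always_on_monthly_cost(stored_tb):
    """Storage plus a server that is billed for every hour of the month."""
    return stored_tb * STORAGE_USD_PER_TB_MONTH + HOURS_PER_MONTH * SMALL_SERVER_USD_PER_HOUR


on_demand = on_demand_monthly_cost(stored_tb=10, busy_hours=20)
always_on = always_on_monthly_cost(stored_tb=10)
print(f"on demand: ${on_demand:.2f}/month, always on: ${always_on:.2f}/month")
```

With these placeholder numbers the on-demand setup comes out cheaper even though its hourly rate is 16x higher, because it only runs 20 hours a month. Change `busy_hours` enough and the comparison flips, so treat it as an illustration rather than a pricing claim.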

Bunch of other features that are handy or vital depending on your use case (instant data sharing across accounts, for example).

I've used it to transform terabytes of JSON into nice relational tables for analysts to use with very little effort.
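For anyone wondering what "JSON into relational tables" looks like, a minimal sketch in Python follows. It uses SQLite from the standard library as a stand-in (it is not Snowflake, and nowhere near terabyte scale), and the event shape, table names and column names are all invented for illustration.

```python
import json
import sqlite3

# Made-up sample events; real input would be files of JSON, not a literal list.
raw_events = [
    '{"id": 1, "user": {"name": "ada", "country": "UK"},'
    ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}',
    '{"id": 2, "user": {"name": "lin", "country": "SG"},'
    ' "items": [{"sku": "A1", "qty": 5}]}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_name TEXT, country TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

for line in raw_events:
    event = json.loads(line)
    # Nested object -> columns on the parent row.
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (event["id"], event["user"]["name"], event["user"]["country"]),
    )
    # Nested array -> one row per element in a child table.
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(event["id"], item["sku"], item["qty"]) for item in event["items"]],
    )

# Analysts can now use plain SQL joins instead of digging through JSON.
units_by_country = conn.execute(
    "SELECT o.country, SUM(i.qty) FROM orders o "
    "JOIN order_items i ON i.order_id = o.order_id "
    "GROUP BY o.country ORDER BY o.country"
).fetchall()
print(units_by_country)
```

The nested `user` object becomes columns and the `items` array becomes rows in a child table, which is the general idea regardless of which engine does the work.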

Hopefully that's a useful overview of what kind of thing it is and where it sits.

legerdemain · on Nov 13, 2021

Snowflake is a hosted database that uses SQL. Two things distinguish it: (1) it lets users pay for data storage and compute power separately and independently, and (2) it takes decisions about data indexing out of your hands.

Databricks is a vendor of hosted Spark (and is operated by the creators of Spark). Spark is software for coordinating data processing jobs on multiple machines. The jobs are written using a SQL-like API that allows fairly arbitrary transformations. Databricks also offers storage using their custom virtual cloud filesystem that exposes stored datasets as DB tables.

Both vendors also offer interactive notebook functionality (although Databricks has spent more time on theirs). They're both getting into dashboarding (I think).

Ultimately, they're both selling cloud data services, and their product offerings are gradually converging.

ngc248 · on Nov 13, 2021

A data lake is a company-wide data repository. All the "data streams" from all of the different departments flow into the data lake. The aim is to use this data to get both macro and micro insights.

zerotosixty · on Nov 13, 2021

They are a data warehouse with analytics? So data warehouse as a service in the cloud?

So they can collect data from different places like SQL databases, images, etc. I think a better question would be: what type of data can't they ingest?

Once you have your data, I guess you can run some analytics to find out what your data tells you.

tomnipotent · on Nov 14, 2021

A data lake can be home to many different data formats, e.g. Parquet, Avro, Thrift, protobuf, ORC, HDF5, CSV and JSON, all co-existing together. Spark lets you create a virtual abstraction over all of this, and query it as though it were a homogeneous database. There's no need to import data into a centralized format and schema.
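A toy version of that idea in plain Python, to show the shape of it: one reader per format, all yielding rows in a common shape, so the "query" doesn't care where a row came from. This is only a sketch; it is not Spark's API, and the file contents, `READERS` table and `scan` function are invented for illustration.

```python
import csv
import io
import json

# Made-up file contents, held in memory so the sketch is self-contained.
CSV_FILE = "city,temp_c\nOslo,4\nCairo,31\n"
JSONL_FILE = '{"city": "Lima", "temp_c": 19}\n{"city": "Pune", "temp_c": 28}\n'


def read_csv(text):
    """Yield rows from CSV text as dicts with a numeric temp_c."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"city": row["city"], "temp_c": float(row["temp_c"])}


def read_jsonl(text):
    """Yield rows from newline-delimited JSON in the same shape as read_csv."""
    for line in text.splitlines():
        record = json.loads(line)
        yield {"city": record["city"], "temp_c": float(record["temp_c"])}


# Hypothetical registry mapping a format name to its reader.
READERS = {"csv": read_csv, "jsonl": read_jsonl}


def scan(sources):
    """Yield rows from every source in one common shape, whatever its format."""
    for fmt, text in sources:
        yield from READERS[fmt](text)


lake = [("csv", CSV_FILE), ("jsonl", JSONL_FILE)]

# One "query" over both formats, with no import into a central schema first.
warm_cities = sorted(row["city"] for row in scan(lake) if row["temp_c"] > 20)
print(warm_cities)
```

The files stay in their original formats; only the read path is unified. A real engine adds schemas, pushdown and distribution on top, but the abstraction is the same kind of thing.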

This really all ties back to the "old" Hadoop days, and is an evolution of compute over data not in a fixed and managed format/schema.

geoduck14 · on Nov 13, 2021

I'd like to add some points: I've used Snowflake for several years. Snowflake works with structured and semi-structured data (think spreadsheets and JSON). I've never tried working with pics or videos, and I'm not sure it would make sense to do that.

I've evaluated Databricks. It works with the above mentioned structured and semi-structured data. I also suspect it could process unstructured data. My understanding is that it runs Python (and some others), so you can do any "Python stuff, but in the cloud, and on 1000s of computers"

62951413 · on Nov 14, 2021

Databricks used to be an Apache Spark as a service company. And Spark is a predominantly Scala code base. PySpark is just a Python binding for the real engine popular in ML circles. In the last couple of years the Databricks platform migrated from open-source Spark to a new proprietary engine written in C++.

tomnipotent · on Nov 14, 2021

You're referring to PySpark, which still does all the heavy lifting in the JVM.

jeffreygoesto · on Nov 13, 2021

People who downvoted this, please take a minute and reflect that your world is not the whole world. There is a serious question in this comment and there are myriads of topics _you_ have no clue about.

dekhn · on Nov 13, 2021

sure, but if I see the term 'data lake' I'm gonna Bing it, with the first result being https://aws.amazon.com/big-data/datalakes-and-analytics/what... which explains it nicely.

ELI5 is for reddit; generally here we expect you can google it to get the ELI5 explanation before giving us your hot take in a comment.

pxc · on Nov 13, 2021

Yeah, that's exactly the kind of content I found unsuitable when I did a web search for the term. It spends a whole two sentences giving an explanation that tells me very little about how data lakes are anything more specific than a cloud-hosted database solution, and moves on to

> Organizations that successfully generate business value from their data, will outperform their peers.

at which point I'm like

> ok, I'm reading a covert advertisement about Fancy Cloud Technology aimed at some kind of big-spending manager, which is unlikely to tell me meaningfully what this actually is

and I'm out. I was looking for content that was in a more neutral, purely educational genre, and wondering what collection of non-cloud analogues it replaces/is composed of. Someone writing in the comments

> I used it to transform several terabytes of JSON into nice relational data for analysts without too much effort

is way, way more direct and helpful than mentioning that 'unlike data warehouses, data lakes support non-relational data'. Like great, it's a cloud thing that supports a variety of databases. But what is it?

> before giving us your hot take in a comment

I didn't give any take at all? I just really found all the sources that came up on the first page of search results to be almost in the wrong genre for me, and expected (correctly) that people on this site would be able to produce descriptions in 1-5 sentences that worked way better for me.

Pretty much all of the answers I got here were really good, and I'm glad I asked.

fragmede · on Nov 14, 2021

> What is a data lake?

> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

This may be self-explanatory for you, but what it means in practice is not as self-evident as you believe. For all it describes, it could be an FTP upload directory that loads things into a SQLite database. It's not until the scale is invoked (multi-terabyte/day) that the inadequacies of a naive solution become apparent. For those in that area of the industry, Snowflake is already known. (Seriously, if you're running into issues with limitations of Redshift, it behooves you to take a look at Snowflake.) For those that aren't, data warehousing is unfamiliar, never mind data lake. For those outside the ML sphere, the finer points of training runs are also non-obvious.

uvdn7 · on Nov 13, 2021

It’s probably just me, but the distinction between data lake and data warehouse seems like splitting hairs. Unstructured data can always be stored in structured databases. What’s the main reason for both to coexist?

dekhn · on Nov 13, 2021

History matters here and I don't know how well this is documented, but: data warehouses have been around since the 70s or so, while data lake is a newer term. Data warehouses came from an era where nearly all data was stored in the database itself (typically Oracle), owned and controlled by a single group or a few groups, and there were only a few databases, which were the source of truth. The two databases would normally be a transaction engine handling real-time load (just what's required to authorize a credit card transaction, for example), and a "warehouse" which contained all the long-term data, like every transaction that had ever occurred.

Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to do queries against all of them simultaneously (e.g., join your credit card transaction history with historical rates of default in a zip code), quickly. The data warehouse solution was to use federated database queries (JOINs across databases), or force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
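A small sketch of what a JOIN across databases looks like, using SQLite's `ATTACH DATABASE` from Python as a stand-in for a real federated setup. The two databases, table names and numbers are all made up; only the credit-card-versus-default-rate example comes from the paragraph above.

```python
import sqlite3

# Two separate databases: the connection's own "main" one and an attached "risk" one.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS risk")

conn.execute("CREATE TABLE main.transactions (card TEXT, zip TEXT, amount REAL)")
conn.execute("CREATE TABLE risk.default_rates (zip TEXT, default_rate REAL)")

# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO main.transactions VALUES (?, ?, ?)",
    [("c1", "94110", 40.0), ("c2", "10001", 15.0), ("c3", "94110", 60.0)],
)
conn.executemany(
    "INSERT INTO risk.default_rates VALUES (?, ?)",
    [("94110", 0.02), ("10001", 0.05)],
)

# One query that joins a table in each database.
rows = conn.execute(
    """
    SELECT t.zip, SUM(t.amount), r.default_rate
    FROM main.transactions AS t
    JOIN risk.default_rates AS r ON r.zip = t.zip
    GROUP BY t.zip, r.default_rate
    ORDER BY t.zip
    """
).fetchall()
print(rows)
```

With two in-process databases this is trivial. The pain described above shows up when it is 30 databases owned by different teams on different servers, which is roughly the gap the data lake idea tries to fill.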

It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.

Storing unstructured data in a database is dumb because databases cost about 10X storage space due to indexing, while unstructured data often can just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).

I view this through the lens of web tech. For example, see the wars between the MapReduce and database people, and how Google evolved from MapReduce against GFS to Flume against Spanner, showing we just live in an endless cycle of renaming old technology.

It's absolutely correct that the terminology doesn't map perfectly.

pxc · on Nov 13, 2021

This was really helpful, too. Thanks!

glogla · on Nov 13, 2021

It used to be that way. Old data warehouses (built on relational dbs) couldn't handle large scale data, and old data lakes used to be hard to use (write a map-reduce job to query data).
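To illustrate the "hard to use" part, here is the same aggregate written twice in Python: once as hand-rolled map, shuffle and reduce phases, and once as a single SQL statement (via SQLite). It is a single-process toy, not Hadoop, and the log data and function names are made up.

```python
import sqlite3
from collections import defaultdict

# Made-up page-view log: (page, bytes_served).
views = [("/home", 120), ("/docs", 300), ("/home", 80), ("/docs", 50), ("/blog", 10)]


# The map-reduce way: you write each phase yourself.
def map_phase(records):
    """Emit a (key, value) pair per input record."""
    for page, size in records:
        yield page, size


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}


mapreduce_totals = reduce_phase(shuffle(map_phase(views)))

# The SQL way: declare what you want and let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (page TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?)", views)
sql_totals = dict(conn.execute("SELECT page, SUM(bytes) FROM views GROUP BY page"))

print(mapreduce_totals == sql_totals)
```

Both produce the same totals per page; the difference is how much code an analyst has to write and maintain for every new question.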

It is barely true nowadays.

incomplete · on Nov 13, 2021

i worked at excite.com right after the IPO, and front and center in the HQ building was a MASSIVE glass wall showcasing the oracle data warehouse machine room.

i didn't enjoy working w/ either the datastore directly or the DBA team that ran it. an early, more old-white-dude "i just want to serve 5T"