I still use Spark for a number of jobs that need language-specific features, but I think within 2 years all custom code will be trivially invoked as native UDFs in SQL data warehouses (i.e. Snowflake, which has essentially solved big-data performance as a going concern).
I just write SQL in Snowflake and it replaces 95% of what I would otherwise have done in custom MapReduce or Spark code.
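To make that concrete, here's a toy sketch of the kind of hand-rolled Spark job that a single SQL statement replaces (PySpark; the paths and column names are made up for illustration):

  # Hand-written Spark version: read, group, aggregate, write.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.getOrCreate()

  events = spark.read.parquet("/data/events/")
  daily = (events
           .groupBy("event_date", "user_id")
           .agg(F.sum("amount").alias("total_amount")))
  daily.write.parquet("/data/daily_totals/")

  # The same thing as one declarative statement - roughly what
  # "just write SQL in Snowflake" collapses all of the above into.
  daily_sql = spark.sql("""
      SELECT event_date, user_id, SUM(amount) AS total_amount
      FROM parquet.`/data/events/`
      GROUP BY event_date, user_id
  """)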
What I really dislike about modern cloud DWHs such as Snowflake is that they hide a lot of things from me. Since I'm not a CTO who worries about not delivering, but a junior DE who actually wants to learn things, I'd really prefer that things were done the old way, where we had to manage our own infrastructure and our own ETL code. These kinds of things cannot be learned "just for fun", because one has to work in a real environment.
What do you mean? Of course you can still learn them "just for fun" if you want. There are plenty of columnar data warehouses (memsql, greenplum, vertica, clickhouse, etc) and data processing frameworks (spark, flink, etc) that you can look at, implement and run yourself.
What I'm saying is that you can surely pick up the basics from personal use, but that's completely different from real usage, which can only be learned on the job. And those jobs are getting fewer as everyone moves to the cloud.
SQL will always be faster than Hadoop and MapReduce. The main reason to use those other, slower services is that developers are not used to SQL or declarative programming, and insist on writing their code in a procedural way.
That's completely backwards. Mapreduce-like approaches are how SQL datastores are implemented underneath; the absolute best case for SQL is to equal hand-tuned mapreduce-like performance, and often it will be slower (you're at the mercy of your query planner to pick the right indices, do joins in the right order, etc.). The main reason people use SQL is because they find it easier to express a query that way (which is completely legitimate - if your query planner is good enough most of the time, you've got better things to be doing than hand-tuning your query execution).
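For a sense of what I mean by "mapreduce-like underneath", here's a toy, in-memory sketch of the shape behind SELECT key, SUM(value) ... GROUP BY key - purely illustrative, not how any particular engine actually implements it:

  from collections import defaultdict

  rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

  # map: emit (key, value) pairs from each input row
  mapped = ((key, value) for key, value in rows)

  # shuffle: bring all values for the same key together
  groups = defaultdict(list)
  for key, value in mapped:
      groups[key].append(value)

  # reduce: aggregate each group
  result = {key: sum(values) for key, values in groups.items()}
  print(result)  # {'a': 4, 'b': 6}

A real engine does the same map/shuffle/reduce over partitioned data in parallel; the query planner's job is to pick a good physical plan for it.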
No, that does not seem correct. SQL datastores are not "map-reduce underneath"; they have optimized data structures for efficient querying (i.e. indices). Map-reduce is equivalent to the cases in a SQL database where you have a full table scan in your query plan - basically brute-forcing your way through the dataset.
You can (and often should) have indices in a map-reduce situation as well - you just build them in an explicit, visible way. But in most of the relevant use cases you're doing some kind of aggregation over the whole table, so indices don't help any.
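Rough sketch with toy in-memory data (names made up): an explicitly built index pays off for point lookups but buys you nothing for a whole-table aggregation:

  rows = [("alice", 10), ("bob", 20), ("alice", 30), ("carol", 40)]

  # Build the index explicitly and visibly: key -> row offsets.
  index = {}
  for offset, (user, amount) in enumerate(rows):
      index.setdefault(user, []).append(offset)

  # Point lookup via the index: touches only the matching rows.
  alice_total = sum(rows[i][1] for i in index["alice"])

  # Whole-table aggregation: every row gets read regardless, index or not.
  grand_total = sum(amount for _, amount in rows)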
And if your primary use-case is column-wise aggregation over the whole table, in SQL you'd use a (compressed) column store rather than a row store as your table storage method.
To be fair, Parquet, which is commonly used in Big Data solutions, is a column store format. So once you normalize your data and save it as Parquet, you get efficient column-wise aggregation - but that assumes some preprocessing step.
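Small sketch of the effect (assumes pyarrow; the file and column names are made up): only the aggregated column has to be read back, the wide columns are never touched:

  import pyarrow as pa
  import pyarrow.compute as pc
  import pyarrow.parquet as pq

  table = pa.table({
      "user_id": ["a", "b", "a", "c"],
      "amount":  [10.0, 20.0, 30.0, 40.0],
      "comment": ["x", "y", "z", "w"],  # wide column we never aggregate
  })
  pq.write_table(table, "events.parquet")

  # Column projection: only the "amount" column is read off disk.
  amounts = pq.read_table("events.parquet", columns=["amount"])
  print(pc.sum(amounts.column("amount")).as_py())  # 100.0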
That makes no sense. SQL is a query language, commonly implemented by relational databases.
In the early 2000s, columnar relational data warehouses were not sophisticated and scalable enough to handle the scale of data encountered at Yahoo, Google and other internet companies. MapReduce (and the many evolutions of the Hadoop ecosystem) was created to scale processing through low-level instructions and algorithms.
Eventually columnar data warehouses caught up and are now capable of handling petabyte scale, regardless of whatever language you use to query them. The fundamental storage and compute primitives haven't really changed that much, just offered in a much more user-friendly way now.
SQL itself is just a query language; it's the underlying cloud-based data warehouse that transparently fulfills the role that map/reduce used to play in terms of parallelization.
1) Why would you want to maintain your own Spark infrastructure? Spark on Kube is a huge improvement over YARN but you still have to deal with OOMEs, filled disks, Kube upgrades, pushing custom images to container registries, etc etc etc.
2) Snowflake is probably 10-50x as performant as Spark for data manipulation. I don't know what kind of unholy demonic incantations Snowflake is doing on the backend to support their SQL performance, but it's really freaking fast. There's just no other way to cut it.
I've spent 5-10 years eking out every ounce of performance I can from a Hadoop/Spark cluster. I'm not trying to be unreasonable about this. I would love for OSS to be competitive; it's great for the world, and it would be great for my skill set and earning potential.
But it's not a contest, and if you think standalone Spark is going to be a viable competitor in a couple years, you are deluding yourself. Make informed choices about your career and investment.
A lot of the articles I read about Snowflake involve Data Vault, which is a massive turn-off. And when their tech lead (Kent Graziano) is a prominent figure in the DV bullshit...
Snowflake and DV have no interdependency whatsoever. Snowflake is just a database. Whether you use DV to model the data inside of it, or dimensional modelling, or "big wide tables", is completely up to you; there's nothing about it that requires or benefits from DV in particular.
You should try Databricks, especially the new Photon engine powering Spark. In general more performant than Snowflake in SQL and a lot more flexible. (There are some cases in which Databricks would be slower but the perf is improving rapidly.)
Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster sometimes, but why would you use it if you can't even read the logs of running jobs?
Databricks is amazing; the Delta Live Tables technology is incredible. It's very hard to approach problems like Data Lineage and Data Quality, but that platform does it the right way.
My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.