I still use Spark for a number of jobs that need language-specific features, but I think within 2 years all custom code will be trivially invoked as native UDFs in SQL data warehouses (i.e. Snowflake, which has essentially solved big-data performance as a going concern).
I just write SQL in Snowflake and it replaces 95% of what I would otherwise have done in custom MapReduce or Spark code.
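To make that concrete, here's a toy sketch of the kind of hand-rolled Spark job that a single SQL statement replaces (PySpark; the paths and column names are made up for illustration):

  # Hand-written Spark version: read, group, aggregate, write.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.getOrCreate()

  events = spark.read.parquet("/data/events/")
  daily = (events
           .groupBy("event_date", "user_id")
           .agg(F.sum("amount").alias("total_amount")))
  daily.write.parquet("/data/daily_totals/")

  # The same thing as one declarative statement - roughly what
  # "just write SQL in Snowflake" collapses all of the above into.
  daily_sql = spark.sql("""
      SELECT event_date, user_id, SUM(amount) AS total_amount
      FROM parquet.`/data/events/`
      GROUP BY event_date, user_id
  """)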
What I really dislike about modern cloud DWHs such as Snowflake is that they hide a lot of things from me. Since I'm not a CTO who worries about not delivering, but a junior DE who actually wants to learn things, I'd really prefer that things were done the old way, where we had to manage our own infrastructure and our own ETL code. These kinds of things cannot be learned "just for fun", because one has to work in a real environment.
What do you mean? Of course you can still learn them "just for fun" if you want. There are plenty of columnar data warehouses (memsql, greenplum, vertica, clickhouse, etc) and data processing frameworks (spark, flink, etc) that you can look at, implement and run yourself.
What I'm saying is that you can surely pick up the basics from personal use, but that's completely different from real usage, which can only be learned on the job. And those jobs are getting fewer as everyone moves to the cloud.
SQL will always be faster than Hadoop and MapReduce. The main reason to use those other, slower services is that developers are not used to SQL or declarative programming, and insist on writing their code in a procedural way.
That's completely backwards. Mapreduce-like approaches are how SQL datastores are implemented underneath; the absolute best case for SQL is to equal hand-tuned mapreduce-like performance, and often it will be slower (you're at the mercy of your query planner to pick the right indices, do joins in the right order, etc.). The main reason people use SQL is because they find it easier to express a query that way (which is completely legitimate - if your query planner is good enough most of the time, you've got better things to be doing than hand-tuning your query execution).
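For a sense of what I mean by "mapreduce-like underneath", here's a toy, in-memory sketch of the shape behind SELECT key, SUM(value) ... GROUP BY key - purely illustrative, not how any particular engine actually implements it:

  from collections import defaultdict

  rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

  # map: emit (key, value) pairs from each input row
  mapped = ((key, value) for key, value in rows)

  # shuffle: bring all values for the same key together
  groups = defaultdict(list)
  for key, value in mapped:
      groups[key].append(value)

  # reduce: aggregate each group
  result = {key: sum(values) for key, values in groups.items()}
  print(result)  # {'a': 4, 'b': 6}

A real engine does the same map/shuffle/reduce over partitioned data in parallel; the query planner's job is to pick a good physical plan for it.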
No, that does not seem correct. SQL datastores are not "map-reduce underneath"; they have optimized data structures for efficient querying (i.e. indices). Map-reduce is equivalent to the cases in a SQL database where you have a full table scan in your query plan - basically brute-forcing your way through the dataset.
You can (and often should) have indices in a map-reduce situation as well - you just build them in an explicit, visible way. But in most of the relevant use cases you're doing some kind of aggregation over the whole table, so indices don't help any.
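Rough sketch with toy in-memory data (names made up): an explicitly built index pays off for point lookups but buys you nothing for a whole-table aggregation:

  rows = [("alice", 10), ("bob", 20), ("alice", 30), ("carol", 40)]

  # Build the index explicitly and visibly: key -> row offsets.
  index = {}
  for offset, (user, amount) in enumerate(rows):
      index.setdefault(user, []).append(offset)

  # Point lookup via the index: touches only the matching rows.
  alice_total = sum(rows[i][1] for i in index["alice"])

  # Whole-table aggregation: every row gets read regardless, index or not.
  grand_total = sum(amount for _, amount in rows)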
And if your primary use-case is column-wise aggregation over the whole table, in SQL you'd use a (compressed) column store rather than a row store as your table storage method.
To be fair, Parquet, which is commonly used in Big Data solutions, is a column store format. So once you normalize your data and save it as Parquet, you get efficient column-wise aggregation - but that assumes some preprocessing step.
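Small sketch of the effect (assumes pyarrow; the file and column names are made up): only the aggregated column has to be read back, the wide columns are never touched:

  import pyarrow as pa
  import pyarrow.compute as pc
  import pyarrow.parquet as pq

  table = pa.table({
      "user_id": ["a", "b", "a", "c"],
      "amount":  [10.0, 20.0, 30.0, 40.0],
      "comment": ["x", "y", "z", "w"],  # wide column we never aggregate
  })
  pq.write_table(table, "events.parquet")

  # Column projection: only the "amount" column is read off disk.
  amounts = pq.read_table("events.parquet", columns=["amount"])
  print(pc.sum(amounts.column("amount")).as_py())  # 100.0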
That makes no sense. SQL is a query language, commonly implemented by relational databases.
In the early 2000s, columnar relational data warehouses were not sophisticated and scalable enough to handle the scale of data encountered at Yahoo, Google and other internet companies. MapReduce (and the many evolutions of the Hadoop ecosystem) was created to scale processing through low-level instructions and algorithms.
Eventually columnar data warehouses caught up and are now capable of handling petabyte scale, regardless of whatever language you use to query them. The fundamental storage and compute primitives haven't really changed that much, just offered in a much more user-friendly way now.
SQL itself is just a query language; it's the underlying cloud-based data warehouse that transparently fulfills the role that map/reduce used to play in terms of parallelization.
1) Why would you want to maintain your own Spark infrastructure? Spark on Kube is a huge improvement over YARN but you still have to deal with OOMEs, filled disks, Kube upgrades, pushing custom images to container registries, etc etc etc.
2) Snowflake is probably 10-50x as performant as Spark for data manipulation. I don't know what kind of unholy demonic incantations Snowflake is doing on the backend to support their SQL performance, but it's really freaking fast. There's just no other way to cut it.
I've spent 5-10 years eking out every ounce of performance I can from a Hadoop/Spark cluster. I'm not trying to be unreasonable about this. I would love for OSS to be competitive; it's great for the world, and it would be great for my skill set and earning potential.
But it's not a contest, and if you think standalone Spark is going to be a viable competitor in a couple years, you are deluding yourself. Make informed choices about your career and investment.
A lot of the articles I read about Snowflake involve Data Vault, which is a massive turn-off. And when their tech lead (Kent Graziano) is a prominent figure in the DV bullshit...
Snowflake and DV have no interdependency whatsoever. Snowflake is just a database. Whether you use DV to model the data inside of it, or dimensional modelling, or "big wide tables", is completely up to you; there's nothing about it that requires or benefits from DV in particular.
You should try Databricks, especially the new Photon engine powering Spark. In general more performant than Snowflake in SQL and a lot more flexible. (There are some cases in which Databricks would be slower but the perf is improving rapidly.)
Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster sometimes, but why would you use it if you can't even read the logs of running jobs?
Databricks is amazing; the Delta Live Tables technology is incredible. It's very hard to approach problems like Data Lineage and Data Quality, but that platform does it the right way.
My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.