Yes it is :) You have to normalize data taken from various sources of various ag...

markus_zhang · on June 10, 2019

From my humble experience, if you have a sales or product team that keeps pumping out spreadsheets in weird formats, you need someone dedicating a few hours to build a proper ETL, and if they are constantly changing the format or adding new things, you need a dedicated person just for that. Modern tools like Python or Power Query are not enough for this eternal war.
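To make the "constantly changing the format" problem concrete, here is a minimal sketch of the kind of guard such an ETL usually grows: map known header variants to canonical names and fail loudly on anything new. The alias table, the required columns, and the function name are all made up for illustration; a real pipeline would load them from config.

```python
import csv
import io

# Hypothetical mapping from header variants seen in the wild to canonical names.
HEADER_ALIASES = {
    "customer": "customer",
    "customer name": "customer",
    "client": "customer",
    "amount": "amount",
    "amt": "amount",
}
# Hypothetical set of columns every incoming sheet must provide.
REQUIRED = {"customer", "amount"}


def normalize_rows(text):
    """Parse CSV text, rename known header variants, and raise on format drift."""
    reader = csv.DictReader(io.StringIO(text))
    mapping = {}
    unknown = []
    for raw in reader.fieldnames or []:
        # Collapse case and whitespace so "Customer  Name " matches "customer name".
        key = " ".join(raw.strip().lower().split())
        if key in HEADER_ALIASES:
            mapping[raw] = HEADER_ALIASES[key]
        else:
            unknown.append(raw)
    missing = REQUIRED - set(mapping.values())
    if unknown or missing:
        raise ValueError(f"format drift: unknown={unknown} missing={sorted(missing)}")
    return [
        {mapping[k]: (v or "").strip() for k, v in row.items() if k in mapping}
        for row in reader
    ]
```

This only catches header drift. It does nothing about changed units, merged cells, or new sheets, which is where the dedicated person comes in.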

matwood · on June 10, 2019

It's not that, it's the systems. 15 years ago I built a data warehouse, pretty sophisticated for its time, for a company that ran call centers. The amount of data that came off the call systems was staggering, and the format arcane. Every vendor patch had the potential to wreck the ETL process. Then there was account data from clients and from other internal systems.

The people and their spreadsheets were the easy part to control.

da_chicken · on June 11, 2019

This basically reads like, "You need to have a data engineer." Or half an engineer and half an analyst.

danielscrubs · on June 11, 2019

Let’s say you have 20,000 tables in total for a company. They are in 10 different databases. You have no overview of the data and no comments. You don’t have a starting point for where information x is.

Welcome to my reality.

Would I love a data architect and a domain expert in my team? Yeah.

Will I run around booking meetings with everyone that even hints at working with data like a headless hen? Yeah.

Is this the normal procedure for Data Scientists in big and old companies? More so than I would like.

Oh! And I forgot that the security department will constantly deny your access to data you need (until you force their hand).
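For the "no starting point for where information x is" part, one cheap first step is to dump every database's schema into a single searchable catalog. The sketch below uses SQLite only so that it runs stand-alone; on a real warehouse you would more likely query each system's own metadata views (for example `information_schema.columns`, where available). The function names are hypothetical.

```python
import sqlite3


def build_catalog(connections):
    """Return (database, table, column) tuples for every table in every connection.

    `connections` maps a database label to an open sqlite3 connection.
    """
    catalog = []
    for db_name, conn in connections.items():
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            for col in conn.execute(f'PRAGMA table_info("{table}")'):
                catalog.append((db_name, table, col[1]))
    return catalog


def find_columns(catalog, keyword):
    """Return catalog entries whose table or column name contains the keyword."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry[1].lower() or needle in entry[2].lower()
    ]
```

A name search like this will not replace the domain expert, since it cannot tell you which of five `customer_id` columns is the trustworthy one, but it narrows down whom to book the meeting with.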

DEADBEEFC0FFEE · on June 11, 2019

Everything you mention is true, and it is compounded if the data is healthcare related: privacy concerns, data from different systems that claim to be the same, preventing reidentification.

jlj · on June 12, 2019

If you can get your data safely to S3, Athena can handle a lot of reporting and analysis use cases. The table or view definition can handle the normalization process. Full-on ETL pipelines are sometimes (but not always) more engineering than necessary.

(Disclaimer: I work in data engineering at Amazon and use those tools in my day to day)
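A rough sketch of what letting the view definition handle normalization could look like: a small helper that renders a `CREATE OR REPLACE VIEW` statement which trims, lower-cases, and type-casts raw string columns. The table, view, and column names are hypothetical, the helper only builds the SQL text (it does not talk to Athena), and the exact cleaning rules are an assumption, not something from the comment above.

```python
def normalizing_view_sql(view, source, columns):
    """Render a CREATE OR REPLACE VIEW statement that cleans raw string columns.

    `columns` maps column name -> target SQL type. "varchar" columns are
    trimmed and lower-cased; any other type is trimmed and passed to TRY_CAST,
    which yields NULL instead of an error when a value does not parse.
    """
    select_items = []
    for name, sql_type in columns.items():
        trimmed = f'TRIM("{name}")'
        if sql_type.lower() == "varchar":
            expr = f"LOWER({trimmed})"
        else:
            expr = f"TRY_CAST({trimmed} AS {sql_type.upper()})"
        select_items.append(f'  {expr} AS "{name}"')
    body = ",\n".join(select_items)
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n{body}\nFROM {source}"


# Hypothetical usage: raw_sales is assumed to be a table over files in S3
# whose columns were all loaded as strings.
print(normalizing_view_sql(
    "sales_clean", "raw_sales", {"customer": "varchar", "amount": "double"}
))
```

The appeal is that the raw files stay untouched and the cleaning logic lives in one statement you can replace at any time; the cost is that the cleaning runs again on every query.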