So I have to deal with both tools at my company. The reality is that it all feels so incomplete from a BI standpoint. Managers are throwing money at front end tools like Salesforce and Tableau, but the entire back end stack is still pretty much the same as 20-30 years ago (big expensive Oracle-ish databases).
I think the development of Python and Jupyter and other less known things like Vega are much more interesting. Python is today the only "glue code" that puts all of it together, from data to insights.
> the entire back end stack is still pretty much the same as 20-30 years ago (big expensive Oracle-ish databases).
Other than the expensive part, is it really such a bad thing? I feel like relational databases are a pretty good fit for a wide set of use cases and have a huge amount of tooling.
The business comes with requests that require complex SQL over millions of records of data that is normally sitting in various sources (warehouse, Salesforce, etc.). Unless you hire expensive data engineers, you can't do this type of work reliably. You can stick things together with expensive GUI-oriented prep tools like Alteryx, but you pay in reliability and, quite frankly, sleep. And forget IT; IT is so stuck in their ways that you'd be waiting years for each analysis + you'd spend 10x what you should.
Isn't the problem space here genuinely complex in terms of business complexity? Is there some better alternative that doesn't entail some other massive tradeoff, such as managing your own servers, creating ingress mechanisms from multiple systems, building your own version of Salesforce, etc.?
In short, is there any solution that "does everything you could possibly want" while ensuring you _never_ need to hire a data engineer? This is a holy grail that I don't think exists.
You have to normalize data taken from various sources of various age and complexity. So you really have to understand the data. You also have to really understand the questions.
I've worked with (and on) lots of these tools and projects; the complexity is never in the frontend. It's dominated by getting the data, getting it right, and getting it into the right format.
If all you want in the end is a good looking dashboard on a website then you might as well build it yourself; because of the cost structure that can even cost less than buying one of the BI frontend tools (there's not a lot of difference in development time, but the BI frontenders are more expensive because they are rarer and the licensing is high).
From my humble experience, if you have a sales or product team that keeps pumping out spreadsheets in weird formats, you need someone dedicating a few hours to get a proper ETL, and if they are constantly changing the format or adding new things you need a dedicated person just for that. Modern tools like Python or Power Query are not enough for this eternal war.
It's not that, it's the systems. 15 years ago I built a data warehouse, pretty sophisticated for its time, for a company that ran call centers. The amount of data that came off of the call systems was staggering, and the format arcane. Every vendor patch had the potential to wreck the ETL process. Then there was account data from clients, and other internal systems.
The people and their spreadsheets were the easy part to control.
Let’s say you have 20,000 tables in total for a company, spread across 10 different databases. You have no overview of the data and no comments. You don’t have a starting point for where piece of information X lives (a sketch of the kind of schema crawl this turns into follows below).
Welcome to my reality.
Would I love a data architect and a domain expert in my team? Yeah.
Will I run around like a headless hen, booking meetings with everyone who even hints at working with data? Yeah.
Is this the normal procedure for Data Scientists in big and old companies? More so than I would like.
Oh! And I forgot that the security department will constantly deny you access to the data you need (until you force their hand).
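The only semi-sane workaround I know of is to crawl the metadata yourself. A minimal sketch of the idea, assuming SQLAlchemy and Postgres-style information_schema access (the connection URLs and search term below are made up):

```python
# Hypothetical sketch: crawl information_schema across several databases to
# build a rough column catalog, so "where does field X live?" becomes a grep.
# Requires read access; URLs and the search term are placeholders.
from sqlalchemy import create_engine, text

DATABASE_URLS = [
    "postgresql://readonly@warehouse1/erp",
    "postgresql://readonly@warehouse2/crm",
    # ...one URL per database you are allowed to see
]

COLUMN_QUERY = text(
    "SELECT table_schema || '.' || table_name AS tbl, column_name "
    "FROM information_schema.columns "
    "WHERE column_name ILIKE :pattern"
)

def find_columns(search_term: str) -> list[tuple[str, str, str]]:
    """Return (database_url, table, column) for every matching column name."""
    hits = []
    for url in DATABASE_URLS:
        engine = create_engine(url)
        with engine.connect() as conn:
            for tbl, col in conn.execute(COLUMN_QUERY, {"pattern": f"%{search_term}%"}):
                hits.append((url, tbl, col))
    return hits

if __name__ == "__main__":
    for url, tbl, col in find_columns("customer_id"):
        print(url, tbl, col)
```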
Everything you mention is true and is compounded if the data is healthcare related: privacy concerns, data from different systems that claim to be the same, preventing re-identification.
If you can get your data safely to S3, Athena can handle a lot of reporting and analysis use cases. The table or view definition can handle the normalization process. Full-on ETL pipelines are sometimes (but not always) more engineering than necessary.
(Disclaimer: I work in data engineering at Amazon and use those tools in my day to day)
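For what it's worth, here is a rough sketch of that pattern: point an external table at raw files in S3, push the normalization into a view, and submit both statements with boto3. The bucket, database and column names are made up for illustration.

```python
# Rough sketch of the S3 + Athena pattern described above: an external table
# over raw files plus a view that does the normalization, submitted via boto3.
# Bucket, database and column names are made up for illustration.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw.orders (
    order_id   string,
    amount     string,   -- arrives as text; cleaned up in the view below
    ordered_at string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://example-bucket/raw/orders/'
"""

VIEW = """
CREATE OR REPLACE VIEW analytics.orders_clean AS
SELECT order_id,
       CAST(amount AS decimal(12, 2))       AS amount,
       date_parse(ordered_at, '%Y-%m-%d')   AS ordered_at
FROM raw.orders
"""

def run(sql: str) -> str:
    """Submit a statement to Athena and return the execution id (poll separately)."""
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    return resp["QueryExecutionId"]

for statement in (DDL, VIEW):
    print(run(statement))
```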
I am hiring one in Krakow! Seriously though, in a team of 10 business analysts I can barely afford 1 data engineer. Business analysts tend to cost less and also be more "business focused", so they are an easier sell to management.
You seem to have an engineering problem, so hire engineers and perhaps fire some of those analysts. Don't make your business depend on someone else's tailored IT products; they will reap the profits you could be making.
Yes, I am a bit confused by that statement too. Isn't it good that complex tasks are handled by specialized people? Maybe it is my bias as an ex-big-data engineer / current data scientist, but it seems to me that a lot of the tooling is about as simple as it can be (yes, yes, complacency is the enemy of good; I mean there is no obvious low-hanging fruit left to improve).
Tableau has a data ETL tool called “Prep” that helps with this problem, but it only goes so far. Beyond that point, I think the problem truly requires a data engineer.
There are plenty of modern-day ETL tools like Funnel, Improvado or Dataddo to help with that part of the puzzle, though it does mean you have to pay for another SaaS each month on top of Tableau.
Exactly. Instead of an ETL tool, start writing your own Perl and various logical, reusable components. Roll your own ETL, however you want it, in a terminal. So what if you have to learn vim, big deal! Mouse-driven interfaces are a huge part of the dysfunction.
Yeah, I was a little confused here until I realized I would just write some bash, Python, Perl, etc. script where some would advocate for complicated tools.
And after a few years you leave your job, a new person comes in and gets stuck with your script soup and lack of documentation.
Companies prefer well known products like Alteryx or Tableau because, despite the cost, it makes people easier to replace.
But I can't blame you for writing your own things. I'm currently replacing a large SSIS-based ETL process with Python, because I'm sick of SSIS randomly breaking.
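For anyone curious what the hand-rolled version tends to look like, here is a hypothetical sketch of one small, logged, rerunnable step; table names, file paths and connection strings are made up.

```python
# A hypothetical flavour of the "just write the script yourself" approach:
# one small, logged, rerunnable step instead of a GUI package. Table names,
# file paths and connection strings are placeholders.
import logging

import pandas as pd
from sqlalchemy import create_engine

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("nightly_sales_load")

def run() -> None:
    # Extract: a drop file from the sales team, whatever shape it is this week.
    raw = pd.read_csv("/data/incoming/sales_export.csv", dtype=str)
    log.info("read %d rows", len(raw))

    # Transform: normalise column names, types and obvious junk.
    raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
    raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
    clean = raw.dropna(subset=["order_id", "amount"])
    log.info("dropped %d bad rows", len(raw) - len(clean))

    # Load: replace the staging table; the warehouse owns history, not this script.
    engine = create_engine("postgresql://etl@warehouse/analytics")
    clean.to_sql("stg_sales", engine, if_exists="replace", index=False)
    log.info("loaded %d rows into stg_sales", len(clean))

if __name__ == "__main__":
    run()
```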
You make it sound as if this were a bad thing. RDBMS work well for many use cases. There are plenty of tools around to work with them. Good open source implementations exist.
The problem space (business) is not complex; it's incredibly simple.
Unfortunately the design patterns we use and the tooling are flawed.
It has been this way for at least 25 years, and the RDBMS, languages, design patterns, and architectures (2-tier/3-tier) are the cause.
They make it simple to get started and even without knowing what you are doing you can easily churn out something that works if it is simple, doesn't change often, doesn't need to scale and deals with small amounts of data.
"It represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy."
I feel like you are coming from an alternate universe. NoSQL is the quick and easy thing to start with; then your needs become more and more complex and NoSQL just won't cut it anymore. Sure, NoSQL can scale and perform, but only if your needs are very specific and simple.
There are no major systems out there of even moderate complexity that aren't built on an RDBMS.
Sorry, I didn't explain it clearly. RDBMS are the reason we write business applications the way we do now. They are the root cause; swapping out the RDBMS for NoSQL will solve nothing, because our languages, architectures, patterns, and libraries, and how we even think about solving these problems, all evolved on top of this and are flawed.
That's the problem: business systems are not complex, they are incredibly simple. They are made complex by the way we think about and structure our models and then interact with them. We have the wrong boundaries and the wrong languages.
Would you be able to point me towards modelling approaches/boundaries/languages that would be more appropriate? I'd be interested to learn about better alternatives, as I don't yet see the big flaws in relational models
Counter-intuitively, Datomic is in violent agreement with /u/rqmedes where he said "A better alternative is having the data, data model and business logic tightly bound in one place. Not separated in multiple 'tiers'". Datomic inverts/unbundles the standard database architecture so that the cached database index values are distributed out and co-located with your application code, which means database queries are amortized to local cost. Immutability in the database is how this is possible without sacrificing strong consistency; basically, if git were a database, you would end up at Datomic.
Unfortunately there are no real alternatives. It's like operating systems: one or two systems have so much momentum that using anything else becomes extremely difficult, even when they are inferior in certain domains. See http://www.fixup.fi/misc/usenix-login-2015/login_oct15_02_ka...
I played with Denodo (data virtualization software) a couple years ago and thought it was pretty legit.
In theory, it could be used to provide that industrial strength abstraction layer between your Tableau/Looker/etc. and your bajillion weird and not-so-weird (RDBMS) data sources.
That would seem to make sense to me from the point of view of -- I would want my data visualization/analytics-type company to be able to concentrate on data visualization/analytics, not building some insane and never-ending data abstraction layer.
The part that surprised me was that Denodo could allegedly do a lot of smart data caching, thus speeding things up (esp hadoop-oriented data sources) and keeping costs down.
I'm guessing the other data virtualization providers can do similar.
I have had to work with Denodo for over a year now, and it has been a total nightmare. Data virtualization is a "good in theory" concept but a "doesn't work in practice" reality. Going back to the original sources for each query doesn't work; it will always be slower than using a proper analytics data warehouse. Caching doesn't help, because at that point you can just do ETL. Also, Denodo itself is full of weird behaviors and bugs; my team collectively decided it's worth the most hate of all the "enterprise" tools we use. One thing Denodo is good for is as an "access layer", but then maybe PrestoDB would be worth a shot, or maybe even just SQLAlchemy and Python.
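To illustrate the "just SQLAlchemy and Python" access layer idea, here is a minimal sketch, assuming the PyHive driver so SQLAlchemy understands presto:// URLs; the host, catalog and query are placeholders.

```python
# Sketch of a thin Python access layer over Presto, assuming the PyHive driver
# (pip install 'pyhive[presto]' sqlalchemy pandas). Host, catalog and the query
# below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("presto://analyst@presto-coordinator:8080/hive/analytics")

def fetch(sql: str) -> pd.DataFrame:
    """Run a query against the cluster and return the result as a DataFrame."""
    return pd.read_sql(sql, engine)

df = fetch("SELECT region, sum(amount) AS revenue FROM orders GROUP BY region")
print(df.head())
```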
I don't understand why this gets downvoted. While it may lack context (with the aim of being controversial), it sparked a healthy amount of discussion here!
My Salesforce clients have increasingly been considering Tableau. Tableau has great out-of-box discoverability of geo fields, dimensions and facts. The ability to auto-suggest a map of the U.S. based on data containing City/State is "magic" to power users.
The only barriers to Salesforce + Tableau adoption I noticed were cross-object JOINs and live vs cached data extracts.
Both issues were remedied by denormalizing the data prior to export. For example, a nightly flattened "view" of Opportunities with key related objects moved into columns.
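As a hypothetical illustration of that nightly flattening step (object and field names are made up, not a real org schema):

```python
# Hypothetical nightly "flatten before export" step: join Opportunities to
# their Accounts, move the fields you care about into columns, and hand
# Tableau one wide table. Field names are illustrative only.
import pandas as pd

opportunities = pd.read_csv("/exports/opportunities.csv")  # Id, Name, Amount, StageName, AccountId
accounts = pd.read_csv("/exports/accounts.csv")            # Id, Name, Industry, BillingState

flat = opportunities.merge(
    accounts.rename(columns={"Id": "AccountId", "Name": "AccountName"}),
    on="AccountId",
    how="left",
)

# One denormalized row per Opportunity, ready to become a Tableau extract.
flat.to_csv("/exports/opportunities_flat.csv", index=False)
```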
Mulesoft is well suited to handle the ETL challenges here. Bringing it to the table could be a win for everyone.
Tableau might be friendly to analysts/end users, but it's a mess at the back end. The Linux version they released is no more than a wrapper of the original C code in a JVM, causing huge performance issues and memory leaks. That being said, GL CRM.
> I think the development of Python and Jupyter and other less known things like Vega are much more interesting.
In that case you may be interested in Dash (dash.plot.ly). It’s a free and open source library that you can use to create dashboards online with Python only.
We love Dash on our team, anything more than a Tableau dashboard goes into Dash. You can basically just treat it as a Flask app.
We write our back ends with FastAPI[1], which is usually just a wrapper around our ML models, then serve both Dash and FastAPI with gunicorn. The backend is given the uvicorn[2] worker class via the gunicorn -k arg[3], which greatly increases the speed as well.
For personal projects you can use this stack in GCP's AppEngine standard environment to basically host your (relatively low traffic) apps for free.
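For anyone who hasn't tried Dash, a minimal sketch of that kind of app looks roughly like this; the CSV path and column names are made up, and in production gunicorn serves `app.server` (the underlying Flask app).

```python
# Minimal sketch of a Dash app: a dropdown wired to a plotly figure through
# one callback. The CSV path and column names are made up.
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

df = pd.read_csv("sales.csv")  # expects 'region', 'month', 'revenue' columns

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(sorted(df["region"].unique()), id="region"),
    dcc.Graph(id="chart"),
])

@app.callback(Output("chart", "figure"), Input("region", "value"))
def update_chart(region):
    view = df if region is None else df[df["region"] == region]
    return px.line(view, x="month", y="revenue", color="region")

if __name__ == "__main__":
    # In production, gunicorn serves `app.server` (the underlying Flask app);
    # `app.run_server` is the equivalent call on older Dash versions.
    app.run(debug=True)
```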
That's not totally accurate... If you use our SaaS Chart Studio product then yes, but otherwise (i.e. in 'offline' mode) you can use our Python, R and Javascript libraries as much as you like: they're MIT licensed.
I'd like to add that Dash helped me grok React far quicker than I'd expected. I code in a lot of React now and I'm only here because of Dash. Thank you for your hard work!
The backend has also evolved massively, from HBase to the cloud-powered data warehouses. We have the ability to ingest and query petabytes with single-second delays now. There's also on-demand querying like Presto/Drill/Dremio, ETL systems like CBT, and the growing space of "data lineage" for seeing how data is connected and has evolved over time.
The real issue has always been the organizational problems of larger teams and companies as data gets split into multiple silos and needs ETL and cleanup before it's useful. The new abilities we have gained have increased the complexity and scale which can lead to new challenges, but the tools are definitely getting better every day.
If you have to write code for this stuff 90% of people won't be able to analyze things, and 90% of analyses won't happen because they won't be worth the time.
I found the same thing with MicroStrategy. I spent a lot of time reverse engineering what I could from MicroStrategy jars to expose additional functions in their plugin interface (which is so incomplete it shouldn't be advertised). But the reality is that it's a 20+ year old system with front-end updates; you can only put so many band-aids on it.
I think the only thing keeping MicroStrategy alive is its cube functionality and the businesses who have invested too much into it.
It's a JavaScript graphing library.
We've started using it at work a bit for our online dynamic graphs. It's a graphics grammar on top of the D3 JS library.
We seem to prefer the Lite version, which is simpler.
If you look at the examples, you can click a button and go to a dynamic editor, which we rather like.[1]
Vega is nice for its declarative, portable nature. Also, it makes certain things very easy such as embedding interactive Vega charts and generating images of charts without needing something like PhantomJS.
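Since the thread is mostly Python-centric, here is a small hypothetical illustration of that declarative/portable angle using Altair, which simply emits a Vega-Lite spec (the data is made up):

```python
# Small illustration of the declarative angle from Python: Altair builds a
# Vega-Lite spec, which is just JSON you can embed anywhere or render
# server-side. The data here is made up.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120, 135, 128, 160],
})

chart = (
    alt.Chart(df)
    .mark_line(point=True)
    .encode(x="month", y="revenue", tooltip=["month", "revenue"])
)

print(chart.to_json())      # the portable Vega-Lite spec itself
chart.save("revenue.html")  # self-contained page; .png/.svg export needs an
                            # extra converter backend such as vl-convert
```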
Hey - I'm hacking on a product which allows you to build interactive browser-based apps/dashboards from Python and dataset blocks. You can return altair (vega) plots from your Python and they'll get rendered in the app, and we're just adding the ability to import existing Jupyter notebooks this week.
It's still really early, but feel free to have a play and create an app. Here is an example app examining using the Prophet forecasting library: https://nstack.com/apps/rdA647Q/
I'd love any feedback, and if you'd like to chat to learn more, reach out to me on leo@nstack.com.
I can relate. The sales pitch is still “so easy anyone can do it”. As a result, corporate money is thrown after tools that enable business folks to build dashboards...which isn’t a good use of their time. :). In truth, they need good data folks working with them. I guess that doesn’t make for a sexy product sales pitch.
I think the counterargument is that it's not a good use of their time to convince someone to build it for them, when they can finally do it themselves in half the time :-)
Lindy effect: if it has survived this long, it will probably survive in the future too. Unless the tech-savviness of individuals working in non-tech roles goes up significantly, I find it hard to believe that it will change.
When I look at Jupyter notebooks, I ask myself why anyone bothers with out-of-the-box reporting tools or grows their own. Jupyter notebooks are the right mix of customizable and self-documenting.
The kinds of analysis and visualizations you can build in 5 minutes with Tableau (and the ability to explore the space of possible analyses and visualizations) would take hours of futzing to reproduce with Python.
You can build a basic report in 5 minutes. Then you’ll spend hours tweaking it and making all the changes the boss wants. And then hours more discovering there are things you simply can’t do (but your boss won’t accept that).
And next week you will have to do it all again, because it’s all manual.
How many business users are going to use Jupyter notebooks? Meanwhile, someone with basic computing experience can create in-depth reports with Tableau in less than an hour.
> How many business users are going to use Jupyter notebooks?
None, because it's too much programming for IT to let business people have access to it, and it's not disguised as an office productivity app the way Excel is.
If they had access to it and had basic training on it that anyone already competent in any vaguely quantitative domain could handle, plenty of them could and would.
At least judging by my experience with SQL shells and similar tools that are both less powerful and less friendly than Jupyter + Python, and yet plenty of business people used them productively in enterprise environments (often right up until IT ripped them from their hands).
Python is making some inroads in shops that have been using SAS and/or R, but it’s a hard sell. Jupyter is even harder to push because of the server element.
SAS and R are pretty common business user tools that are basically the same thing as Jupyter. Tableau is nice, but you generally need something else to prep data for it, whether that's an ETL process IT sets up or something manual that an analyst publishes.
The backend stack of both makes development like swimming in glue. The same managers you are referring to cut us over to these tools, and we subsequently experienced a massive decline in our business.
Are you referring to APM/infrastructure monitoring, etc.? I was curious to find out if you're thinking BI capabilities need to be just as powerful at the backend as these frontend tools.
In business analytics, backend > frontend. If you are a business analyst you want all your data at your fingertips plus the ability to run somewhat complex transformations and calculations. You also need to be able to validate data with your own rules, because you can't always trust data from warehouses and other sources. Then you want to take all your work and schedule it, so it runs automatically and refreshes all the reports depending on it. Finally, you want triggers on data changes/updates and good looking email notifications with screenshots, attachments, etc.

Of all of the above, end users might only see the dashboards, PDFs or Excel files on the frontend side. If you have a solid backend, you will be able to serve your customers at multiple times the speed and with a lot more insightful reports. At that point the difference between a Tableau dashboard (which does look good) and, say, Superset (which does not look as good) is somewhat negligible.
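As a hedged sketch of one small piece of such a backend, a validation rule that runs on a schedule and emails a warning when it fails might look roughly like this; the connection string, rule and addresses are placeholders, and a real pipeline would use cron or an orchestrator like Airflow rather than the `schedule` library.

```python
# Hypothetical sketch: a data-validation rule that runs every morning and
# emails a warning when it fails. Connection strings, the business rule and
# the addresses are placeholders for illustration only.
import smtplib
import time
from email.message import EmailMessage

import pandas as pd
import schedule
from sqlalchemy import create_engine

engine = create_engine("postgresql://analyst@warehouse/analytics")

def check_yesterdays_revenue() -> None:
    df = pd.read_sql(
        "SELECT sum(amount) AS revenue FROM orders "
        "WHERE ordered_at = current_date - 1",
        engine,
    )
    revenue = df["revenue"].iloc[0] or 0
    if revenue < 10_000:  # placeholder rule: yesterday's revenue looks too low
        msg = EmailMessage()
        msg["Subject"] = f"Data check failed: revenue {revenue:.0f} looks low"
        msg["From"] = "bi-alerts@example.com"
        msg["To"] = "analysts@example.com"
        msg.set_content("Check the orders feed before the morning dashboards refresh.")
        with smtplib.SMTP("smtp.example.com") as smtp:
            smtp.send_message(msg)

# Run the check every day before the reports refresh.
schedule.every().day.at("06:00").do(check_yesterdays_revenue)

while True:
    schedule.run_pending()
    time.sleep(60)
```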
Yeah, it's certainly more than just a CRUD, and I agree it's awful. I never had to use it in the early years, so I don't know if it's that their original platform was poorly made, or that the accumulation of bloat & bolt-ons is what has made it this way.
Yeah, but to someone who is into working directly with RDBMSs, Salesforce is 100% front end. It's like how CEs think C is high level, but web developers think C is low level.
> The goal of the Vega project is to promote an ecosystem of usable and interoperable tools, supporting use cases ranging from exploratory data analysis to effective communication via custom visualization design.[0]