What does that mean or imply?

NortySpock · on April 6, 2023

People inside a company prefer to have financial and operational information to make informed operational decisions or projections.

Sometimes that data is stored in a variety of places (excel files, a smattering of disconnected databases, sometimes enriching the data with more data pulled from an API...).

Extracting that data to a common location (historically called a data warehouse) tends to be the work of a data pipeline, and the tool used to join, display, filter, dynamically aggregate, and visualize the data has been historically categorized as a "Business Intelligence" tool. Normally these BI tools provide data caching and allow temporary integration of multiple data sources.

The intent is to make it easy for business users to explore, analyze, visualize, share, and present datasets or results.

The biggest examples off the top of my head would be PowerBI, Tableau, and Apache Superset, but "data reporting tool" is a competitive market with many entrants.

CyberDildonics · on April 6, 2023

Sounds like essentially consolidating data in a database and visualizing it. Is there a difference between a database and a "data warehouse"?

aidenn0 · on April 6, 2023

A data warehouse is a specific use-case for a database.

> Sounds like essentially consolidating data in a database and visualizing it

Yes, and the database in which you consolidate the data is called a "data warehouse" In many cases the sources of data are ... not optimal for querying. (And I mean like "Spread across a dozen excel spreadsheets that people e-mail out new versions of when they make a change" by "not optimal")

CyberDildonics · on April 6, 2023

Consolidating data makes sense and taking it from sources you can't query makes sense, but once you put it into a source you can query, that sounds like a database. I'm not sure why there is a different term here.

adamc · on April 6, 2023

The database is often arranged and optimized differently than one used to support operations, and a lot of things are often done to make what would otherwise be long-running, slow queries faster -- e.g., aggregating or otherwise summarizing data likely to be queried in reports. But sure, it's still backed by a database.

aidenn0 · on April 6, 2023

We have multiple names at different levels of specificity for things all the time. I have a laptop and a desktop, but they are both "computers"

pnut · on April 6, 2023

Yes, they have fundamentally different architectures to serve their respective use cases.

CyberDildonics · on April 6, 2023

What is the different architecture and what is the different use case? Databases are already general tools. You put data in and query it. At what point does it become a "data warehouse" ?

NortySpock · on April 6, 2023

People argue over the definition of a data warehouse, but usually a data warehouse

(1) optimized for bulk inserts and bulk read operations, not updates or deletes, nor for single row updates

(2) stores data from multiple sources

(3) often a "columnstore" columnar table format is used. This format usually compresses columns in a format that is trivial to decompress (so if a column only contains a handful of similar flags, the database will run-length-encode them to optimize a full column scan), and additionally if you limit the number of columns accessed you can substantially reduce the cost of the query

(4) the data warehouse can easily shard or partition data

If you want to technical file format examples, I suggest looking up how the Parquet file format compresses data, or how Microsoft SQL Server columnstore tables work under-the-hood.

wintogreen74 · on April 6, 2023

they're both general terms, but serve different purposes. Data Warehouses tend to focus on read-specific operations, more of an emphasis on historical and/or analytical problems than transactional ones. Subsequently they are likely to optimize data ingress, distributed queries and downplay referential integrity and consistency. The different workloads promote different compute models, but you're right that the line is blurry. Similarly you could ask "Excel has multiple tables (sheets) that you can join and query; is it a database"?