We spin up a data lake and pipelines (we support 500+ integrations / connectors) to populate the data lake for you, then put DuckDB on top as a single query engine to access all your data.
This is really interesting. At my previous company, I built a data lakehouse for operational reporting with recency prioritization (query only recent data, archive the rest). While there was no LLM integration when I left, I've learned from former colleagues that they've since added a lightweight LLM layer on top (though I suspect Dustt's implementation is more comprehensive).
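For a sense of what "query only recent data, archive the rest" looked like in practice: the split was mostly a partition-lifecycle job. A minimal sketch in Python, assuming date-partitioned Parquet on S3; the bucket names and the 90-day hot window are hypothetical, not what we actually ran:

```python
# Minimal sketch of the "query recent, archive the rest" split, assuming
# date-partitioned Parquet on S3. Bucket names and the 90-day hot window
# are hypothetical.
from datetime import date, timedelta

import boto3

HOT_BUCKET = "ops-lake-hot"          # what the query layer scans
ARCHIVE_BUCKET = "ops-lake-archive"  # cold tier, queried on demand
RETENTION_DAYS = 90

s3 = boto3.client("s3")
cutoff = date.today() - timedelta(days=RETENTION_DAYS)

# Partitions are laid out as dt=YYYY-MM-DD/... prefixes; anything older
# than the cutoff gets copied to the archive bucket and removed from hot.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=HOT_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.startswith("dt="):
            continue
        partition_date = date.fromisoformat(key.split("/")[0][len("dt="):])
        if partition_date < cutoff:
            s3.copy_object(
                Bucket=ARCHIVE_BUCKET,
                Key=key,
                CopySource={"Bucket": HOT_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=HOT_BUCKET, Key=key)
```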
Our main requirement was querying recent operational data across daily/weekly/monthly/quarterly timeframes. The data sources included OLTP binlogs, OLAP views, SFDC, and about 15 other marketing platforms. We implemented a data lake with our own query and archival layers. This approach worked well for queries like "conversion rate per channel this quarter", where we needed broad data coverage (all 17 integrations) but manageable depth (a reasonable number of rows scanned).
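To make the query shape concrete, here's a hedged sketch of that conversion-rate query through the DuckDB Python API; the lake layout, paths, and column names below are assumptions, not our real schema:

```python
# Hedged sketch: broad coverage (one events table fed by every integration)
# but shallow depth, since partition pruning limits the rows scanned.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # assumes S3 credentials are already configured
con.execute("LOAD httpfs;")

conversion_by_channel = con.execute("""
    SELECT
        channel,
        count(*) FILTER (WHERE event_type = 'conversion')
            / nullif(count(*) FILTER (WHERE event_type = 'visit'), 0)
            AS conversion_rate
    FROM read_parquet('s3://ops-lake-hot/dt=*/events/*.parquet',
                      hive_partitioning = true)
    -- recency filter: only the current quarter's partitions get scanned,
    -- which is what kept depth manageable despite broad coverage
    WHERE CAST(dt AS DATE) >= date_trunc('quarter', current_date)
    GROUP BY channel
    ORDER BY conversion_rate DESC
""").fetchall()
```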
This architecture also enabled quick solutions for additional use cases, like on-the-fly SFDC data enrichment that our analytics team could handle independently. Later, I learned the team integrated LLMs as they began dumping OLAP views into the data lake for different query types, and eventually replaced our original query layer with DuckDB.
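The SFDC enrichment is the kind of thing DuckDB makes nearly free: join lake events against a Salesforce extract in one query, no extra pipeline. A sketch, with hypothetical file locations and join keys:

```python
# Sketch of on-the-fly enrichment: join hot-tier events against an SFDC
# accounts extract in a single DuckDB query. File locations and the
# account_id join key are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

enriched = con.execute("""
    SELECT
        e.*,
        a.industry,
        a.account_tier
    FROM read_parquet('s3://ops-lake-hot/dt=*/events/*.parquet',
                      hive_partitioning = true) AS e
    LEFT JOIN read_csv_auto('s3://ops-lake-hot/sfdc/accounts.csv') AS a
        ON e.account_id = a.account_id
""").df()  # returns a pandas DataFrame the analytics team can use directly
```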
I believe approaches like these (what I built as an in-house solution and what Definite may be doing more extensively) put the data and query patterns first. While it might initially seem like overkill, this approach can withstand organizational complexity, with LLMs serving primarily as an interpretation layer. From skimming the Dustt blog, their approach is refreshing, though it seems their product was built primarily for LLM integration rather than focusing first on data management and scale. They likely have internal mechanisms to handle various use cases that weren't detailed in the blog.