
> This abstraction shields users from the complexity of the underlying systems and allows us to add new data sources without changing the user experience.

Cursed mission. These sorts of things do work amazingly well for toy problem domains. But once you get into more complex business questions involving 4-way+ joins, things go sideways fast.

I think it might be possible to keep a human in the loop during the SQL authoring phase, but there's no way to do this cleanly without outside interaction in all cases.

95% correct might sound amazing at first, but it might as well be 0% in practice. You need to be perfectly correct when working with data in bulk with SQL operations.



A bit of a joke here:

It solves 100% of the cases where some manager requests a dashboard, never to look at it again after one day.


There's no smoke without fire.


> A bit of a joke here:

> It solves 100% of the cases where some manager requests a dashboard, never to look at it again after one day.

Better than having engineers build it for days (for the same result).


Using a semantic layer is the cleanest way to have a human in the loop. A human can validate and create all important metrics (e.g. what does "monthly active users" really mean) then an LLM can use that metric definition whenever asked for MAU.

With a semantic layer, you get the added benefit of writing queries in JSON instead of raw SQL. LLMs are much more consistent at writing a small JSON payload vs. hundreds of lines of SQL.

We[0] use Cube[1] for this. It's the best open-source semantic layer, but there are a couple of closed-source options too.

0 - https://www.definite.app/

1 - https://cube.dev/
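To make the JSON-vs-SQL point concrete, a semantic-layer query is typically a small declarative payload. Here's a hypothetical sketch loosely modeled on Cube's JSON query shape (the measure, dimension, and filter names are invented; check Cube's docs for the exact schema):

```json
{
  "measures": ["users.monthly_active_users"],
  "timeDimensions": [{
    "dimension": "users.last_seen_at",
    "granularity": "month",
    "dateRange": "last 6 months"
  }],
  "filters": [{
    "member": "users.plan",
    "operator": "equals",
    "values": ["enterprise"]
  }]
}
```

The LLM only has to pick validated names and fill in a few fields; the semantic layer owns the joins and the SQL generation, so there's far less surface area for it to get wrong.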


Currently exploring Cube for a "natural language to SQL" solution.

My schema: 90+ tables, 2,500+ columns, well documented.

From your experience, does Cube look like a fit? My use cases will definitely involve JOINs.


Yes, that shouldn't be a problem.

With that many tables, you might want to use Views: https://cube.dev/docs/reference/data-model/view


Thanks. Sorry, one more question: do we need a human in the loop with Cube to define the views for all kinds of queries?

In my use case, it's going to be exposed to various kinds of stakeholders, and user queries will be quite varied. I can't pre-create views/aggregations for all scenarios.


JSON generation against a semantic layer plus validation loops is definitely the easiest way to get high-9s success going directly from text to a correct query. For the human-in-the-loop cases, going directly to SQL can be fun. I've toyed with a SQL-like semantic layer that removes the need for direct table access and joins, which removes two of the risk points for LLMs going off the rails, while still leveraging a lot of the baked-in knowledge about SQL syntax (window functions, transformations, etc.) that can be hard to exhaustively bake into a semantic layer. (It's annoying when the semantic layer doesn't quite have what you need.)
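To illustrate the idea of a SQL-like layer with no direct table access: the layer would resolve metric and dimension names itself, so queries have no FROM clause and no joins, but standard SQL expression syntax still applies. This is purely a hypothetical sketch of such a dialect (all names invented):

```sql
-- Hypothetical SQL-like semantic-layer query: no tables, no joins.
-- The layer maps "monthly_active_users" and "plan" to its own model;
-- window functions and expressions work as in ordinary SQL.
SELECT
  signup_month,
  monthly_active_users,
  monthly_active_users
    - LAG(monthly_active_users) OVER (ORDER BY signup_month) AS mau_delta
WHERE plan = 'enterprise'
ORDER BY signup_month;
```

The appeal is that the two riskiest decisions for an LLM (which tables, which join keys) are taken away, while the long tail of SQL expression knowledge is kept.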


DuckDB has a `json_serialize_sql` function that we've been messing with. It could be an interesting middle ground: it lets you write SQL as JSON.
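For anyone curious, this is roughly what that looks like (using DuckDB's JSON extension; function names as I remember them from the docs, so verify before relying on this):

```sql
-- Serialize a SQL statement to a JSON AST the LLM can emit or patch:
SELECT json_serialize_sql('SELECT a, COUNT(*) AS n FROM t GROUP BY a');

-- The inverse exists too, so you can recover or execute the SQL:
SELECT json_deserialize_sql(json_serialize_sql('SELECT 1 + 2'));
SELECT * FROM json_execute_serialized_sql(json_serialize_sql('SELECT 1 + 2'));
```

That gives you JSON's structural consistency for generation while keeping full SQL semantics underneath.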


Completely agree! A semantic layer is essential for scaling analytics to enterprise complexity.

Another alternative here is Veezoo [0], which combines the semantic layer (a Knowledge Graph) and self-service analytics into one integrated solution.

We built it specifically for the analytics use-case for both the "data persona" to manage the semantic layer, as well as for the "business persona" to analyze the data.

If you're looking for a semantic layer plus (embedded) BI solution right out of the box, this could be a fit.

0 - https://www.veezoo.com


How well is that working for you?

We use a pattern where we ETL things into tables that model the upstream source closely, then use SQL Views to tighten up the model and integrate across data sources where needed. Keeping this all inside one DB allows us to use tools that understand the schema for autocomplete, etc.

I expect the developer experience would be significantly worse if we started writing views in YAML instead of SQL… but you’ve found the opposite?


Do you care about self-serve analytics (i.e. allowing people that don't know SQL to explore your data)?

A semantic layer is the best way to enable self-serve analytics, but if you don't care about it, it's probably not worth the hassle.

We also use the semantic layer for other nice things like setting goals, column descriptions and other metadata.


Hi, you're right that things can go sideways fast. In practice, though, the data that the typical employee needs is also quite simple. So there is definitely a very nice fit for this kind of product across a large number of use cases where we do see it provide a lot of value internally, both for employees (self-serve access to data) and for data scientists (reduced load).

For complex queries/use cases, we generally instead push our users to create agents that assist them in shaping SQL directly, rather than going straight from text to results/graphs. This pushes them to think more about correctness while still saving them tons of time (the agent has access to the table schemas, etc.), but it's not a good fit for non-technical people, of course.


This works well until it doesn't, i.e. only as long as there is someone responsible for the data's correctness. E.g., the cardinality between two joined tables needs to be actually enforced, rather than relying on "there's currently no one in the system with two locations in the employee_to_location table, so it works right now." Once that changes, this query will return the wrong employee count.
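The failure mode described above is classic join fan-out. As a concrete sketch (hypothetical tables):

```sql
-- Looks correct while employee_to_location happens to be 1:1 ...
SELECT COUNT(*) AS employee_count
FROM employees e
JOIN employee_to_location el ON el.employee_id = e.id;

-- ... but the moment one employee has two rows in employee_to_location,
-- that employee is counted twice. COUNT(DISTINCT e.id) -- or a UNIQUE
-- constraint on el.employee_id -- is what actually pins the cardinality.
```

An LLM (or a human) writing the first query against today's data has no way to know the assumption is about to break.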


It does require writing good instructions for the LLM to properly use the tables, and it works best if you carefully pick the tables that your agent is allowed to use beforehand. We have many users who use it for everyday work with real data (definitely not toy problems).


If only we had a language to accurately describe what we want to retrieve from the database! Alas, one can only dream!


> It does require writing good instructions for the LLM to properly use the tables

--- start quote ---

prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:

- training set

- weights

- constraints on the model

- layers between you and the model that transform both your input and the model's output that can change at any time

- availability of compute for your specific query

- and definitely some more details I haven't thought of

https://dmitriid.com/prompting-llms-is-not-engineering

--- end quote ---


What else is engineering then if not taming the unknown and the unknowable? How is building a bridge any different? Do you know everything in advance about the composition of terrain, the traffic, the wind and the earthquakes? Or are you making educated guesses about unknown quantities to get something that fits into some parameters that are Good Enough(TM) for the given purpose?


> and the unknowable

This is the crux. Sure, for high-level software (e.g. web apps), many parts of the system will feel like black boxes, but low-level software does not generally have this problem. Sometimes you have to deal with a binary blob driver, but more often than not you're in control of, or at least able to debug, almost all of the software running on your system.

> Building a bridge

There should NOT be significant unknowns when you're building a bridge; that is how people die. You turn those parameters into "knowns with high confidence," which is not something you can even begin to do for the LLM parameters described above.


> How is building a bridge any different?

In absolutely every way that matters and in all the details that don't matter.

> Do you know everything in advance about the composition of terrain, the traffic, the wind and the earthquakes?

No, and there are established procedures for determining those facts.

"This magical incantation that I pretend works better because the US is asleep and more compute is available" is not such a procedure.


Yes, you are perfectly right. Our product pushes users to be selective about which tables they give a given agent access to for a given use case :+1:

The tricky part is correctly supporting multiple systems, each with its own specificities, all the way to Salesforce, which is an entirely different beast in terms of query language. We're working on it right now and will likely follow up with a blog post :+1:


Salesforce architect here (from partner firm, not the mothership directly)--Salesforce's query language, SOQL, is definitely a different beast as you say. I'd like to learn more about the issues you're having with the integration, specifically the permissions enforcement. I may be misunderstanding what you meant in the blog post, but if you're passing a SOQL query through the REST API then the results will be scoped by default to the permissions of the user that went through the OAuth flow. My email is in my profile if you're open to connecting.


Some people say agents can help, but it's still difficult to pull off without a robust human in the loop: https://medium.com/thoughts-on-machine-learning/build-a-text...



