Show HN: Natural Language to SQL "Text-to-SQL" API (dataherald.com)
62 points by saigal on Feb 14, 2024 | 40 comments
Hi HN! Today we are releasing the hosted API for our natural-language-to-SQL engine, which allows you to:

(1) Explain Your Data: Feed in data dictionaries, dbt models, schemas, and Confluence docs; we'll understand the business context of your data.

(2) Train Your AI: Fine-tune an LLM (including GPT-4) specifically for your data, increasing accuracy and lowering latency.

(3) Trust the Answer: See a confidence score with each AI-generated query and stay in control.

(4) Run Complex Queries: Generate and execute non-trivial SQL, including multi-table joins and window functions.

Problem background: Developers struggle to build NL-to-SQL into products because LLMs do not work out of the box; they lack your metadata and business definitions. Existing NL-to-SQL tools struggle with context, complexity, and adapting to your data.

For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA, or would confirm the state with the asker in a follow-up. An LLM, however, translates it to:

  select price from rent_prices where city = 'Los Angeles' AND month = '05' AND year = '2023'
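A reasonable human, by contrast, would write something closer to this (illustrative only, and assuming the table also carries a state column):

  select avg(price) from rent_prices where city = 'Los Angeles' AND state = 'CA' AND month = '05' AND year = '2023'
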
Dataherald integrates with major databases and data warehouses, including PostgreSQL, Databricks, Snowflake, BigQuery, and DuckDB.

You can try it now for free: no fees, no credit card, no sales pitch; just get an API key and get going. Let us know if it works for you, even on your complex queries. (https://console.dataherald.ai/playground)

While the open-source version works just fine (https://github.com/Dataherald/dataherald), the hosted API might be a better fit for those looking for: (1) someone else to take care of infrastructure setup, (2) access to an admin UI console where you can configure and monitor performance, and (3) the ability to invite team members to a project.

We're looking for feedback, particularly from anyone who can compare this performance to other NL-to-SQL products. Share your thoughts and join the conversation. For more background on the release: https://www.dataherald.com/news/introducing-dhai




> what was the average rent in Los Angeles in May 2023?

This is actually an example of why I don't think LLM SQL generators are going to be that valuable to most companies. The biggest hurdle for people to get "value" from their data is rarely the SQL part - it's knowing what the question is actually trying to answer. When someone asks that question, do they mean the average rent of new listings from that month? Rent paid by all renters? All sizes and building types? These are all "right" answers, just to very different questions. And no amount of data dictionaries, dbt models, or other context can narrow that down.

(Not trying to take anything away from this team, who have built a pretty well-made product. I just think their approach doesn't quite get to the underlying business problem.)


This is a good point and one that we've grappled with a lot. My view after working on 30+ company deployments is that it depends on the use case. Here's an example: you have a very analytical BizOps lead at a mid-market company. He/she is smart, knows data, can do pivot tables, the whole shebang. But every time they need data that isn't in the BI tool, they file a Jira ticket for a data analyst to fetch it, because they don't know SQL well enough.

For them the Dataherald engine is perfect. They can use natural language, admittedly with a few iterations, to get the cut of data they seek. This cuts hours, or even days, off their inquiry.


I think fine-tuning is too inflexible and costly for working with something as versatile as database schemas; I would recommend looking into RAG, e.g. https://www.sqlai.ai/posts/enhancing-ai-accuracy-for-sql-gen....


Hi -- we do use fine-tuning together with RAG. To get best-in-class performance for NL-to-SQL you definitely need to combine both. The good folks at OpenAI dove into this during the last dev day: https://youtu.be/ahnGLM-RC1Y?si=7fv_JTScpBR9lK1R&t=2370


This looks very polished, and a more self-serve experience than what some other companies have built.

Seeing what has happened with the Assistants API, do you expect OpenAI to soon introduce an SQL API as well?

As someone in the position to evaluate integrating a text-to-SQL pipeline in our product, I'm left wondering "why not just wait until OpenAI does this?" Especially when you consider the pipeline ends at OpenAI's model anyway. How long will the current crop of productized text-to-SQL pipelines really be around for?


Thanks so much for the kind words.

Getting NL-to-SQL to work at enterprise grade takes work and a specialized engine/agent (we use fine-tuning together with RAG). I don't believe this will be an objective for OpenAI. Additionally, there are plenty of companies who will want to use an LLM not governed by OpenAI (e.g. an open-source model). Dataherald will ultimately allow anyone to swap in the LLM of their choosing. At the moment, however, GPT-4 performs far and above any other model we've tested.


Hi dataherald team. Firstly, congratulations on the launch. I’ve been in the data space for a while and know all the effort it must be to get this up and running.

As a consultant, I'm considering this as part of an offering to a client. I was hoping to chat more about the security/privacy and operational model of the hosted vs. open-source offerings. Who should I reach out to?


Anuj[at]dataherald.com


What do you mean by complex SQL? How complex?


I've been playing with this product self-hosted for a few weeks. It can join across multiple tables and use window functions. I haven't tried self-joins yet, nor have I put much effort into tuning with Golden SQL or other documentation.

I would put this at the skill level of a junior data engineer. It's pretty impressive.


Glad you've been enjoying it. Feel free to reach out at amir (at) dataherald.com if you need any additional help setting up.


For large databases, LLMs do not perform well if you pass in the entire schema (you either run into context-window limits or confuse the LLM with too much information). There is a schema-linking step that identifies the relevant subset of the schema and passes only that. Schema linking is also part of the fine-tuning process.


Good, but what do you mean by complex SQL? How complex?


The largest we have successfully deployed on is the OSQuery schema (https://osquery.io/), which has 277 tables and lots of business context (malware, vulnerabilities, Windows registry keys, etc.).


> tell me how much data I have in my largest table and show me all of the data

  select * from dbo.bigdata

three rows down

FUCKING HELL, OUR SQL SERVER IS LOCKED UP!!!!!!!!!!!!!


While the agent does execute the query in order to recover from errors, the SQLAlchemy execution is limited to a few rows only, so if the server is locked up there is probably something else going on ;)


Can I trust the generated output SQL? I want to turn something like this into a frontend for my customers to use.


Hey man, I'm building something in this area too, but focused on building specific functions on an iterative basis. Skim through the video here: https://v2.connectedflow.app/ . Is this something you'd use? (Mind you, we're two weeks into it.)


Yes, absolutely. Every AI-generated SQL query comes with a confidence score, so you stay in control. We've had people set a confidence threshold for returning answers to users; if the threshold isn't met, a human-in-the-loop steps in.


I see. Just so I'm clear, because this is important: if I set the confidence threshold very high, does that mean my customers cannot create a malicious query? I don't want them deleting data or accessing rows they're not allowed to.


All DML commands are blocked by the engine. You can wrap the returned SQL in a CTE and pass through only the rows the customer is allowed to access, as sketched below.
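
For example (a sketch only; tenant_id is a hypothetical row-ownership column, and :current_tenant would be bound server-side, never by the user):

  with ai_query as (
      -- the AI-generated SQL is inlined here verbatim, e.g.:
      select * from rent_prices
  )
  select * from ai_query
  where tenant_id = :current_tenant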


Wouldn't the AI-generated query need knowledge of the CTE that will be wrapping it? How would the CTE prevent arbitrary joins, or access to tables via a fully-qualified `schema.table`? And couldn't somebody execute an arbitrary function on the SQL server, e.g. `pg_sleep(9999999)`?


You could set a low query execution timeout for the session.
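
In Postgres, for instance, a session-level statement timeout aborts anything long-running, pg_sleep included:

  set statement_timeout = '5s';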


It's an incomplete solution.


How does someone take structured data from a database and turn it into unstructured data for an LLM? I know RAG is a thing, but it's usually discussed in terms of documents, not tables and fields.

Could the LLM derive that same Los Angeles rent answer just from knowing the dataset, without writing SQL?


Rather than converting all of the structured data into unstructured text, we give the language model the database schema plus a few sample rows, and task it with generating a SQL query that answers the question. The alternative you suggest - turning all the structured data into text and training an LLM to answer directly - runs into a couple of problems:

Linearizing structured data into text loses the organization into columns and rows, obscuring the structured information.

Large tables hold so much information that training or fine-tuning on them requires an extensive number of tokens, which is prohibitive for many due to resource constraints.
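
Concretely, the prompt might contain something like this (a hypothetical table; the sample values are made up):

  create table rent_prices (
    city  text,
    state text,
    month text,
    year  text,
    price numeric
  );
  -- plus a few sample rows, e.g.:
  -- ('Los Angeles', 'CA', '05', '2023', 2850.00)
  -- ('San Diego',   'CA', '05', '2023', 2310.00)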


The given example will return a list of prices, not the average.


Comment needs to be higher up!


Is it possible to use your own LLM models with Dataherald?


This is on the roadmap. Which LLM are you focused on using? So far GPT-4 has performed best, head and shoulders above other LLMs, so as a result that's the one we are using now.


SQLCoder claims to perform better than OpenAI's models.


Do you keep customer query data? I'm curious what data you retain about your customers.


We do store golden SQL, query history, and schemas. Of course, everything is encrypted at rest and in transit.


FYI, the sign-up form does not play well with a leading password manager.


Can you tell me more?


The submitted link results in a login wall. It would probably be worth updating the submission to the introducing-dhai page that's mentioned all the way at the end here, as getting greeted by a login prompt that way isn't going to entice people to explore or learn about the product.


We've switched it from https://console.dataherald.ai/playground at the OP's request.


I'll try to change out the link.


The API introduction announcement is here: https://www.dataherald.com/news/introducing-dhai


If you don't want to integrate an API and instead want an interface for everyone to use that _just works_ with charting, check out patterns.app.



