Show HN: Natural Language to SQL "Text-to-SQL" API (dataherald.com)
62 points by saigal on Feb 14, 2024 | 40 comments
Hi HN! Today we are releasing the hosted API for our natural-language-to-SQL engine, which allows you to:

(1) Explain Your Data: Feed in data dictionaries, dbt models, schemas, and Confluence docs; we'll understand the business context of your data.

(2) Train Your AI: Fine-tune an LLM (including GPT-4) specifically for your data, increasing accuracy and lowering latency.

(3) Trust the Answer: See a confidence score with each AI-generated query and stay in control.

(4) Run Complex Queries: Generate and execute non-trivial SQL, including multi-table joins and window functions.

Problem background: Developers struggle to build NL-to-SQL into products because LLMs do not work out of the box; they lack your metadata and business definitions. Existing NL-to-SQL tools struggle with context, complexity, and adapting to your data.

For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA, or would confirm the state with the asker in a follow-up. An LLM, however, translates it to:

  select price from rent_prices where city = 'Los Angeles' AND month = '05' AND year = '2023'
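A reasonable human, by contrast, would write something closer to this (illustrative only, and assuming the table also carries a state column):

  select avg(price) from rent_prices where city = 'Los Angeles' AND state = 'CA' AND month = '05' AND year = '2023'
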
Dataherald integrates with major databases and data warehouses, including PostgreSQL, Databricks, Snowflake, BigQuery, and DuckDB.

You can try it now for free: no fees, no credit card, no sales pitch; just get an API key and get going. Let us know if it works for you, even on your complex queries. (https://console.dataherald.ai/playground)

While the open-source version works just fine (https://github.com/Dataherald/dataherald), the hosted API might be a better fit for those looking for: (1) someone else to take care of infrastructure setup, (2) access to an admin UI console where you can configure and monitor performance, and (3) the ability to invite team members to a project.

We're looking for feedback, particularly from anyone who can compare this performance to other NL-to-SQL products. Share your thoughts and join the conversation. For more background on the release: https://www.dataherald.com/news/introducing-dhai




> what was the average rent in Los Angeles in May 2023?

This is actually an example of why I don't think LLM SQL generators are going to be that valuable to most companies. The biggest hurdle for people to get "value" from their data is rarely the SQL part - it's knowing what the question is actually trying to answer. When someone asks that question, do they mean the average rent of new listings from that month? Rent paid by all renters? All sizes and building types? These are all "right" answers, just to very different questions. And no amount of data dictionaries, dbt models, or other context can narrow that down.

(Not trying to take anything away from this team, who have built a pretty well-made product. I just think their approach doesn't quite get to the underlying business problem.)


This is a good point and one that we've grappled with a lot. My view after working on 30+ company deployments is that it depends on the use case. Here's an example: you have a very analytical BizOps lead at a mid-market company. He/she is smart, knows data, can do pivot tables, the whole shebang. But every time they need data that isn't in the BI tool, they file a Jira ticket for a data analyst to fetch it, because they don't know SQL well enough.

For them the Dataherald engine is perfect. They can use natural language, admittedly with a few iterations, to get the cut of data they seek. This cuts hours, or even days, off their inquiry.


I think fine-tuning is too inflexible and costly for working with something as versatile as database schemas; I would recommend looking into RAG, e.g. https://www.sqlai.ai/posts/enhancing-ai-accuracy-for-sql-gen....


Hi -- we do use fine-tuning together with RAG. To get best-in-class performance for NL-to-SQL you definitely need to combine both. The good folks at OpenAI dove into this during the last dev day: https://youtu.be/ahnGLM-RC1Y?si=7fv_JTScpBR9lK1R&t=2370


This looks very polished, and a more self-serve experience than what some other companies have built.

Seeing what has happened with the Assistants API, do you expect OpenAI to soon introduce an SQL API as well?

As someone in the position to evaluate integrating a text-to-SQL pipeline in our product, I'm left wondering "why not just wait until OpenAI does this?" Especially when you consider the pipeline ends at OpenAI's model anyway. How long will the current crop of productized text-to-SQL pipelines really be around for?


Thanks so much for the kind words.

Getting NL-to-SQL to work at enterprise grade takes work and a specialized engine/agent (we use fine-tuning together with RAG). I don't believe this will be an objective for OpenAI. Additionally, there are plenty of companies who will want to use an LLM not governed by OpenAI (e.g. an open-source model). Dataherald will ultimately allow anyone to swap in the LLM of their choosing. At the moment, however, GPT-4 performs far and above any other model we've tested.


Hi dataherald team. Firstly, congratulations on the launch. I’ve been in the data space for a while and know all the effort it must be to get this up and running.

As a consultant, I'm considering this as part of an offering to a client. I was hoping to chat more about the security/privacy and operational model of the hosted vs. open-source offerings. Who should I reach out to?


Anuj[at]dataherald.com


What do you mean by complex SQL? How complex?


I've been playing with this product self-hosted for a few weeks. It can join across multiple tables and use window functions. I haven't tried self-joins yet, nor have I put much effort into tuning with Golden SQL or other documentation.

I would put this at the skill level of a junior data engineer. It's pretty impressive.


Glad you've been enjoying it. Feel free to reach out at amir (at) dataherald.com if you need any additional help setting up.


For large databases, LLMs do not perform well if you pass in the entire schema (you either run into context-window limits or confuse the LLM with too much information). There is a schema-linking step that identifies the relevant subset of the schema and passes only that. Schema linking is also part of the fine-tuning process.


Good, but what do you mean by complex SQL? How complex?


The largest we have successfully deployed on is the OSQuery schema (https://osquery.io/), which has 277 tables and lots of business context (malware, vulnerabilities, Windows registry keys, etc.).


> tell me how much data I have in my largest table and show me all of the data

  select * from dbo.bigdata

three rows down

FUCKING HELL, OUR SQL SERVER IS LOCKED UP!!!!!!!!!!!!!


While the agent does execute the query in order to recover from errors, the SQLAlchemy execution is limited to a few rows only, so if the server is locked up there is probably something else going on ;)


Can I trust the generated output SQL? I want to turn something like this into a frontend for my customers to use.


Hey man, I'm building something in this area too, but focused on building specific functions on an iterative basis. Skim through the video here: https://v2.connectedflow.app/ . Is this something you'd use? (Mind you, we're two weeks into it.)


Yes, absolutely. Every AI-generated SQL query comes with a confidence score, so you stay in control. We've had people set a confidence threshold for returning answers to users; if the threshold isn't met, a human-in-the-loop steps in.


I see. Just so I'm clear, because this is important: if I set the confidence threshold very high, does that mean my customers cannot create a malicious query? I don't want them deleting data or accessing rows they're not allowed to.


All DML commands are blocked by the engine. You can wrap the returned SQL in a CTE and pass through only the rows the customer is allowed to access, as sketched below.
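
For example (a sketch only; tenant_id is a hypothetical row-ownership column, and :current_tenant would be bound server-side, never by the user):

  with ai_query as (
      -- the AI-generated SQL is inlined here verbatim, e.g.:
      select * from rent_prices
  )
  select * from ai_query
  where tenant_id = :current_tenant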


Wouldn't the AI-generated query need knowledge of the CTE that will be wrapping it? How would the CTE prevent arbitrary joins, or access to tables via a fully-qualified `schema.table`? And couldn't somebody execute an arbitrary function on the SQL server, e.g. `pg_sleep(9999999)`?


You could set a low query execution timeout for the session.
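
In Postgres, for instance, a session-level statement timeout aborts anything long-running, pg_sleep included:

  set statement_timeout = '5s';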


It's an incomplete solution.


How does someone take structured data from a database and turn it into unstructured data for an LLM? I know RAG is a thing, but it's usually discussed in terms of documents, not tables and fields.

Could the LLM derive that same Los Angeles rent answer just from knowing the dataset, without writing SQL?


Rather than converting all of the structured data into unstructured text, we give the language model the database schema plus a few sample rows, and task it with generating a SQL query that answers the question. The alternative you suggest - turning all the structured data into text and training an LLM to answer directly - runs into a couple of problems:

Linearizing structured data into text loses the organization into columns and rows, obscuring the structured information.

Large tables hold so much information that training or fine-tuning on them requires an extensive number of tokens, which is prohibitive for many due to resource constraints.
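
Concretely, the prompt might contain something like this (a hypothetical table; the sample values are made up):

  create table rent_prices (
    city  text,
    state text,
    month text,
    year  text,
    price numeric
  );
  -- plus a few sample rows, e.g.:
  -- ('Los Angeles', 'CA', '05', '2023', 2850.00)
  -- ('San Diego',   'CA', '05', '2023', 2310.00)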


The given example will return a list of prices, not the average.


Comment needs to be higher up!


Is it possible to use your own LLM models with Dataherald?


This is on the roadmap. Which LLM are you focused on using? So far GPT-4 has performed best, head and shoulders above other LLMs, so as a result that's the one we are using now.


SQLCoder claims to perform better than OpenAI's models.


Do you keep customer query data? I'm curious what data you retain about your customers.


We do store golden SQL, query history, and schemas. Of course, everything is encrypted at rest and in transit.


FYI, the sign-up form does not play well with a leading password manager.


Can you tell me more?


The submitted link results in a login wall. It would probably be worth updating the submission to the introducing-dhai page that's mentioned all the way at the end here, as getting greeted by a login prompt that way isn't going to entice people to explore or learn about the product.


We've switched it from https://console.dataherald.ai/playground at the OP's request.


I'll try to change out the link.


The API introduction announcement is here: https://www.dataherald.com/news/introducing-dhai


If you don't want to integrate an API and instead want an interface for everyone to use that _just works_ with charting, check out patterns.app.



