Hi HN- Today, we are releasing the hosted API for our natural language to SQL engine, which allows you to:
(1) Explain Your Data: Feed in data dictionaries, dbt, schemas, and Confluence docs - we'll understand the business context of your data.
(2) Train Your AI: Fine-tune an LLM (including GPT-4) specifically for your data, increasing accuracy and lowering latency.
(3) Trust the Answer: See confidence scores with each AI-generated query and stay in control.
(4) Run complex SQL queries.
Problem background - Developers struggle to build NL-to-SQL into products because LLMs do not work out-of-the-box; they lack metadata and business definitions. Existing NL-to-SQL tools struggle with context, complexity, and adapting to your data.
For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the asker in a follow-up. However, an LLM translates this to a query with no state filter:
select price from rent_prices where city='Los Angeles' AND month='05' AND year='2023'
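To make the gap concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not how Dataherald works internally. The `state` column and the sample rows are assumptions made up for illustration; the post only shows `city`, `month`, `year`, and `price`. The second query is one plausible reading of the question, not the only correct one.

```python
import sqlite3

# Hypothetical schema: `state` is an assumed column, not shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rent_prices "
    "(city TEXT, state TEXT, month TEXT, year TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO rent_prices VALUES (?, ?, ?, ?, ?)",
    [
        ("Los Angeles", "CA", "05", "2023", 2800.0),  # made-up sample data
        ("Los Angeles", "CA", "05", "2023", 3200.0),
        ("Los Angeles", "TX", "05", "2023", 900.0),   # same city name, other state
    ],
)

# The query from the post: no aggregation and no state filter,
# so it returns raw prices from every city named Los Angeles.
naive_rows = conn.execute(
    "SELECT price FROM rent_prices "
    "WHERE city = 'Los Angeles' AND month = '05' AND year = '2023'"
).fetchall()

# One possible intended reading: the average for Los Angeles, CA only.
(disambiguated_avg,) = conn.execute(
    "SELECT AVG(price) FROM rent_prices "
    "WHERE city = ? AND state = ? AND month = ? AND year = ?",
    ("Los Angeles", "CA", "05", "2023"),
).fetchone()

print(len(naive_rows), "raw rows from the naive query")
print("average for Los Angeles, CA:", disambiguated_avg)
```

The naive query mixes rows from both states and never averages them, while the second query answers the question a human would most likely have meant.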
Dataherald integrates with major data warehouses, including PostgreSQL, Databricks, Snowflake, BigQuery, and DuckDB.
You can try it now free – no fees, no credit card, no sales pitches, just get the API key and get going. Let us know if it works for you, even your complex queries. (https://console.dataherald.ai/playground)
While the open source version works just fine (https://github.com/Dataherald/dataherald), the hosted API might be a better fit for those looking for:
(1) someone else to take care of infrastructure setup,
(2) access to an Admin UI console where you can configure and monitor performance, and
(3) ability to invite team members to a project.
We're looking for feedback, particularly from anyone who can compare this performance to other NL-to-SQL products. Share your thoughts and join the conversation.
For more background on the release: https://www.dataherald.com/news/introducing-dhai
This is an example of why I don't think LLM SQL generators are going to be that valuable to most companies. The biggest hurdle for people to get "value" from their data is rarely the SQL part - it's knowing what the question is trying to answer. When someone asks that question, do they mean the average rent of new listings from that month? Rent paid by all renters? All sizes and building types? These are all "right" answers, just to very different questions. And no amount of data dictionaries, dbt models, or other context can help narrow that down.
(Not trying to take anything away from this team, who have built a pretty well-made product. I just think their approach doesn't quite get to the underlying business problem.)