
I'm concerned that OpenAI's example documentation suggests using this to A) construct SQL queries and B) summarize emails, but that their example code doesn't include clear hooks for human validation before actions are called.

For a recipe builder it's not so big a deal, but I really worry how eager people are to remove human review from these steps. It gets rid of a very important mechanism for reducing the risks of prompt injection.
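
Even something simple would make the expectation explicit. A minimal sketch (the registry and function names here are hypothetical, not OpenAI's example code):

    # Sketch: require human confirmation before executing any function call
    # the model asks for. The shape of `tool_call` and the `registry` of
    # callables are hypothetical placeholders.
    import json

    def confirm(prompt: str) -> bool:
        return input(f"{prompt} [y/N] ").strip().lower() == "y"

    def handle_tool_call(tool_call: dict, registry: dict):
        name = tool_call["name"]
        args = json.loads(tool_call["arguments"])
        # Show the human exactly what is about to run, before running it.
        if not confirm(f"Model wants to call {name}({args}). Allow?"):
            return {"error": "rejected by user"}
        return registry[name](**args)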

The top comment here suggests wiring this up to allow GPT-4 to recursively call itself. Meanwhile, some of the best advice I've seen from security professionals on secure LLM app development is to, whenever possible, completely isolate queries from each other to reduce the potential damage a compromised agent can do before its "memory" is wiped.

There are definitely ways to use this safely, and there are definitely some pretty powerful apps you could build on top of this without much risk. LLMs as a transformation layer for trusted input is a good use-case. But are devs going to stick with that? Is it going to be used safely? Do devs understand any of the risks or how to mitigate them in the first place?

3rd-party plugins on ChatGPT have repeatedly been vulnerable in the real world, and I'm worried about what mistakes developers are going to make now that they're actively encouraged to treat GPT as even more of a low-level data layer. Especially since OpenAI's documentation on how to build secure apps is mostly pretty bad, and they don't seem to be spending much time or effort educating developers/partners on how to approach LLM security.



In my opinion the only way to use it safely is to ensure your AI only has access to data that the end user already has access to.

At that point, prompt injection is no longer an issue - because the AI doesn't need to hide anything.

Giving GPT access to your entire database, but telling it not to reveal certain bits, is never going to work. There will always be side channel vulnerabilities in those systems.
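
In practice that means the scope comes from the authenticated session, never from anything the model outputs. A rough sketch (all names here are invented):

    # Sketch: the model can request data, but the user scope comes from the
    # session the request arrived on, never from model output. `session` and
    # the `recipes` table are hypothetical.
    import sqlite3

    def get_my_recipes(session, db: sqlite3.Connection, limit: int = 20):
        # Even a prompt-injected request can only ever see this user's rows,
        # because owner_id is bound server-side.
        return db.execute(
            "SELECT name, rating FROM recipes WHERE owner_id = ? LIMIT ?",
            (session.user_id, min(limit, 100)),
        ).fetchall()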


> e.g. define a function called extract_data(name: string, birthday: string), or sql_query(query: string)

This section in OpenAI's product announcement really irritates me because it's so obvious that the model should have access to a subset of API calls that themselves fetch the data, as opposed to giving the model raw access to SQL. You could have the same capabilities while eliminating a huge amount of risk. And OpenAI just sticks this right in the announcement; they're encouraging it.
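
To make the contrast concrete, here's a sketch of the two approaches (illustrative function schemas, not OpenAI's actual example):

    # Risky: hand the model a raw SQL escape hatch.
    risky = {
        "name": "sql_query",
        "description": "Run an arbitrary SQL query",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    }

    # Safer: expose narrow functions whose implementations own the SQL and
    # parameterize it, so the model can only choose from known-safe calls.
    safer = {
        "name": "get_recent_orders",
        "description": "Fetch the current user's most recent orders",
        "parameters": {
            "type": "object",
            "properties": {"limit": {"type": "integer", "maximum": 50}},
        },
    }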

When I'm building a completely isolated backend with just regular code, I still usually put a data access layer in front of the database. I still don't want my REST endpoints directly building SQL queries or directly accessing the database, and that's without an LLM in the loop at all. It's just safer.

It's the same idea as using `innerHTML`; in general it's better, when possible, to have those kinds of calls extremely isolated and to go through functions that constrain what can go wrong. But no, OpenAI is just straight up telling developers to do the wrong thing and to give GPT unrestricted database access.


SQL doesn’t necessarily have to mean full database access.

I know it’s pretty common to have apps connect to a database with a db user with full access to do anything, but that’s definitely not the only way.

If you’re interested in being safer, it’s worth learning the security features built in to your database.
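
For example (a sketch assuming Postgres and psycopg2; the role, DSN, and view are invented): give the connection your model-driven code uses a role that can only SELECT from a few views, and mark the session read-only on top of that.

    # Sketch: connect as a restricted role and refuse writes at the session
    # level. Assumes `gpt_readonly` has only been GRANTed SELECT on a couple
    # of views, e.g. GRANT SELECT ON recipes_public TO gpt_readonly;
    import psycopg2

    conn = psycopg2.connect("dbname=app user=gpt_readonly")
    conn.set_session(readonly=True)
    with conn.cursor() as cur:
        cur.execute("SELECT name, rating FROM recipes_public LIMIT %s", (10,))
        rows = cur.fetchall()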


> If you’re interested in being safer, it’s worth learning the security features built in to your database.

The problem isn't that there's no way to be safe, the problem is that OpenAI's documentation does not do anything to discourage developers from implementing this in the most dangerous way possible. Like you suggest, the most common way this will be implemented is via a db user with full access to do anything.

Developers would be far more likely to implement this safely if they were discouraged from using direct SQL queries. Developers who know how to safely add SQL queries will still know how to do that -- but developers who are copying and pasting code or thinking naively "can't I just feed my schema into GPT" should be pushed towards an implementation that's harder to mess up.


It's hard for me to believe OpenAI's documentation will have any effect on developers who write or copy-and-paste data access code without regard to security, no matter what it says.

If you provide an API or other external access to app data and the app data contains anything not everyone should be able to access freely then your API has to implement some kind of access control. It really doesn't matter if your API is SQL-based, REST-based, or whatever.

A SQL-based API isn't inherently less secure than a non-SQL-based one if you implement access control, and a non-SQL-based API isn't inherently more secure than a SQL-based one if you don't implement access control. The SQL-ness of an API doesn't change the security picture.


> If you provide an API or other external access to app data and the app data contains anything not everyone should be able to access freely then your API has to implement some kind of access control. It really doesn't matter if your API is SQL-based, REST-based, or whatever.

I don't think that's the way developers are going to interact with GPT at all, I don't think they're looking at this as if it's external access. OpenAI's documentation makes it feel like a system library or dependency, even though it's clearly not.

I'll go out on a limb: I suspect a pretty sizable chunk (if not an outright majority) of the devs who try to build on this will not be thinking about the fact that they need access controls at all.

> A SQL-based API isn't inherently less secure than a non-SQL-based one if you implement access control, and a non-SQL-based API isn't inherently more secure than a SQL-based one if you don't implement access control. The SQL-ness of an API doesn't change the security picture.

I'm not sure I agree with this either. If I see a dev exposing direct query access to a database, my reaction is going to depend a lot on whether or not I think they're an experienced programmer already. If I know them enough to trust them, fine. Otherwise, my assumption is that they're probably doing something dangerous. I think the access controls built into SQL are a lot easier to foot-gun, so I generally advise devs to build wrappers because they're harder to mess up. Opinion me :shrug:

Regardless, I do think the way OpenAI talks about this matters, and I do think their documentation will influence how developers use the product, so if they're going to talk about SQL they should be showing, in code, examples of how to implement those access controls. "We're just providing the API, if developers mess it up it's their fault" -- I don't know, good APIs and good documentation should try, when possible, to provide a "pit of success[0]" for naive developers. In particular I think that matters when talking about a market segment that is getting a lot of naive VC money thrown at it, sometimes without a lot of diligence, and where those security risks may end up impacting regular people.

[0]: https://blog.codinghorror.com/falling-into-the-pit-of-succes...


You don't need to directly run the query it returns; you can use that query as a sub-query on a known safe set of data and let it fail if someone manages to prompt inject their way into looking at other tables/columns.

That way you can support natural-language querying without sending dozens of functions (which would eat up the context window).
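
One way to do something in that spirit (a sqlite-flavored sketch with invented names; it scopes the model's query to a scratch copy of the allowed data rather than a literal sub-query):

    # Sketch: execute the model-generated query only against a scratch
    # database that contains the user's allowed data and nothing else, so a
    # reference to any other table or column simply errors out.
    import sqlite3

    def run_scoped_query(user_rows, model_query: str):
        scratch = sqlite3.connect(":memory:")
        scratch.execute("CREATE TABLE recipes (name TEXT, rating INTEGER)")
        scratch.executemany("INSERT INTO recipes VALUES (?, ?)", user_rows)
        # Anything outside this scratch schema raises sqlite3.OperationalError
        # instead of leaking data.
        return scratch.execute(model_query).fetchall()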


You can do that (I wouldn't advise it; there are still problems that are better solved by building explicit functions, but you can use subqueries and it would be safer) -- but most developers won't. They'll run the query directly. Most developers also will not execute it as a read-only query; they'll give the LLM write access to the database.

If OpenAI doesn't know that, then I don't know what to say, they haven't spent enough time writing documentation for general users.


You can't advise for or against it without a well defined problem: for some cases explicit functions won't even be an option.

Defining basic CRUD functions for even a few basic entities will eat up a ton of tokens in schema definitions, and it still suffers from injection if you want to support querying on data that wasn't well defined a priori, which is a problem I've worked on.

Overall if this was one of their example projects I'd be disappointed, but it was a snippet in a release note. So far their actual example projects have done a fair job showing where guardrails in production systems are needed, I wouldn't over-index on this.


> You can't advise for or against it without a well defined problem: for some cases explicit functions won't even be an option.

On average I think I can. I mean, I can't know without the exact problem specifications whether or not a developer should use `innerHTML`/`eval`. But I can offer general advice against it, even though both can be used securely. I feel pretty safe saying that exposing SQL access directly in an API will usually lead to more fragile infrastructure. There are plenty of exceptions of course, but there are exceptions to pretty much all programming advice. I don't think it's good for it to be one of the first examples they bring up for how to use the API.

----

> Overall if this was one of their example projects I'd be disappointed

I have similar complaints about their example code. They include the comment:

> # Note: the JSON response from the model may not be valid JSON

But they don't actually do schema validation here or check anything. Their example project isn't fit to deploy. My thought on this is that if every response for practically every project needs to have schema validation (and I would strongly advise doing schema validation on every response), then the sample code should have schema validation in it. Their example project should be something that could be almost copy-and-pasted.

If that makes the code sample longer, well... that is the minimum complexity to build an app on this. The sample code should reflect that.
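
Roughly the minimum I'd expect the sample to show (a sketch, not OpenAI's code; the argument schema is made up):

    # Sketch: never act on the model's "JSON" until it parses and matches the
    # schema you expected.
    import json
    from jsonschema import ValidationError, validate

    ARGS_SCHEMA = {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
        "additionalProperties": False,
    }

    def parse_function_args(raw: str):
        try:
            args = json.loads(raw)       # may not even be valid JSON
            validate(args, ARGS_SCHEMA)  # may not match the expected shape
        except (json.JSONDecodeError, ValidationError) as exc:
            return None, f"invalid model output: {exc}"
        return args, None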

> and still suffers from injection if you want to support querying on data that wasn't well defined a-priori

This is a really good point. My response would be that they should be expanding on this as well. I'm really frustrated that OpenAI's documentation provides (imo) basically no really practical/great security advice other than "hey, this problem exists, make sure you deal with it." But it seems to me like they're already falling over on providing good documentation before they even get to the point where they can talk seriously about bigger security decisions.


> your AI only has access to data that the end user already has access to.

That doesn't work, for the same reason you mention with a DB ... any data source is vulnerable to indirect injection attacks. If you open the door to ANY data source, this is a factor, including ones under the sole "control" of the user.


>At that point, prompt injection is no-longer an issue [...]

As far as input goes, yes. But I am more worried about agents that can take actions that affect the outside world, like sending emails on your behalf.


I was going to say “I look forward to it and think it’s hilarious,” but then I remembered that most victims will be people learning to code, not companies. It would really suck to suddenly lose your recipe database when you just wanted to figure out how this programming stuff worked.

Some kind of “heads up” tagline is probably a good idea, yeah.


I think the victims will mostly be the users of the software. The personal assistant that can handle your calendar and emails and all would be able to do real damage.


I don't understand why they have done this. Like, how did the conversations go when it was pointed out to them what a pretty darn bad idea it was to recommend connecting ChatGPT directly to a SQL database?

I know we are supposed to assume incompetence over malice, but no one is that incompetent. They must have had the conversations, and chose to do it anyway.


Why is this unreasonable to you? I can imagine using this: just run it with read access and check the SQL if the results are interesting.


Even read-only, you are giving a black box API access to your data.


If it's on Azure anyway, I don't see the big deal, especially if you are an enterprise and so are buying it via Azure instead of directly.


Perhaps they plan on having ChatGPT make a quick copy of your database, for your convenience of course.



