PostgreSQL at low level: stay curious (erthalion.info)
255 points by erthalion on Dec 6, 2019 | 72 comments



Worth it just to learn about the -k option to strace, something that appears to have been added in the last 5 years.


Under section 11 "memory reclaim" the author states:

>"But memory request is actually used only internally e.g. to calculate oom_adj and QoS."

Could someone elaborate on this? Doesn't Kubernetes also use this to decide if a container could be scheduled on this node in the first place? Or am I missing some other obvious point the author is trying to make?


Yes, K8S uses this information for scheduling. The phrase in the text is mostly about K8S not using it for resource isolation at the cgroups level; sorry if that wasn't clear enough.


Kind of unrelated and very high level thoughts, but I wish postgres had a scripting model a bit like redis, where functions are cached under their sha1 hash and can be executed by specifying that hash. Cached functions would then be evicted after not being run for a certain time. From the application side I'd just always try executing the hash of the function and create it on the fly when that errors (because it doesn't exist on the db side yet). A model similar to this one would completely convert me to defining most of my app logic in plsql functions.


I wonder, could you describe your use-case a little more?

I'm curious why you couldn't write a stored procedure, then call that from your application?

If you need to see if your function exists, you can do something like this [0].

Also using pg_proc, you can get the DDL of your function e.g. [1]

  SELECT prosrc FROM pg_proc WHERE proname = 'function_name';
And you could then check for equality of the source (or its hash), if you are tied to your current application logic

[0] https://stackoverflow.com/questions/24773603/how-to-find-if-... [1] https://stackoverflow.com/questions/3524859/how-to-display-f...
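
For example, a rough sketch of both checks (assuming a function named function_name; to_regproc returns NULL if no such function exists):

    -- Does the function exist?
    SELECT to_regproc('function_name') IS NOT NULL AS function_exists;
    -- Compare a hash of the stored source against what the application expects
    SELECT md5(prosrc) FROM pg_proc WHERE proname = 'function_name';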


If you do it that way, you also end up having to rename your function every time it changes, if it's possible that older clients are still running. And if you're tidy-minded you're going to spend effort building a system for removing old versions when they're not going to be needed any more.

It's all doable, but it's fiddly. Explicit support for this sort of setup would be a good thing.


Of all the reasons a sproc may need to change, I can't imagine one where you'd want to keep the old version around while also giving the new version the same name as the old.


You can't imagine multiple client applications of different versions attempting to install, and then interact with, different versions of the same stored-proc API abstracting away (variously-polyfilled views + triggers upon) a single shared underlying dataset?

I.e. a world where "FooClient v1.1.0" wants to interact with "DB schema Core-1/FooClient-v1.1.0", while allowing "FooClient v1.0.0" to still see "DB schema Core-1/FooClient-v1.0.0", and "BarClient v1.0.0" to see an entirely separate "DB schema Core-1/BarClient-v1.0.0"?

(I mean, sure, you could have each client construct its "API layer" of sprocs et al in its own client-specific schema, or worse yet, as its own connection-specific temporary objects—but that's pretty dang inefficient when tons of different clients are sharing the same few API versions, and when the different API versions share identical copies of 99% of their sprocs.)


Honestly, I've never worked in a shop that would have let anyone design an application where clients were allowed to write stored procedures or views back to the DB.

Similarly, I'd never let a client modify the application logic running on my app servers.

I mean, I can imagine it, but it sounds like painting "root me and do what you will with the corpse" on your front door.


I can understand your point of view as a “database supremacist”—i.e. someone who thinks DBAs are priests who determine the canonical schema, where development teams (even within the same org) are supplicants to the DBA to get the schema changed (whereupon the schema is then changed for everyone else, who have to all just suck it up.) It’s a common viewpoint, and even a sensible one when managing DBMSes that aren’t “strong” enough to support the features required for virtualizing the schema (mostly: sprocs, schema-namespacing and search paths, low-overhead writable views that can “symlink” objects from one namespace into another, and user ACLs upon database objects where granting write access to a writable view means granting write access to the underlying object, while protecting the object itself.)

The alternative viewpoint is one where a DBMS isn’t its own Data Warehouse, but is rather a “relational transformation workspace” of a Data Lake. Consider BigQuery as a DBMS. Who determines BigQuery’s schema? Nobody. Each user has their own stuff in their own namespace, and users mount stuff from one-another’s namespaces into their own namespace. The job of the people who run BigQuery (or any other multi-tenant data-warehouse infrastructure of this type) is just to enable users to collaborate with one-another through the DBMS, where Data Definition “APIs” are contracts established between pairs of clients, rather than between a developer and the DBMS’s DBA.

That was more the stance I was talking about with my parent post.

But even under a Database-supremacist viewpoint, what I was talking about can still happen. Whether clients are pushing objects into the DB or not, as long as there are requirements to not break old clients, but also requirements to evolve the schema for new clients, the DBA will arrive at a set of constraints that can only be solved by doing exactly what these clients would do themselves, if given DDL privileges: to define separate “virtual namespaces” where the same object names refer to mostly-common but some slightly-different versions of the same database objects, in order to allow LegacyClient to see PolyfilledLegacySchema while NewClients see the regular, “true” schema (or vice-versa.)

In any organization with a sufficiently-large team structure that must share data (picture e.g. Amazon with their store), there will be a number of teams (clients) with different conflicting constraints upon the same data set. This makes the job of the DBA managing their shared data, very similar to the job of an upstream library software developer: you must keep the library’s API stable so that existing downstream clients won’t break; while also making changes to those APIs such that new downstream clients can get the features they’re demanding.

Sure, there are alternate ways of solving this stable+evolving constraint problem for shared data, just like there are alternate ways of solving the stable+evolving constraint problem for a software library. You can just build a database that’s “downstream” of your own, that decorates the data with the new features (or tell the client that needs it to do it themselves!), and which syncs changes back to the upstream DB in the upstream’s canonical format. This is like building one library to wrap another. But this is, in the end, a lot of extra complexity that introduces the chance of data corruption along the way. Much safer to warehouse your data in one system with one MVCC world-clock. It might be a requirement if your upstream DBMS is “weak” (e.g. some legacy mainframe DB2 instance) but modern DBMSes give you every feature you need to provide multiple interfaces from a single instance. Why not take advantage?

———

And, if you can accept that argument—and you also accept the idea that clients should be able to use the DBMS as a scratch workspace for their queries (i.e. that creating temporary tables isn’t a perversion of what DBMSes stand for)—then combining the two results in the idea that each client could, in theory, just tell the DBMS of its particular needs for an API to access a shared dataset, and that this would result in the automatic creation of a virtual namespace perfectly suited to that client’s needs.

Sure, let’s pretend the client is specifying this virtual namespace using some sort of abstract specification language, where the DBMS takes it, parses it, and then has the power to satisfy those requirements however it wants, maybe just by directing the client to the nearest matching existing virtual namespace, etc. But if you think for long enough about what exactly clients would need to send in these requirements specifications, you’ll realize that the specs will still turn out to need exactly the level of power of regular SQL DDL statements. You don’t need a shim layer; you just need a sufficiently-granular ACL system, such that the client can create “virtual stuff” for their own use, but can’t create or interfere with “real stuff”, or with other clients.

(And then it turns out that, since the client would just be communicating its needs via SQL DDL anyway, it makes perfect sense to let dev-teams avoid Not-Invented-Here by just reusing the perfectly good SQL-DDL-emitter that is “database migration frameworks.”)


The output is slightly different, but the semantic behavior remains the same? Databases and applications update independently, so versioning is required. If the database offers nothing for this, then an adhoc solution will generally be found (eg suffixing with _v1, _v2, etc), but that's not at all ideal


Why is it not ideal? Literally every single other language solves it this way.


Highly manual, error-prone, lack of consistency means lack of tooling support, etc. Most languages simply don't handle the scenario; most languages also didn't handle code-testing, or building, or etc, but that hardly makes it an ideal situation.

Additionally, in the case of DBs, it's strange to call it a "language"; at best it's a language with a supremely massive runtime, but more appropriately, it's a platform.

And being a platform, it would be fairly reasonable and unsurprising to consider things like version control and function-versioning to be in scope, even though most DB platforms don't handle it either.

But I'm also consistently surprised by how lacking the DB tooling space is compared to programming languages; eg the formatters are generally lacking/incomplete (foolishly imagining the SQL standard to be sufficient to claim support for all DBs), the general lack of any version control, the general lack of SQL transpilers (again, probably due to the misunderstanding that the standard standardizes anything; specifically the lack of an 80-90% OS automatic translation for MySQL -> PGSQL is astounding), the general lack of IDE support (most can't even assist in maintaining datatypes when writing SQL, despite the all-knowing schema available to it; you basically get notepad + a pretty view of the output), etc.

DBAs are willing to put up with quite a bit, and I can't figure out why. The engines themselves are phenomenal works of engineering; the usage of it is utterly pathetic


> DBAs are willing to put up with quite a bit, and I can't figure out why. The engines themselves are phenomenal works of engineering; the usage of it is utterly pathetic

Perhaps they don't want their programs to bloat until they can read email.


Enhancing management of your relational database seems to me quite in scope for your relational database management system. Whether the functionality exists as an arbitrary other program, in whatever minuscule chunk fits your preference, or as part of an additional arbitrary SQL extension (we would all certainly love more random keywords/syntax), is unimportant.

The functionality/tooling is missing, or at best, lacking, and has nothing to do with a preference for no-bloat. It simply means that we have arbitrary, adhoc implementations to enhance the management of our failing management system. That the database must be babied is not a positive, and RDBMS are meant to remedy that.


You are vastly overstating the perceived deficiency.

PostgreSQL gives you functions, views, rules for query writing.

Plus it gives you ephemeral versions of all those, so you don't have to manage it separately from the client code.


>You are vastly overstating the perceived deficiency.

How so? There is no versioning support. That's it. I'm not sure how much more deficient you can get. Anti-versioning features?

>PostgreSQL gives you functions, views, rules for query writing.

Right, which requires an adhoc versioning mechanism, or coupling application + DB updates. Thankfully it was at least realized that versioning your data made sense (MVCC), likely a result of how often problems occur when manually versioning it in your application logic with multiple arbitrary threads (read: humans, applications) operating on it, but that lesson somehow got lost on the rest of the system (where multiple arbitrary humans and applications fiddle with it).

>Plus it gives you ephemeral versions of all those so you don't have manage it separately from the client code.

Moving all of your database logic into your application code is very much a poor state of affairs, and defeats much of the purpose of offering functions, views, etc in the first place. Ideally, you want them to live for more than a single transaction, and be used by multiple applications, and presumably written & maintained by your DBA, who presumably knows the database engine, and its usage patterns, better than your application developer.

The ephemeral version is ideally a method of last resort (for one-off usecases); but if you want decent versioning tooling (eg git), this is your method of first resort.

Application developers learned this lesson decades ago. What I've never understood about databases is that despite the massive industry, usage, and ecosystem, their tooling is still stuck in the 80's.


This is perhaps the most confusing comment I've ever read.

> versioning your data made sense (MVCC)

In PostgreSQL, functions can be created and versioned with MVCC exactly like data.

> Moving all of your database logic into your application code is very much a poor state of affairs

Okay, then deploy it separately.

> written & maintained by your DBA

But isn't that the "ceremony" you railed so hard against?

> tooling is still stuck in the 80's

Rather I think you want them to do something (though I'm not exactly sure what) that most other people don't. The tooling for Apache config files is stuck in the 90s too.

> Application developers learned this lesson decades ago.

What lesson?

Is it exactly as hard to install updates to your application as it is to install updates to your database?

Is it exactly as hard to offer a new API version from your HTTP server (/v2/) as it is from your SQL server?


>In PostgreSQL, functions can be created and versioned with MVCC exactly like data.

Perhaps not the best statement; I was trying to tie the notion of MVCC to function-versioning, except that function-versioning happens at human speeds, and has the same problems that led to MVCC (but mitigated due to how slow the process is [not computer speeds] and because humans can actively intervene to manually correct it).

>Okay, then deploy it separately.

Deploying your changes isn't the problem. Tracking the changes is. The database exists as a globally available resource of statically defined functions accessible to all applications (and humans) interacting with the database. It pretends that there only ever exists one correct interpretation of the database, but in fact the database is ever-changing, and ever-racing. This model was found not to be valid for data, and so we see "split timelines", that is, different views of the same dataset, with defined semantics on how to collapse the views. This is MVCC. But this was not replicated across the rest of the system. The data is versioned, internally, by the DB. Nothing else is. Yet everything changes.

>But isn't that the "ceremony" you railed so hard against?

No? What ceremony? Nothing exists!

My problem is that there is no defined process, and that it's completely adhoc and manual, which means tooling can never properly exist -- not that it's difficult. It's error-prone; the ceremony doesn't matter. And importantly, we know a solid process can be defined (see: version control, or redis's model).

>Is it exactly as hard to install updates to your application as it is to install updates to your database?

Installation doesn't matter; databases are great at updating their functions -- they just drop and replace whatever exists! I'm talking about tracking function change over time: the database exists in multiple states at different instances of being queried. This is automatically handled for data, but not logic.

>Is it exactly as hard to offer a new API version from your HTTP server (/v2/) as it is from your SQL server?

This is true, and I'm not happy with this either :-) You can't independently update your HTTP server and your client code for the same reason you can't independently update your HTTP server and your DB, without individually & manually tracking every function change. As a result, you're forced to couple releases of your application with releases of your HTTP server with releases of your DB changes, or track it manually (_V2).


Same situation as with any interface: you either break backwards compatibility or you don't - this is not a problem for databases to solve.


The situation I'm thinking of (and I think janpot is) is where your stored procedure or function isn't intended to be a shared interface, but a part of what is notionally your client's code that you happen to want to run server-side.

DO can be good for this, but it doesn't cover all cases: sometimes you want a named function to use in an expression, and sometimes you want multiple procedures which can call each other.

Just having CREATE [OR REPLACE] TEMPORARY FUNCTION would be pleasant.


You can create temporary functions in PostgreSQL, but I have not checked the performance of using them (I think they could theoretically bloat the catalog, but I only use them in migrations). Just do something like the below:

    CREATE FUNCTION pg_temp.f(int, int) RETURNS int LANGUAGE sql AS $$ SELECT $1 + $2 $$;
    SELECT pg_temp.f(1, 3);
     f 
    ---
     4
    (1 row)


This is in fact what I do. I'm agreeing with janpot that a richer model would be pleasant.

(In practice, when I've wanted this sort of thing it's been good enough to either install all the functions the client might want to use whenever it makes a connection, or else to use CREATE OR REPLACE FUNCTION every time it's going to run a query needing that function. But I can imagine cases where neither of those would be efficient enough and I'd end up having the client track what it's already loaded. Some help from the server would be good there.)


Yep, I just found out about temporary functions today. Gonna start experimenting with that.


And when you do break backwards compatibility, you version.

Now how is versioning managed? The database can easily fill in here with a standardized approach, with supporting tooling.

Or it can not, in which case you come up with an adhoc versioning scheme that you maintain manually.

Same way you do in all computing scenarios when you don't have any versioning tool available to you:

AGREEMENT_v1.doc

AGREEMENT_v2.doc

AGREEMENT_v2_revised.doc

AGREMEENT_v2_Revision2.doc

AGREEMENT_v3.doc

AGREEMENT_v3_fixed.doc

or perhaps you start doing some strange behaviors to force other tools to fit in, like attempting to version your schema/functions with git, by tying it to a single application's ORM

or perhaps you stop touching the database directly, and try to make it all work with migration-scripts, and it all falls apart when a second application comes into play

or perhaps you try to force coordination between all dependent applications and your database, and make this update simultaneously

or perhaps you convince the RDB management system to manage your system


Putting the functions in a separate schema (ie namespace) per version will prevent having to name-mangle the function names themselves. Then you just manage the schema names for versioning, and adjusting the search path can make it work implicitly with the proper version if that's wanted.
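
A minimal sketch of that layout (hypothetical schema and function names; the placeholder bodies are just to keep it runnable):

    -- One schema per API version, same function name in each
    CREATE SCHEMA api_v1;
    CREATE SCHEMA api_v2;
    CREATE FUNCTION api_v1.get_total(acct int) RETURNS numeric
      LANGUAGE sql AS $$ SELECT 0::numeric $$;  -- placeholder body
    CREATE FUNCTION api_v2.get_total(acct int) RETURNS numeric
      LANGUAGE sql AS $$ SELECT 1::numeric $$;  -- placeholder body
    -- A v1 client only sets its search path; unqualified calls then resolve to api_v1
    SET search_path = api_v1, public;
    SELECT get_total(42);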


Just hash the create statement and use that as the name? Have whatever code hashes these dump those hashes, or just use some last-used timestamp to cull them every once in a while (though I doubt that's actually necessary, due to how I'd expect them to be stored). And when they accumulate, you can still start to clean up by generating a list of all hashes used, waiting a week or so until all older versions have finished, and dropping all stored procedures that are not in your list.
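
A rough sketch of that idea (the function body and naming scheme here are purely hypothetical):

    DO $do$
    DECLARE
      body  text := 'SELECT $1 + $2';     -- hypothetical function body
      fname text := 'fn_' || md5(body);   -- name derived from the body's hash
    BEGIN
      -- (Re)create the function under its hash-based name
      EXECUTE format(
        'CREATE OR REPLACE FUNCTION %I(int, int) RETURNS int LANGUAGE sql AS %L',
        fname, body);
    END
    $do$;
The application would compute the same hash locally, call the function by that name, and fall back to creating it when the call fails because it doesn't exist yet.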


As far as I'm concerned, that comes squarely under the scope of "doable, but fiddly".


Renaming is the point. I want it to be tightly coupled to the specific version of my application.


For what it's worth, I recommend something based on running `CREATE FUNCTION pg_temp.foo()` once per session (and calling it explicitly as `pg_temp.foo()`).


STORED PROCEDURES: Available since PG11. Name them whatever you want, and call them with parameters. They can be written in multiple languages like PL/pgSQL, which supports more constructs than standard SQL, and they support transaction control internally. They are part of your schema and always available, with support for access control.

PREPARED STATEMENTS: For queries that are executed often. Set up on every new session, but automatically compiled, cached for the session lifetime, and executed with just a minimal identifier. Depends on your client driver to handle them seamlessly.

More info: https://www.postgresql.org/docs/current/sql-prepare.html
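
A minimal sketch of both (hypothetical table and names; PREPARE/EXECUTE is what most drivers do under the hood for parameterized queries):

    -- Stored procedure (PG11+): lives in the schema, callable by any session
    CREATE TABLE visits(user_id int, visited_at timestamptz);
    CREATE PROCEDURE log_visit(v_user int) LANGUAGE sql AS
      $$ INSERT INTO visits(user_id, visited_at) VALUES (v_user, now()) $$;
    CALL log_visit(42);
    -- Prepared statement: parsed once, reusable for the rest of this session only
    PREPARE visits_for(int) AS SELECT * FROM visits WHERE user_id = $1;
    EXECUTE visits_for(42);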


Looks like the plan caching is only per session which is a big bummer if you call the same statements over and over from multiple sessions.

Plus you have to manually deal with this, which is weird coming from MSSQL where it just does it with a global cache based on statement text hash all the time.

Brings me back to the early 2000's when it was recommended to always use stored procedures because either MSSQL 7 or 2000 (can't remember which) didn't cache ad-hoc statements, only sprocs.


Plan caching is different from prepared statements, but yes, PG is not as advanced as the commercial databases in execution.

Usually the database client drivers will handle this for you. If you're using .NET, the npgsql driver maintains a connection pool and creates prepared statements for each new session. If you have something like PGBouncer in the middle then it gets more complicated.


There is almost no point in using prepared statements in MSSQL anymore, and there hasn't been for a long time. The only benefit comes from not sending the prepared statement over the wire each time and not having to hash it to do a plan lookup.

That benefit is so minor compared to the time taken to parse and plan the statement that almost no one uses it; for more details:

https://dba.stackexchange.com/questions/129659/what-is-the-s...

In PG it seems to be the only choice, and on top of that it isn't even shared among connections. This honestly blows my mind and is something I will have to remember if I move anything serious over to PG.

I like PG a lot and it does a lot of things MSSQL doesn't, but this is a big missing performance piece.


I'm very familiar with both (and we use both), but plan caching isn't going to make that much of a difference for many scenarios. PG also has simpler planning than MSSQL anyway. I doubt you'd notice a difference unless you're running extremely complex queries or a very high rate of transactions.


What is the fundamental point of these vs. getting data, using your preferred language, and then writing data? Is it basically all about performance, or are there other gains?


I dabbled with stored procedures when I was dealing with a problem that required one bit of data in the database to be in sync with another bit of data. Basically I had a list of records with time values, and I wanted to have another record store the sum of all those time values. There were lots of records, and writes didn’t happen that often, so I thought a stored procedure that was triggered when a record’s time value was updated would work well.

I wanted to be able to have multiple APIs wrapping the database (one GraphQL layer that would make sql queries directly and be consumed by an iOS and web app, and another traditional REST one designed for integration with third parties that don’t support GraphQL). I could have either duplicated the logic to sum up the values across those layers, or made a shared api layer underneath the REST and GraphQL layers (or put the rest layer underneath the graphql layer), but those other options seemed less ideal. I wanted the API layer only to be responsible for sending data to and from the database, authentication and permissions.

That app was a side project that I lost interest in, so I never got to see the long term implications of that, but I still think stored procedures were the right call for that particular problem. As much for data consistency and layer responsibility reasons as for performance reasons.

The main difficulty with stored procedures seems to be that they’re kind of hard to test and isolate. But when used sparingly, and when you test that the effects you’re expecting happen on writes (even if you can’t test the procedure explicitly), I think they’re a good way to keep purely data-consistency-related logic inside the database.
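
For reference, a sketch of that kind of setup (hypothetical table and column names; the trigger recomputes the affected total on every write):

    CREATE TABLE time_entries(project_id int, minutes int);
    CREATE TABLE project_totals(project_id int PRIMARY KEY, total_minutes int NOT NULL DEFAULT 0);
    CREATE FUNCTION refresh_project_total() RETURNS trigger LANGUAGE plpgsql AS $$
    DECLARE
      pid int := CASE WHEN TG_OP = 'DELETE' THEN OLD.project_id ELSE NEW.project_id END;
    BEGIN
      INSERT INTO project_totals(project_id, total_minutes)
      VALUES (pid, (SELECT COALESCE(sum(minutes), 0) FROM time_entries WHERE project_id = pid))
      ON CONFLICT (project_id) DO UPDATE SET total_minutes = EXCLUDED.total_minutes;
      RETURN NULL;  -- AFTER trigger: return value is ignored
    END $$;
    CREATE TRIGGER time_entries_total
      AFTER INSERT OR UPDATE OR DELETE ON time_entries
      FOR EACH ROW EXECUTE PROCEDURE refresh_project_total();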


That works just fine until it doesn't. Doing all the work on the RDBMS side within a single transaction guarantees that nobody has changed the data you read before you wrote it back. If they have, either your transaction fails or theirs will, but consistency of the data is guaranteed.


Yes, I'm proposing stored procedures without the management headache.


Ah, so temporary stored procedures that are created and cached at the connection level and lifetime. Makes sense, might be worth a feature request to the PG team.


Or longer; they can exist for the lifetime of a specific version of my app. That's why I'd cache and identify them by their hash.


How are you going to identify them by hash or know which hash is significant? Why not store that determiner in the SP name and use that to pick and choose different versions based on that determiner.

Unless each binary gets its own hash (in which case you could solve it by building the SP when the package is built and then injecting it into the binary), there's some piece of information you'd use to choose which one to use - why obfuscate that information behind a hash instead of just using it directly as a descriptor?


If you're doing that then why not just use stored procedures? Prefix with a version and dropping them to cleanup is a one liner.

The only issue is dependency tracking in case you change any schemas but that's an issue with any saved code in the database.
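
For example, something along these lines (hypothetical '_v1' suffix convention; this only generates the DROP statements so they can be reviewed first):

    SELECT format('DROP FUNCTION %I.%I(%s);',
                  n.nspname, p.proname, pg_get_function_identity_arguments(p.oid))
    FROM pg_proc p
    JOIN pg_namespace n ON n.oid = p.pronamespace
    WHERE p.proname LIKE '%\_v1';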


MS SQL Server hashes all incoming statements and looks them up in global shared cache, essentially doing what Redis does without having to deal with the hash directly.

The cache will get evicted based on time with no use and memory pressure etc.

I am actually kind of surprised to find that Postgresql doesn't do this also, wow, I thought pretty much every modern RDBMS did this.


> MS SQL Server

That applies to queries but not more general TSQL with branching logic, etc.

> wow, I thought pretty much every modern RDBMS did this.

Both PostgreSQL and MySQL have plan caches using prepared statements. (SQL PREPARE statement)

PostgreSQL further allows tuning of whether the query plan is specialized based on parameter values.
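
That tuning is the plan_cache_mode setting (added in PostgreSQL 12, if I remember correctly); a quick sketch against a hypothetical table t(x int):

    PREPARE q(int) AS SELECT * FROM t WHERE x = $1;
    -- 'auto' (the default) may switch to a generic plan after a few executions;
    -- force_custom_plan re-plans with the actual parameter values every time
    SET plan_cache_mode = 'force_custom_plan';
    EXECUTE q(1);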


>That applies to queries but not more general TSQL with branching logic, etc.

That is not my understanding; all of the TSQL including procedural logic is "compiled", but it is not optimized because it is not a set-based operation. However, no parsing or binding needs to be done on it if it is reused.

The entire TSQL statement including branching logic is run through the parser and algebrizer and turned into a binary form that the optimizer can work on then cached.

>PostgreSQL further allows tuning of whether the query plan is specialized based on parameter values.

SQL Server has prepared statements but they are basically unnecessary; there is no need to manually tell it to prepare and reuse, as it will do so on its own. It's mostly there for backward compatibility.

SQL Server will also optimize based on parameter values (look up parameter sniffing); it can store multiple plans and may not use a stored plan based on parameters. It will also potentially auto-parameterize sql statements that don't use parameters so they can be reused.

Again, I have used SQL Server going back to 6.5 when it didn't cache ad-hoc queries and you had to use sprocs to get caching. During that time almost every app went through sprocs for every call because of the overhead of parsing and planning; it was like a 2x speedup.

Nowadays we have a lot more network, memory and cpu, but I still don't want my db server wasting it on parsing and planning the same statements over and over for no good reason. What programming language, even interpreted, doesn't keep some optimized ready-to-run version around for repeated calls and instead reparses and binds on every hit?


> parsing and planning the same statements over and over...What programming language even interpreted...reparses and binds on every hit?

Yeah, I agree.

But I typically don't see it as a big problem because (1) queries are short so parsing is nothing (2) in some cases you'll want query plans to depend on parameter values, which hurts any cache rate (3) personally I heavily batch operations, e.g. a big insert is simply:

    INSERT INTO example
    SELECT * FROM unnest($?::int[], $?::text[])
P.S. AWS Aurora PostgreSQL has a plugin with global query plan caching and management. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...


Batch inserts are not the issue in MSSQL; typically you would use bulk insert instead if you really wanted performance, although plan caching still helps.

Where you get big wins in plan caching are more complex selects that are run repeatedly from an application on multiple connections, usually from a pool of say 100. These selects with many joins take significant time to parse and plan but execute very fast. This is pretty typical in a business app that is read-heavy, calling common selects over and over, like an entity find or read with security checks.


MSSQL (and others like Oracle) have 100s of millions in research and development invested by now. PG is great for 80% of uses but it won't match the advanced performance scenarios.


Hashing incoming text statements and storing the plan for reuse seems like an obvious optimization that does not require billions in R&D, you just copy the well known behavior of the existing databases.

Is there some complexity I am missing that would require huge R&D, like it being too hard to share the plan and procedures across sessions, or the hashing algo having too much overhead?


AIUI, Postgres has had transactional DDL from early on, while other DBMSes didn't. That means that different Postgres sessions can see different schemas. So I can easily believe that sharing plans across sessions would have been more difficult than for other systems, yes.

There are presumably also concerns that the same text can have different meanings in different sessions because of different per-session configuration, like the search path.

Nowadays, I think it's more that the gain isn't terribly large.

Parsing SQL is pretty cheap.

Normally with PostgreSQL you re-plan for each execution anyway. If you're going to ask it to reuse plans, the overhead of planning once per session is probably small enough that reducing it further doesn't gain much.

And with modern core counts and scalability concerns, there's a lot to be said for per-process rather than global caches anyway.


Postgres can operate over transactional DDL, but in most cases I think this would be a fine thing to isolate over - the more worrying case is when different transaction spaces have different understandings of row statistics. PG does use row statistics to inform planning, so it's not inconceivable that sharing a plan between two queries running in vastly different data states would result in a highly inefficient query.


MSSQL does this as well, it will evict plans based on significant statistic changes and does parameter sniffing to decide on reusing a plan.


SQL server has had transactional DDL for as long as I remember.

MSSQL also uses per-session settings as part of the cache key. This can be confusing if you think it's only based on statement text and wonder why two different clients are getting different plans; it's because they have different settings.

The planning overhead would completely destroy some applications I have worked on. Some more complex queries execute extremely fast but take some time to plan. It's hard for me to believe the planner is so much faster as to be a non-issue.

Global caches are key to db performance; that's why PG has shared buffers. MSSQL is single-process multithreaded though, while PG is process-per-connection, so I would imagine that complicates shared data structures somewhat.


The penalty for using a cached plan that is no longer optimal can be devastating; I've often had to use "with recompile" to get around this on SQL Server. I don't think a global plan cache is a no-brainer - the scenario you describe is probably better served by a short-lived results cache.


Or worse, I've run into cases where just enough "bad" parameters flow into the planner to topple the statistics on a previously well-behaving stored proc.

SQL server does not recover quickly after a cached plan goes wrong. More than once, either adding "WITH RECOMPILE" or manually recompiling a SP has fixed our performance issues.

"Everyone has a plan until they get punched in the face" - Mike Tyson


I have rarely run into this issue, but they do occur, which is why WITH RECOMPILE exists. The majority of the time caching the plans is a benefit, and therefore it has been the default behavior with an opt-out option.

With that kind of plan instability I start looking at query hints, though, which PG doesn't have either...


That kind of simple plan cache you imagine has rather limited usage, because it can only handle non-parametrized queries without risking regressions in the query plan (e.g. PostgreSQL may be able to find a more optimal plan for "SELECT * FROM t WHERE x = 1" than for "SELECT * FROM t WHERE x = $1", even when the parameters are sent separately in the protocol). BTW, this is also currently an issue with prepared statements in PostgreSQL, so they can be a performance loss compared to re-planning the query (there are some very simple heuristics which try to solve it, but they are far from perfect).

Sure, it is possible to apply advanced techniques to try to guess when re-planning is worth it. Alternatively you can make it an option for PostgreSQL to cache parametrized queries too and accept the possible regressions, but that becomes less interesting due to most ORMs sending ["SELECT * FROM t WHERE x = 42"] rather than ["SELECT * FROM t WHERE x = $1", 42] which means you cannot simply hash the string.

Again it is possible to do but I think you underestimate the work necessary for it to be a widely useful feature. And on top of this there are probably legacy code reasons like PostgreSQL process based architecture which makes it even more work.


SQL Server automatically replans cached queries when statistics change. It's not a complicated process to hook into stats rebuild and follow up with a concurrent replan.

Also, databases are designed for SQL, not ORMs. Not using parameterized queries at this point is just archaic, and there are a lot more problems than query caching to deal with first. The commercial databases can also tokenize input strings and parameterize them after the fact; this is what happens when you submit things as plain text and still use a cached plan.


You can probably execute an anonymous plsql block, and postgres will detect for you if it has already parsed that block. No need to do that hash thing yourself.
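
For reference, a minimal sketch of an anonymous block (note, as the replies point out, you can't pass parameters into it):

    DO $$
    DECLARE
      n int := 2 + 2;   -- values have to be baked into the block text itself
    BEGIN
      RAISE NOTICE 'result: %', n;
    END $$;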


You can't pass parameters to those.


Indeed, postgres doesn’t allow this. I expected it to work, as I’ve used this quite often with Oracle.


What’s wrong with stored procedures?


They have to be managed like the database schema, which tends to involve a lot of ceremony. Coping with different function versions for different client versions can be inconvenient.

In cases where you want to run functions on the server for performance (rather than for hiding some underlying structure), it's often more convenient to maintain them together with other client code.

You can do things like creating functions in pg_temp once per connection, but more upstream support for this sort of setup would indeed be pleasant.



Sometimes you can, but sometimes you want something that can be called as a function.

And there's no PREPARE IF NOT EXISTS, so you either have to track yourself whether you've already done it or grub around in pg_prepared_statements.
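
For example, a poor man's check against that view (hypothetical statement name; the client prepares only when this returns false):

    SELECT EXISTS (SELECT 1 FROM pg_prepared_statements
                   WHERE name = 'my_query') AS already_prepared;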


Not powerful enough; I want programming capabilities: composing statements and branching.



You can't pass input to those.


I don't see why you couldn't use prepared statements.


I DON'T want that.

Maybe if there is a case where things are absolutely faster to do in the DB, but oftentimes you get better performance and scaling when you ship that stuff off to the application (it makes things easier to maintain as well).

I'm currently dealing with a legacy system where some wing-nuts decided to push all the business logic into the DB. It is really REALLY hard to improve things like performance and scaling when that happens.


Thanks a lot, didn't know about temp functions. You're hitting the nail on the head with where I want to go with this. I wish postgres put more attention into this kind of interfacing layer.



