When should you use the IN instead of the OR operator in Postgres? (ottertune.com)
100 points by angry__panda on Aug 31, 2023 | 49 comments


Seems strange the article doesn't mention the `= ANY(array)`[0] alternative to `IN (?)`. The problem with `IN` is that PostgreSQL has to run the value list through its slower, general SQL parser, since `IN` is standard SQL and the list can contain arbitrary expressions. You can see in the EXPLAIN output that PostgreSQL converts the `IN (...)` into `= ANY(array)` anyway.

[0] https://www.postgresql.org/docs/current/functions-comparison...
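As a rough sketch of the difference (table and column names here are hypothetical):

  -- The IN list goes through the general SQL parser, then gets rewritten;
  -- on a sequential scan the plan's filter typically reads:
  EXPLAIN SELECT * FROM products WHERE id IN (1, 2, 3);
  --   Filter: (id = ANY ('{1,2,3}'::integer[]))

  -- The array form can be written (and parsed) directly:
  EXPLAIN SELECT * FROM products WHERE id = ANY('{1,2,3}'::integer[]);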


Doesn't the first explain in the article show that the IN is converted to ANY anyway?


The problem is that parsing an `IN` query is slower than parsing an `= ANY(array)` query. Because the syntax for an array literal in PostgreSQL is simple, PostgreSQL can use a faster parser for it. With `IN`, however, it cannot know the type of the next element in the list: since `IN (a, 2, (select id from table), null)` is valid SQL, it needs a parser that permits all those scenarios. But with a `'{1,2,3}'::integer[]` array, it knows there is only one possible layout.

The time it takes the database to parse the SQL query is often overlooked. If you run an ORM with a one-to-many relationship between two tables, modelled in an application layer, you may wish to preload the children onto the main objects. Suppose you are loading thousands, if not millions, of objects and want to preload all of their children: your ORM might simply do this by running two SELECTs, first against the main object's table, then against the children's table with the parent IDs of the main objects included in the query.

When using the `IN` syntax, parsing this will be very slow compared to the `= ANY(array)` syntax. I remember running some tests to compare them: once you get to 100'000 values, `IN` becomes significantly slower. I cannot recall whether it was at 100'000 or 1'000'000 values, but at one of them PostgreSQL just never finished the job with `IN`, while `= ANY(array)` took a couple of seconds.
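A minimal sketch of that preload pattern with an array parameter instead of an `IN` list (table and column names are hypothetical):

  -- One prepared statement serves any number of parent IDs
  PREPARE preload_children(int[]) AS
    SELECT * FROM children WHERE parent_id = ANY($1);

  EXECUTE preload_children('{1,2,3}');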


I'd argue that if you have an IN() with more than a dozen values, you're doing it wrong. Especially so if that's something your app does often and it needs to be fast (as opposed to one-off analytic queries where you don't care about performance).


If your data is intrinsically shaped like 1:100, what are you supposed to do - issue 10x batches of fetches with 10 elements? Give up on a relational database? Start designing custom sharding schemes to avoid doing the "wrong" thing that might take a few ms longer?

And I'll have you know that we do actually care about the performance of analytic queries. The difference in performance might be minutes vs hours, or hours vs days. Just because it's not as quick as point lookup doesn't mean we are completely time insensitive.


If you're issuing a statement that includes an 'IN' clause with thousands of values, where are those values coming from? In a relational database, such a large list of values generally indicates that they aren't arbitrary but come from some table (perhaps with some condition), so the selection should be implemented with a table join - if you're fetching a huge list of IDs from the database and then passing them back as part of a query, that doesn't seem right.

In the grandparent comment, "Suppose you are loading thousands, if not millions of objects, and you wish to preload all of their children, your ORM might simply do this by running two SELECTs ..." - well if your ORM is doing something stupid, then you either fix the ORM or work around the ORM to get a reasonable query that will not uselessly send these millions of object IDs back to the database - what a waste of IO.


I was very confused by your comment (because the obvious answer is to use a JOIN), but then I realized you're talking about using a sharded database. To which my response is: don't. Just buy more RAM.

I realize that this doesn't scale indefinitely, but for the 99% of us who just need to manage a few billion rows, it's the right answer.


The other place where `ANY(array)` is useful is when interfacing with Postgres from code, where you will quickly find that building an `IN (?)` query from an array is disgusting, if not impossible.


It's frustrating when articles like this don't mention the version of the software tested. Because that does matter.

From the PostgreSQL 14 release notes: "Allow hash lookup for IN clauses with many constants (James Coleman, David Rowley). Previously the code always sequentially scanned the list of values."

And in PostgreSQL 15: "Allow hash lookup for NOT IN clauses with many constants (David Rowley, James Coleman). Previously the code always sequentially scanned the list of values."


The article mentions that it was tested on Postgres 14:

"We used a PostgreSQL v14 database on a db.t3.medium Amazon RDS instance equipped with 2 CPUs and 4GB RAM. The storage capacity is 200GB (gp2 storage)."


Seems like an obvious optimization to use a hash instead of scanning the array with comparisons... but I bet it's hard because on small arrays a sequential scan is faster than building a hash first. Does anyone know at what size the optimizer switches from scanning the list to a hash?


Based on my experiments, Postgres switches from scanning the list to a hash when the array size > 8.


The people writing these articles are usually not the most skilled. They're typically just filling their blogs. I swear I've seen this exact same subject at least 5 times on HN before. They're all rehashing each other.


This article perfectly answers the relatively simple question it asks. It is exactly the kind of search result I'm looking for if I ever wonder which approach is better and don't want to experiment myself.


Sometimes this may be the case, but not so much here. Firstly, they mention the version they used; secondly, OtterTune is a startup built around automatic DB setup/schema optimization, founded by folks from CMU's DB group. They most certainly know what they are talking about here.


Or they are new to the industry and thought that they might share their learning journey. Why be so cynical?


Because they don't usually say "I'm learning".


Because I've been around.


That's weird. I thought this would be optimized based on database statistics and not depend on the query syntax. At least for those simple queries.

Can anyone elaborate why this is the case?


from the article:

"For the OR clause query, the DBMS sequentially compares each condition's value one by one. In the given example, it performs three comparisons for each row to determine if (price = 1), (price = 2), and then (price = 3). This evaluation approach means that for N predicates, the complexity of the filtering operation per row is O(N).

On the other hand, with the IN clause, PostgreSQL builds a temporary hash table populated with the elements in the query’s IN clause (ExecEvalHashedScalarArrayOp). Then as the DBMS scans each tuple, it probes this hash table to see whether the tuple's attribute matches with any entry. This hash-based evaluation has a more efficient complexity of O(1) since the DBMS only needs to perform one lookup in the hash table per tuple.

Hence, as the number of predicates increases in a query, the OR clause incurs significant overhead because the DBMS evaluates predicates for each row individually, in contrast to the more efficient IN clause."
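To see this in the plans yourself, compare the filter expressions (assuming a hypothetical `items` table with an integer `price` column):

  -- OR: each disjunct is kept and evaluated per row
  EXPLAIN SELECT * FROM items WHERE price = 1 OR price = 2 OR price = 3;
  --   Filter: ((price = 1) OR (price = 2) OR (price = 3))

  -- IN: collapsed to one array comparison, eligible for the hashed lookup
  EXPLAIN SELECT * FROM items WHERE price IN (1, 2, 3);
  --   Filter: (price = ANY ('{1,2,3}'::integer[]))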


Yes, this explains the how, but not the why.

Why is there no optimization in place for this? Converting `a=x OR a=y OR a=z` to `a IN (x,y,z)` should be trivial, and the DB should have heuristics to estimate the expected query cost and decide when to apply this transformation.


Yes, and Postgres remains staunchly opposed to planner hints because the planner knows better! It always computes the optimal query plan!

The "optimal query plan" changes at the drop of a hat, as you can see in this case. Absolutely trivial syntax changes result in a completely different query, sometimes turning a sequential scan into an index-only scan or vice versa. So, 100x difference in query time, it doesn't just do that for small tables.


To add to this, did you know changing the order of your joins can impact the query plan chosen?


At the scale of three comparisons, performance is equivalent enough that there's no gain.

At the scale of 5,000 comparisons, the language's structure itself heavily disincentivizes writing it as an OR query, unless you're using a code generator that doesn't care about language structure.

My wild guess is that it's a rare enough corner case that it wasn't worth burning the resources on yet.


My gut says it's a resource issue. Oracle has the time and money to optimize a gajillion scenarios that increase performance without making developers think things through. Which makes the product easier to use and seem faster. It's very simple to code this one, but then there are 10,000 other cases you need to code too to make it cover a large percentage of potential optimizations.


Indeed, Postgres may be wildly popular these days thanks to being $free, but it's far from being the most intelligent relational database engine on offer.


It's not that easy to analyze a boolean predicate to extract a set of possible values (especially given the three-valued logic that SQL uses, where `a=x OR a<>x` is not always true). It's easy in the easy cases, of course, but doing the analysis for a more complex expression quickly gets ugly. And I'm not sure it's easy to tell ahead of time whether an expression will be easy or hard to analyze. They could hard-code a few simple patterns, but at that point it may get very unpredictable whether the optimization will be applied or not, defeating the purpose.
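A quick demonstration of the three-valued-logic pitfall:

  -- With a NULL involved, "a = x OR a <> x" evaluates to NULL, not true
  SELECT (NULL::int = 1) OR (NULL::int <> 1);  -- returns NULL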


Every optimizing compiler does this kind of analysis and then some.

It's already completely unpredictable which kind of trivial optimization Postgres fails to apply, it could hardly get any worse.


I would never use the OR operator instead of IN. Not unless perf massively favored OR, and then I would grit my teeth as I type it.


The results seem kind of obvious. The OR operator has the lowest precedence in SQL, requires a full predicate after it, and continually widens the “view” that the query will ultimately see as the union of all matching result sets. It does not imply the same things as an equivalent OR operator in a procedural language. The IN operator takes a single field and internally matches it against all the values in the given list.

Granted, SQL is different from typical procedural/OO/functional languages in the implication that the engine will create the “correct” internal representation of the query. I’m sure DBMSes could heuristically identify the case where someone just chains ORs whose only predicate is comparing the same field for equality against a literal value, and collapse it to an IN. But why? If IN exists, why should the engine care about optimizing OR chains?


I was actually surprised that it mattered. I had expected the query optimizer to figure out the logical equivalent between the two and then pick the most optimal one for me.


Same here. I would have expected the OR and IN versions to generate the same query plan; it seems like low-hanging fruit for the query optimizer.


Please use neither. The reason is that as the number of lookups varies, you'll have to template the query as a string, adding either values to the IN list or `OR` conditions. Not only does this cause excessive string concatenation on your app side, these queries are also seen as distinct queries, which has the following drawbacks:

- every query has to be planned

- your DB driver can't prepare the queries, as they are all different

- collecting per-query stats becomes a nightmare if the number of arguments per query varies over a wide range; metrics cardinality is a problem

The correct way to handle it is to pass all args as a single parameter of type array and use `= ANY($1)`, or, if there are multiple columns, build a virtual table and join:

  -- type_of_a / type_of_b are placeholders for the actual column types
  SELECT a, b FROM my_table
  NATURAL JOIN ROWS FROM (
    unnest($1::type_of_a[]),
    unnest($2::type_of_b[])
  ) AS t(a, b);
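For the single-column case the same idea is just `WHERE a = ANY($1)`. And a concrete (hypothetical) variant of the sketch above, with the join condition spelled out and literal arrays inlined for readability:

  SELECT t.*
  FROM my_table t
  JOIN unnest('{1,2}'::int[], '{x,y}'::text[]) AS k(a, b)
    ON t.a = k.a AND t.b = k.b;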


>every query has to be planned

I thought PG planned every query anyway; it does not do plan caching.


It does, but AFAIK only for prepared queries, and the cache is local to the backend serving the query. See the `plan_cache_mode` param.
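For reference, the parameter exists since PostgreSQL 12:

  SHOW plan_cache_mode;                     -- default is 'auto'
  SET plan_cache_mode = force_generic_plan; -- always reuse the generic plan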


Which has very limited usefulness, since it only lasts for the session. It helps if you are doing the same query over and over on the same connection, like bulk inserts, but not for, say, the same SELECT used across different requests where the connection is closed or given back to the pool each time - which is far more common and would probably line up with the OR/IN issue here.

I don't even think PG caches plans for functions/sprocs. Back in the day with MSSQL before it had full plan caching based on statement text you could at least make sprocs for common queries and get plan reuse which had a large impact on perf in many apps.


Typically you try and avoid closing a connection on each request, you hand it back to a connection pool. The underlying session stays open and associated with the connection (if you have your pooler set up right), so subsequent requests still use the cache.

I agree with the original commenter about ANY as well: using IN for dynamic lists of parameters makes viewing useful information in e.g. pg_stat_statements impossible, though it's possible there's been some recent work around normalizing these.


This is why I hate SQL. Once an abstraction fails so badly that you must understand minutia of its underlying implementation, it's now a burden instead of a benefit.


This is how I learned of the existence of the `EXPLAIN` keyword.

Somehow I managed to miss its existence all this time, and it immediately addresses my biggest frustration with relational DBs: the method by which they evaluate a request is completely opaque and more-or-less disjoint from the SQL language (i.e. the plan they choose is completely open-ended), so it's nearly impossible to just guess whether a given query will be super expensive.

Very good to know.
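For anyone else just discovering it: plain EXPLAIN shows the estimated plan, while EXPLAIN (ANALYZE, BUFFERS) actually executes the query and reports real timings and buffer reads (table and column here are hypothetical):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM orders WHERE status IN ('new', 'paid');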


You might be interested in sites like https://explain.dalibo.com/ which make the output a bit nicer to read. I use these quite often to quickly identify bottlenecks.


I think it’s a good habit to EXPLAIN all your production queries. But watch out for the fact that the same query can optimise differently on different instances of the “same” database, because the optimiser uses table statistics to find the “best” execution plan. So if your dev/test environments have different datasets, or the stats are not properly maintained, you can get completely different execution plans on each system.
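If stale statistics are the culprit, refreshing them is a one-liner (table name hypothetical):

  -- Recollect planner statistics so plans reflect the real data distribution
  ANALYZE my_table;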


Looking at the execution plan, IN() or =ANY() seems faster: https://dbfiddle.uk/zIk9IrEN

OR reads the same number of buffers, but produces a much longer execution plan


> Based on our results, we find that PostgreSQL consistently exhibits equivalent or better performance for queries with an IN clause than OR clauses when filtering on a single attribute. This is especially true for queries with a large number of predicates. For small condition sizes (e.g., less than 10), the performance difference between the two query types is negligible.

TL;DR


Shouldn't a database backend be able to semantically determine the two queries are equivalent and perform the same operation?


I have been using Postgres for 10 years and didn't know you could have an IN with multiple columns. I don't see a lot of use cases for it right now, though (in a regular app).
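For reference, the row-constructor form looks like this (table and columns hypothetical):

  SELECT * FROM t WHERE (a, b) IN ((1, 'x'), (2, 'y'));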


I have a CSV import where it is necessary to select by key columns; multiple columns within IN come in handy for that. Unfortunately I don't use Postgres.


Same with Mongo. $in is usually much faster than many $or clauses, especially combined with sorting, which warrants its own article :)


When should you use IN instead of the OR operator?

Tl;dr;

Always.


This is true for Postgres, but it might not be true for other databases. For MySQL, for example, OR is faster than the IN operator in some cases, based on our experiments.



