
What kind of problems did you encounter?


Not OP, but I can answer this:

Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion, with no coordination. There are many examples, but the first that comes to mind is Twitter writing Snowflake, their own custom ID-generation scheme, to scale tweets [0]

[0]: https://blog.twitter.com/engineering/en_us/a/2010/announcing...
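To illustrate the coordination-free property: any process on any machine can mint a random UUID independently, with no shared state and a vanishingly small collision probability (a minimal Python sketch):

```python
import uuid

# Each of these could be generated on a different machine with no
# shared state; UUIDv4 carries 122 random bits, so collisions are
# astronomically unlikely.
ids = [uuid.uuid4() for _ in range(1000)]

assert len(set(ids)) == 1000  # all distinct, no central sequence needed
```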


> Integers don't scale because you need a central server to keep track of the next integer in the sequence.

They most assuredly do scale. [0]

Also, Slack is built on MySQL + Vitess [1], the same system behind PlanetScale, which internally uses integer IDs [2].

[0]: https://www.enterprisedb.com/docs/pgd/latest/sequences/#glob...

[1]: https://slack.engineering/scaling-datastores-at-slack-with-v...

[2]: https://github.com/planetscale/discussion/discussions/366


I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.

It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.

I’ve worked at a variety of companies from small to medium-large, and I can’t remember a single instance where we wished we’d used integer IDs. It’s always been the opposite: having to work around conflicts and auto-incrementing.


In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.

I've personally run MySQL in RDS on a mid-level instance, nowhere near maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.

I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.


Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.

If you're using MySQL, integer IDs may make sense, because it scales differently than PostgreSQL.


If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.

If you're referring to avoiding conflicts between distributed nodes, that's a solved problem as well – distribute chunked ID ranges of size N to each node.
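The chunked-range idea can be sketched roughly like this (hypothetical names, not any particular database's API; the "allocator" stands in for whatever central component hands out blocks):

```python
import itertools
import threading

class RangeAllocator:
    """Central coordinator: hands out disjoint blocks of integer IDs."""
    def __init__(self, block_size=1_000_000):
        self._block_counter = itertools.count(0)
        self._block_size = block_size
        self._lock = threading.Lock()

    def reserve_block(self):
        # One roundtrip here covers block_size local ID generations.
        with self._lock:
            start = next(self._block_counter) * self._block_size
        return start, start + self._block_size

class Node:
    """A node that mints IDs locally from its reserved block."""
    def __init__(self, allocator):
        self._allocator = allocator
        self._cur, self._end = allocator.reserve_block()

    def next_id(self):
        if self._cur >= self._end:  # block exhausted: one more roundtrip
            self._cur, self._end = self._allocator.reserve_block()
        nid = self._cur
        self._cur += 1
        return nid
```

Two nodes sharing one allocator can then generate IDs concurrently with no per-ID coordination and no collisions, since their blocks never overlap.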


Whatever is distributing the chunks is still a point of central coordination.


Yes, and?

If you can't manage minor levels of coordination because your database is on fire, the problem is that your database is on fire.


Fewer points of coordination is always better.

In general you shouldn't need to make a roundtrip to produce an ID.


> Fewer points of coordination is always better.

The distributed database needs a coordination system anyway, so it's not an additional point.

> In general you shouldn't need to make a roundtrip to produce an ID.

Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.


> The distributed database needs a coordination system anyway, so it's not an additional point.

Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.

OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.


> Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

I'm assuming a system that tracks nodes and checks for quorum(s), because if you let isolated servers be authoritative then your data integrity goes to hell. If you have that system, you can use it for low-bandwidth coordinated decisions like reserving blocks of ids.

Am I wrong to think that most distributed databases have systems like that?

> OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.

Sure, but the first thing you said in this conversation was "Whatever is distributing the chunks is still a point of central coordination." which is equally narrow, so I wasn't expecting you to suddenly broaden when I asked why that mattered.


> I'm assuming a system that tracks nodes and checks for quorum(s)

Not sure why.

> because if you let isolated servers be authoritative then your data integrity goes to hell

Many AP systems maintain data integrity without central authorities or quorums for data.

> Am I wrong to think that most distributed databases have systems like that?

No, not wrong! Just that it's one class of distributed systems, among many.


Though if you're running AP then I sure hope you have a reconciliation system, and a good reconciliation system can handle that kind of ID conflict. (Maybe you still want to avoid it to speed that process up but that really gets into the weeds.)


The way to solve that is giving each server its own range of IDs.


Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this won't work, obviously, unless you pair it with a unique server ID.


I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.

This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.

The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.
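That resumable-import pattern can be sketched like this (hypothetical names; the set stands in for a `SELECT id FROM target` against the real table):

```python
import uuid

# Hypothetical source rows to import (in reality, read from a file/API).
rows = [{"name": f"item-{i}"} for i in range(5)]

# Assign IDs offline, before touching the database: random generation
# means no coordination with the live table and no conflict risk.
for row in rows:
    row["id"] = str(uuid.uuid4())

already_imported = set()  # stand-in for querying existing IDs in prod

def import_batch(rows, fail_after=None):
    """Insert rows, skipping any ID already present (idempotent resume)."""
    for n, row in enumerate(rows):
        if row["id"] in already_imported:
            continue  # done in a previous (failed) run
        if fail_after is not None and n >= fail_after:
            raise RuntimeError("simulated mid-import crash")
        already_imported.add(row["id"])

try:
    import_batch(rows, fail_after=2)  # crashes after two inserts
except RuntimeError:
    pass
import_batch(rows)  # re-run skips what's done, finishes the rest
assert len(already_imported) == 5
```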


    postgres=# CREATE TABLE foo(id INT, bar TEXT);
    CREATE TABLE
    postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
    INSERT 0 1
    postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED 
               ALWAYS AS IDENTITY (START WITH 2);
    ALTER TABLE
    postgres=# INSERT INTO foo (bar) VALUES ('ACK');
    INSERT 0 1
    postgres=# TABLE foo;
     id |     bar
    ----+--------------
      1 | Hello, world
      2 | ACK
    (2 rows)


I don’t understand what you’re getting at. This was a pre-existing Postgres db in production.

I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.


You said data import, so I assumed it was pulling rows into an empty table. The example I posted was a way to create a table with a static integer PK that you could rapidly generate in a loop, and then later convert it to auto-incrementing.

> I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.

IME, if something is easy with RDBMS in prod, it usually means you’re paying for it later. This is definitely the case with UUIDv4 PKs.


No I mean an active prod table with people adding new rows all the time. It's just so much easier not having to worry about integer conflicts and auto-incrementing shenanigans.

But I get you like integers so whatever works for you, I just don't think they're the right tradeoff for most projects.


This doesn’t really help you in this case, because the patch is to generate the UUIDs in the database?


Now you can use PG to generate the UUIDv7s at the beginning, then easily switch to generating them in the client if you need to later. But I think OP was talking about UUIDs vs auto-incrementing integers in general, not specific to Postgres.


IMO it’s always been easier to generate them in the client. Every major platform has had libraries since forever.
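And even where a library doesn't yet expose UUIDv7, it's simple to build client-side; a minimal sketch following the RFC 9562 layout (48-bit Unix-ms timestamp, version, variant, random tail):

```python
import os
import time
import uuid

def uuid7():
    """Minimal UUIDv7: 48-bit Unix-ms timestamp + 74 random bits."""
    ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = (ms & ((1 << 48) - 1)) << 80              # timestamp (48 bits)
    value |= 0x7 << 76                                # version = 7
    value |= ((rand >> 68) & 0xFFF) << 64             # rand_a (12 bits)
    value |= 0b10 << 62                               # variant = RFC 4122
    value |= rand & ((1 << 62) - 1)                   # rand_b (62 bits)
    return uuid.UUID(int=value)
```

Because the timestamp occupies the high bits, these sort by creation time, which gives much better index locality than UUIDv4.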


They also leak information.


I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.


Yes, but the only sane way to generate integer IDs is in the database.


Here are some reasons for using UUIDs; they don't all apply to every business:

- client-side generation (e.g. can reduce complexity when doing complex creation of data on the client side, then some time later actually inserting it into your db)

- sequential ids leak competitive information: https://en.wikipedia.org/wiki/German_tank_problem

- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
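The competitive leak from sequential IDs is quantifiable: given a handful of observed IDs, the classic German-tank estimator guesses the total count as m(1 + 1/k) − 1, where m is the largest ID seen and k is the sample size (assuming IDs start at 1). A quick sketch:

```python
def german_tank_estimate(observed_ids):
    """Estimate population size from sampled sequential IDs (the MVUE)."""
    k = len(observed_ids)
    m = max(observed_ids)
    return m * (1 + 1 / k) - 1

# A competitor who saw five of your order numbers can estimate your
# total order volume:
print(german_tank_estimate([61, 19, 56, 24, 88]))  # ≈ 104.6
```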


It's also much easier to merge data from different sources when they all use UUIDs for row identification.



