TIL: Versions of UUID and when to use them

Lammy · on Aug 25, 2024

> UUID Version 2 (v2) is reserved for security IDs with no known details.

Only no known details if the only document you're reading is the notoriously poorly-specified RFC. Here you go: https://pubs.opengroup.org/onlinepubs/9696989899/chap5.htm#t...

There are also “version 0” UUIDs that you are very unlikely to ever come across but should be noted because they are the source of the reserved bits (via wastefully setting aside an entire octet for Address Family) that later allowed the other “versions” to be specified in a compatible way. Read my research about them here in my UUID library: https://github.com/okeeblow/DistorteD/blob/NEW%E2%80%85SENSA...

I decided to support them Because It's Cool™ but still need to figure out how to handle the date rollover of them and the even-older Apollo UIDs:

  irb> ::GlobeGlitter::from_ncs_time
  => "#<GlobeGlitter 40639cd25341.02.00.00.e0.4c.18.00.69>"
  irb> ::GlobeGlitter::from_ncs_time.to_time
  => 1988-12-21 14:52:02 UTC
  irb> ::GlobeGlitter::from_aegis_time
  => "#<GlobeGlitter 00000000-0000-0000-4814-17c8b0080069>"

(Proper AEGIS `#to_str` not implemented yet lol)

kyzer-davis · on Sept 4, 2024

> Only no known details if the only document you're reading is the notoriously poorly-specified RFC

To be fair, in RFC9562 I did cite two documents that are UUID Version 2 specifications. But RFC4122 was too cryptic for my taste.

As for the historical UUID types specified by the 0-7 Variant space: We are starting work on an informational RFC that will help folks understand those.

See https://github.com/yocto/draft-yocto-uuid if you want to add to some discussions and/or review text. We are still a bit early into the stages but I hope to have some progress soon.

dlgeek · on Aug 25, 2024

> UUID Version 2 (v2) is reserved for security IDs with no known details.

I found the details in about 2 minutes: Click the link in the article to take me to the section of RFC 9562 that says it's defined as part of DCE, click the first link in that paragraph to go to the spec, ctrl-f "UUID", then jump to appendix A (deceptively named "Universal Unique Identifier") which has all the details.

Is it really too much to ask to CLICK YOUR OWN LINKS?

dan-robertson · on Aug 26, 2024

I mean, there’s two things that one might learn about. One is reading a specification for what a v2 uuid is. The other is learning information about the circumstances of a UUID’s generation from the uuid. Obviously the sentence was about the latter, as could be determined from the context as the other versions are described as being generated from specific things.

I agree the sentence is a bit unclear, but I don’t think it’s misleading or whatever.

octernion · on Aug 25, 2024

haha I had the precise same thought process and immediately didn't finish the article since they didn't have much attention to detail.

i enjoyed reading the appendix though as a snapshot of time.

THBC · on Aug 25, 2024

Probably written by a language model

hamasho · on Aug 25, 2024

I wish there were a standard for short UUIDs, like `73WakrfVbNJBaAmhQtEeDv` or `bK7nP9xM`. I mean, it's not a UUID because it can be duplicated somewhere; I just want a standard ID that combines randomness with being short enough to remember.

gregmac · on Aug 25, 2024

The closest that comes to mind is ULID[0]. It is short (26-character base32), 128-bit, and lexicographically sortable.

I think the reason there's no other popular standard is you give up something. 128 bit gives a pretty low risk of collisions in almost all uses, but as you go smaller you start having to consider the specific scenario and impact, etc, which doesn't work well for a standard.

You could use another encoding (eg base64 or base85) to get it shorter, but you start sacrificing other things (case sensitivity, url-safeness) - again, not great for a standard.

[0] https://github.com/ulid/spec

notpushkin · on Aug 26, 2024

For my own project, I went with base32-encoded UUIDv7, prefixed with type name (a la Stripe ids). Compared to ULID, it's still lexicographically sortable, but is backed by an actual standard, so a bit more sound IMO.

UUIDv7 will only work until year 4147 (compared with ULID's 10889AD), but by then I think we'll have another UUID version we can switch to.

Here's my implementation in Python: https://codeberg.org/prettyid/python, https://pypi.org/project/prettyid

And a rudimentary TypeScript library: https://codeberg.org/prettyid/js, https://npm.im/prettyid

djbusby · on Aug 26, 2024

And ULID is translatable to UUID. So, eg, use ULID on display, links, etc and UUID data type in the language/DB
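A rough sketch of that translation in Python, assuming the Crockford base32 alphabet from the ULID spec and treating the UUID as a plain 128-bit integer. The function names are made up for illustration; this is not taken from any ULID library.

```python
import uuid

# Crockford base32 alphabet as used by the ULID spec (no I, L, O or U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def uuid_to_base32(u: uuid.UUID) -> str:
    """Render the 128-bit value as 26 characters (26 * 5 = 130 bits, top 2 are zero)."""
    n = u.int
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[n & 0x1F])  # lowest 5 bits first
        n >>= 5
    return "".join(reversed(chars))        # most significant character first

def base32_to_uuid(s: str) -> uuid.UUID:
    """Inverse of uuid_to_base32; accepts lower-case input as well."""
    n = 0
    for ch in s.upper():
        n = (n << 5) | CROCKFORD.index(ch)
    return uuid.UUID(int=n)

u = uuid.UUID("483971cf-aad7-4c84-abf1-4a94c9d72f99")
short = uuid_to_base32(u)
print(short, base32_to_uuid(short))
```

Because the output is fixed-width and most-significant-first, string order matches numeric order, so a time-ordered UUID stays sortable in this form.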

andrewstuart · on Aug 25, 2024

I was just today wanting shorter UUIDs so if you like a more compact/short UUID you can convert them like so to url safe base64.

It's the same UUID, just in 22-character form, and it can be converted back. It's not really a conversion because a UUID is just a 128-bit value, so it's an alternative representation.

  483971cf-aad7-4c84-abf1-4a94c9d72f99 -> SDlxz6rXTISr8UqUydcvmQ (length: 22)
  fb67926f-3cfb-486c-a7da-30662147a20b -> A2eSbzz7SGyn2jBmIUeiCw (length: 22)
  799069a9-b32a-415f-b689-a8cc3f51bfa4 -> eZBpqbMqQVA2iajMP1GBpA (length: 22)
  8161ee0b-f7a5-4b32-95ea-9b9efe94e5f2 -> gWHuCBelSzKV6pueBpTl8g (length: 22)
  b1ea416c-f209-43cb-bfaf-d9cf6229459e -> sepBbPIJQ8uBr9nPYilFng (length: 22)
  ee70989a-b614-4665-9881-41054544c313 -> 7nCYmrYURmWYgUEFRUTDEw (length: 22)
  cce06fe2-b64f-47bc-a91a-d3dfd343e1e5 -> zOBv4rZPR7ypGtPf00Ph5Q (length: 22)
  aea3de6e-e769-4c8d-ba2d-77922d227176 -> rqPebudpTI26LXeSLSJxdg (length: 22)

    import uuid
    import base64
    
    def make_short_uuid(data):
        encoded = base64.urlsafe_b64encode(data).rstrip(b'=').decode('utf-8')
        return encoded.replace('-', 'A').replace('_', 'B')
    
    def generate_and_print_uuids():
        for _ in range(8):
            uuid_obj = uuid.uuid4()
            uuid_bytes = uuid_obj.bytes
            print(f'{uuid_obj} -> {make_short_uuid(uuid_bytes)} (length: {len(make_short_uuid(uuid_bytes))})')
    
    generate_and_print_uuids()

andrewstuart · on Aug 26, 2024

Thinking about it, the above is wrong because it replaces - and _ with A and B, which makes it not reversible. The version below is reversible. You could also come up with a solution that has no - or _ by doing a custom encoding that uses only A-Z, a-z, and 0-9.

If you really want to not have _ or - in your short form UUIDs you could just discard the UUID when you create it if the short form includes those characters and try again.

    c22c1dcf-ea74-470e-acbf-b1722e243025 -> wiwdz-p0Rw6sv7FyLiQwJQ (length: 22)
    Reversed: c22c1dcf-ea74-470e-acbf-b1722e243025
    8702aecb-6d09-4a5e-8cc8-621aada6ed96 -> hwKuy20JSl6MyGIarabtlg (length: 22)
    Reversed: 8702aecb-6d09-4a5e-8cc8-621aada6ed96
    643a9829-9f91-4b88-80a6-db2c0eb83e8b -> ZDqYKZ-RS4iAptssDrg-iw (length: 22)
    Reversed: 643a9829-9f91-4b88-80a6-db2c0eb83e8b
    8e7f3c1a-3d19-425e-8803-fe55a296688e -> jn88Gj0ZQl6IA_5VopZojg (length: 22)
    Reversed: 8e7f3c1a-3d19-425e-8803-fe55a296688e
    1859f017-a5f3-4875-825a-fdd531dfac1a -> GFnwF6XzSHWCWv3VMd-sGg (length: 22)
    Reversed: 1859f017-a5f3-4875-825a-fdd531dfac1a
    6a153b44-7fca-45b2-b13e-7f45790be7bf -> ahU7RH_KRbKxPn9FeQvnvw (length: 22)
    Reversed: 6a153b44-7fca-45b2-b13e-7f45790be7bf
    fd6bad83-a0f8-4c7f-baf1-10374be3e8e9 -> _Wutg6D4TH-68RA3S-Po6Q (length: 22)
    Reversed: fd6bad83-a0f8-4c7f-baf1-10374be3e8e9
    cf2452d4-947b-4b92-a280-ff869e77ba65 -> zyRS1JR7S5KigP-Gnne6ZQ (length: 22)
    Reversed: cf2452d4-947b-4b92-a280-ff869e77ba65
    
    import uuid
    import base64
    
    def make_short_uuid(data):
        return base64.urlsafe_b64encode(data).rstrip(b'=').decode('utf-8')
    
    def reverse_short_uuid(short_uuid):
        # Add padding back to make it Base64 decodable
        restored = short_uuid + '=' * (-len(short_uuid) % 4)
        # Decode the Base64 string back to bytes
        return base64.urlsafe_b64decode(restored)
    
    def generate_and_print_uuids():
        for _ in range(8):
            uuid_obj = uuid.uuid4()
            uuid_bytes = uuid_obj.bytes
            short_uuid = make_short_uuid(uuid_bytes)
            reversed_uuid_bytes = reverse_short_uuid(short_uuid)
            print(f'{uuid_obj} -> {short_uuid} (length: {len(short_uuid)})')
            print(f'Reversed: {uuid.UUID(bytes=reversed_uuid_bytes)}\n')
    
    generate_and_print_uuids()

pants2 · on Aug 25, 2024

Sqids[1] might fit the bill for you - the IDs it produces are much shorter than UUIDs, however they're not universally unique - they're generated from an integer sequence.

1. https://sqids.org/

slivanes · on Aug 25, 2024

A feature of a Sqid library I've used is that it can pad the value out to a minimum set of characters, so even an internal id of 1 can look substantial.

https://github.com/sqids/sqids-php

NetOpWibby · on Aug 25, 2024

Sqids looks fantastic, thanks for sharing!

wereHamster · on Aug 25, 2024

I usually generate N bits of randomness and base58 encode it. Choose N to your liking. You lose the benefits of monotonic sorting that is present in some UUID versions. Base58 is url safe and does not contain any special characters. And you can still store values as binary (eg. bytea in Postgres instead of a text column).

physicles · on Aug 26, 2024

I've also taken UUIDs and re-encoded them as base58. Works fine.
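For anyone who wants to try it, here is a minimal sketch of both ideas (N random bits, or an existing UUID, rendered as base58). It assumes the Bitcoin-style alphabet; the helper names are hypothetical and the output is variable-length unless you pad it yourself.

```python
import secrets
import uuid

# Bitcoin-style base58 alphabet: no 0, O, I or l.
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(n: int) -> str:
    """Encode a non-negative integer as base58, most significant digit first."""
    if n == 0:
        return BASE58[0]
    out = []
    while n:
        n, rem = divmod(n, 58)
        out.append(BASE58[rem])
    return "".join(reversed(out))

def base58_decode(s: str) -> int:
    """Inverse of base58_encode."""
    n = 0
    for ch in s:
        n = n * 58 + BASE58.index(ch)
    return n

def random_id(bits: int = 64) -> str:
    """N bits from the OS CSPRNG, rendered as base58."""
    return base58_encode(secrets.randbits(bits))

u = uuid.uuid4()
short = base58_encode(u.int)
print(u, "->", short, "->", uuid.UUID(int=base58_decode(short)))
print(random_id(64))
```

A 128-bit value needs at most 22 base58 characters, the same length as unpadded base64 but without `-` or `_`.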

kyzer-davis · on Sept 4, 2024

We are working on standardizing an alternate encoding technique for the 128 bit UUID so it can have a shorter text form.

You can follow the discussions here: https://github.com/uuid6/new-uuid-encoding-techniques-ietf-d...

candiddevmike · on Aug 26, 2024

Nanoid? https://github.com/ai/nanoid

Terr_ · on Aug 26, 2024

> combination of random and short

IMO we need to be clear on the distinction between (A) the UUID bit-generation scheme versus (B) the way it is encoded for human use/reading/transcription.

They are mostly-separate problems.

For example, you could have a very secure mathematical scheme, but it gets ruined by a horrible representation where each bit is written as either a capital-I, a lowercase-l, or the number 1.

Conversely, you could have a deeply insecure scheme that uses a nice compact serialization where everything is grouped into chunks and "1Il" confusion is not possible and there's a check-digit, etc.

tommy_axle · on Aug 25, 2024

Not a standard per se but nanoid seems to fit the bill. Widely implemented.

geitir · on Aug 25, 2024

Git uses SHA and then dynamically sets the number of characters to use based on repository size. You could do something like this.
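A simplified sketch of the idea, assuming you already hold the full set of IDs in memory. This is not Git's actual algorithm, just the "grow the prefix until nothing collides" part; `shortest_unique_length` is a made-up name.

```python
import hashlib

def shortest_unique_length(ids, minimum=4):
    """Smallest prefix length (>= minimum) at which every ID is still distinct."""
    distinct = set(ids)
    length = minimum
    # Grow the prefix until no two distinct IDs share it.
    while len({i[:length] for i in distinct}) < len(distinct):
        length += 1
    return length

# Hypothetical data: 1000 SHA-1 hex digests standing in for object IDs.
ids = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(1000)]
n = shortest_unique_length(ids)
print(n, [i[:n] for i in ids[:3]])
```

The catch is that an abbreviation that is unique today can become ambiguous once more IDs are added, so short forms are better for display than for storage.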

jfdjkfdhjds · on Aug 25, 2024

just use creation timedate plus auto increment int.

and then a small hash with base64 or 37 or whatever is in vogue these days.

thats what old timers used before uuid 1.

guess we should guerilla standardize something like this as uuid-0 or uuid-deprecated-2.0 for keeping up with the spirit.

selcuka · on Aug 25, 2024

> just use creation timedate plus auto increment int.

The problem with auto-increment integer ids is they are not always possible with distributed systems.

bigiain · on Aug 26, 2024

I've used schemes like "concatenate a shared secret, millisecond resolution times, local autoinc ID, and some sort of distributed machine identifier (like ip address or MAC address), then taken a truncated hash of that with as many bits as needed for the desired uniqueness guarantees."

I wouldn't use it for assigning bank account numbers, but for most web or app stuff it's fine.
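Roughly, in Python, and only as a sketch: SHA-256 is my stand-in for "some hash", and the secret, machine identifier and bit width are placeholders you would pick yourself.

```python
import hashlib
import itertools
import time

_counter = itertools.count()  # local auto-increment, per process

def hashed_id(secret: bytes, machine_id: str, bits: int = 64) -> str:
    """Hash secret + millisecond time + local counter + machine id, keep the top `bits` bits."""
    millis = time.time_ns() // 1_000_000
    seq = next(_counter)
    material = b"|".join(
        [secret, str(millis).encode(), str(seq).encode(), machine_id.encode()]
    )
    digest = hashlib.sha256(material).digest()
    value = int.from_bytes(digest, "big") >> (256 - bits)  # truncate the hash
    return f"{value:0{(bits + 3) // 4}x}"                  # fixed-width hex

print(hashed_id(b"not-a-real-secret", "10.0.0.7"))
print(hashed_id(b"not-a-real-secret", "10.0.0.7"))
```

Uniqueness here is only probabilistic: with a 64-bit truncation you should expect collisions to become plausible somewhere around a few billion IDs.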

sgarland · on Aug 26, 2024

They are, actually, you just have to coordinate the ranges each node has.

selcuka · on Aug 26, 2024

What if you don't know how many nodes you have? UUIDs can also be generated on the client side (in cases where you can trust the client).

sgarland · on Aug 26, 2024

> UUIDs can also be generated on the client side (in cases where you can trust the client).

I'm fairly certain the first rule of websec is you never trust the client. I definitely would not trust a user's browser to directly insert a value into a DB.

> What if you don't know how many nodes you have?

Shouldn't matter; you have a centralized system that hands out chunks of IDs on-demand (and has its own mechanism to ensure no repeats). This is similar to what Vitess [0] does.

[0]: https://vitess.io/docs/20.0/reference/features/vitess-sequen...
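A toy, single-process sketch of the pattern (not how Vitess implements it): `RangeAllocator` stands in for the central service, and each `Node` only talks to it when its current block runs out. Both class names are invented for the example.

```python
import threading

class RangeAllocator:
    """Stand-in for the central service: hands out disjoint blocks of IDs."""

    def __init__(self, block_size: int = 1000):
        self._next = 1
        self._block_size = block_size
        self._lock = threading.Lock()

    def allocate(self) -> range:
        with self._lock:  # the only coordination point
            start = self._next
            self._next += self._block_size
        return range(start, start + self._block_size)

class Node:
    """A worker that draws IDs from its current block and refills on demand."""

    def __init__(self, allocator: RangeAllocator):
        self._allocator = allocator
        self._ids = iter(())  # empty until the first request

    def next_id(self) -> int:
        for value in self._ids:
            return value
        # Block exhausted: fetch a fresh one from the central allocator.
        self._ids = iter(self._allocator.allocate())
        return next(self._ids)

alloc = RangeAllocator(block_size=3)
a, b = Node(alloc), Node(alloc)
print([a.next_id(), b.next_id(), a.next_id(), a.next_id(), a.next_id(), b.next_id()])
```

IDs are unique across nodes but only roughly ordered, and any unused part of a block is lost if a node restarts.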

selcuka · on Aug 26, 2024

> I'm fairly certain the first rule of websec is you never trust the client.

Not every piece of information is confidential in every system. Sometimes a UUID is just that, a UUID.

> you have a centralized system that hands out chunks of IDs on-demand

I don't follow. If your system requires a central node that can reliably generate unique auto-incrementing integer IDs, why bother with UUIDs at all? Just base-64 encode the integer ID, or hash it with a salt to protect against enumeration attacks, if you want.

If you don't want the dependency to a centralised system, just use UUIDv7, which is just a timestamp plus random bits, or implement a shorter version of it. There is no need to overengineer.
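To make "timestamp plus random bits" concrete, here is a minimal sketch of the RFC 9562 layout (48-bit millisecond timestamp, version nibble, variant bits, the rest random). It makes no attempt at monotonicity within the same millisecond, and `uuid7_sketch` is a name I made up, not a standard library function.

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """48-bit Unix ms timestamp, then version/variant bits, then random bits."""
    millis = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits; 6 get overwritten
    value = ((millis & ((1 << 48) - 1)) << 80) | rand
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 10
    return uuid.UUID(int=value)

u = uuid7_sketch()
print(u, "created at ms", u.int >> 80)
```

Shortening it is then just a matter of keeping fewer random bits, at the cost of a higher collision risk per millisecond.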

sgarland · on Aug 26, 2024

> I don't follow. If your system requires a central node that can reliably generate unique auto-incrementing integer IDs, why bother with UUIDs at all?

I also don’t follow. I thought your initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for UUIDs.

Monotonic ints, or more broadly anything k-sortable, are generally optimal for RDBMS indices due to most indices being B+trees. That’s why there’s such enormous effort towards NOT using UUIDv4.

> just use UUIDv7

Indeed; this is my recommendation when devs insist they can’t possibly use integers. Personally, I maintain that most places can use ints, it’s just that they’ve hideously over-complicated things to the point that it would be far too much work.

selcuka · on Aug 26, 2024

> your initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for UUIDs.

You suggested timestamp+autoinc, and my initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for the random part after the timestamp (a la UUIDv7). I see that we have actually been on the same page.

michaelt · on Aug 26, 2024

When the competition is a random 128 bit number, you can assume you've got a 32 bit node count (4.3 billion nodes) a 45 bit millisecond count (1100 years) and you've still got 51 bits letting each node generate 2 quadrillion IDs per millisecond.
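The arithmetic checks out; a quick Python sketch, where the field order inside `pack` is just one possible choice of mine:

```python
NODE_BITS, TIME_BITS, SEQ_BITS = 32, 45, 51
assert NODE_BITS + TIME_BITS + SEQ_BITS == 128

nodes = 2 ** NODE_BITS                            # 4,294,967,296 nodes
years = 2 ** TIME_BITS / 1000 / 86400 / 365.25    # roughly 1,115 years of milliseconds
ids_per_ms = 2 ** SEQ_BITS                        # about 2.25 quadrillion per node per ms

def pack(node: int, millis: int, seq: int) -> int:
    """Timestamp in the high bits so IDs sort by time, then node, then sequence."""
    return (millis << (NODE_BITS + SEQ_BITS)) | (node << SEQ_BITS) | seq

print(nodes, round(years), ids_per_ms)
print(hex(pack(node=7, millis=1_724_630_400_000, seq=1)))
```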

The real benefit of UUIDs is the 'consistency' of the one-size-fits-most approach. If you can do without IDs humans can read out, or readable plain text logs, or compressibility, or recognisable formats for different types of ID? Then UUIDs can be used for anything from customer orders to web requests to log lines.

BohuTANG · on Aug 26, 2024

UUID v7's timestamp is a game-changer for Databend. We're using it to quickly locate metadata files on AWS S3 by timestamp, making operations like vacuuming much faster.

PR: https://github.com/datafuselabs/databend/pull/16049

dan-robertson · on Aug 26, 2024

Interesting. I don’t think I’d ever really consider using uuids to look something up by time (I’d probably want an actual timestamp field instead). To me, the big advantage of the uuids being sorted by time is that you get better locality: inserting new entries into an index is probably a mostly-append operation which could be cheaper than a random insert (but may have more contention too; worth considering a hybrid with some random bits before the timestamp to have more sorted ‘shards’) and in many places reads tend to be for more recent data so it can be nice to have the more recent data in one place (and eg pulled into more caches).

ElFitz · on Aug 26, 2024

It’s also great for key-value store like DynamoDB. Much cleaner than compound keys prefixed with a timestamp or iso date (or worse: wasting a secondary index on timestamps).

amarcheschi · on Aug 25, 2024

I'm failing to understand the purpose of having uuid2. I didn't even know that more types existed till now. I had only encountered uuid2 when asking xandr to remove my personal data from its database. (discussion about xandr being asked to be investigated in Europe by noyb here https://news.ycombinator.com/item?id=40913915)

By reading the Wikipedia page I'm failing to understand why we invented something called a universally unique identifier and have different types of it, some of which can be traced back to the original PC. Is it because mixing in MAC addresses increases the chance of the uuid2 being random, or is there a different reason? For privacy reasons, could we not just have a very long identifier with many different chars to choose from, so that we have so many combinations that we're almost guaranteed to be using non-duplicated uuids?

Lammy · on Aug 25, 2024

Their original purpose was to identify messages in Apollo's distributed computing architecture. UID and later UUIDs were a reversible way to mark an intersection point between two dimensions. Any two machines would generate the same UID/UUID for the same two inputs, and a recipient of an identified message could reverse the identifier back into the original components. They were designed as labels for ephemeral messages so the two dimensions were time and hardware ID (originally Apollo serial number, later Ethernet hwaddress etc).

I think a lot of the confusion can be traced to the very earliest AEGIS implementation where the Apollo engineers started using “canned” (their term, i.e. static or well-known) UIDs to identify filesystems. Over time the popular usage of UUID fully shifted from ephemeral identifiers where duplicates were intentional toward canned identifiers where duplicates were unwanted and the two dimensions were random-and-also-random.

The history gets even more complicated because Microsoft hired one of the top Apollo guys to do MSRPC for Windows NT, so there is also “GUID” which differs from UUID in the layout of the fields and is not mixed-endian despite what a lot of sources will tell you. In addition to ephemeral RPC message-identifying GUIDs Microsoft are also in love with canned GUIDs for identifying COM classes, media codecs, and almost anything else that would ever need a well-known identifier. See https://gix.github.io/media-types/ for example.

Apologies for linking my own repo twice in the same comment section but I started (and need to get back to) compiling the history of all this in the README of my UUID library. Apollo started in 1980 and the Leach/Salz UUID RFC draft didn't happen until 1998 so there is a huge amount unsaid by the modern standards: https://github.com/okeeblow/DistorteD/blob/NEW%E2%80%85SENSA...

amarcheschi · on Aug 26, 2024

Thank you

efilife · on Aug 25, 2024

Uuid 4 is just a random bytes generator that inserts hyphens in specified places. You don't need to use it, you can just generate random bytes yourself and save on space (unnecessary hyphens, version info and so on)

Lammy · on Aug 25, 2024

UUIDs are 128-bit numbers and the hyphenated-string representation is only one of many ways to represent that number, sort of like how an IPv4 address is a 32-bit number of which the “dotted-quad” is only one representation. If you are thinking of UUID as a string format then your most fundamental concept of UUID is flawed.
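Python's standard `uuid` module makes the point nicely; every line below is the same 128-bit number in a different costume:

```python
import uuid

u = uuid.UUID("483971cf-aad7-4c84-abf1-4a94c9d72f99")

forms = {
    "str": str(u),      # hyphenated text form
    "hex": u.hex,       # 32 hex digits, no hyphens
    "int": u.int,       # the 128-bit integer itself
    "bytes": u.bytes,   # 16 raw bytes, big-endian
    "urn": u.urn,       # urn:uuid:... form
}
for name, value in forms.items():
    print(f"{name:>5}: {value!r}")

# Every representation parses back to the same value.
assert uuid.UUID(int=u.int) == uuid.UUID(hex=u.hex) == uuid.UUID(bytes=u.bytes) == u
```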

Even if you do just want a random identifier (not really the original point of UUID but has become their most popular form) I still think it's cool how random UUIDs have a little flag bit to tell you that it's intended to be random. Useful when one runs across a lone identifier with zero context.

wongarsu · on Aug 25, 2024

UUID 4 also sets 4 bits to fixed values to indicate it's version 4. You can argue whether creating different namespaces between the different methods of creating UUIDs is useful. But your plain random number generator has only a 1/16 chance of generating a valid UUIDv4. (setting the bits correctly is however trivial if you do want to roll your own uuid generator)

badindentation · on Aug 25, 2024

From reading Wikipedia it seems it's less than that, since the 4th group also uses bits to set the variant.

e.g.

XXXXXXXX-XXXX-4XXX-[89AB]XXX-XXXXXXXXXXXX

From looking at all the ones in my system.
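Putting the two together: 4 version bits plus 2 variant bits are fixed, so 16 random bytes happen to be a valid v4 only 1 time in 64 (1/16 × 1/4). A small sketch of setting the bits yourself, with helper names of my own invention:

```python
import os
import uuid

def random_bytes_to_uuid4(raw: bytes) -> uuid.UUID:
    """Force the version nibble to 4 and the variant bits to 10, keep the rest."""
    b = bytearray(raw)
    b[6] = (b[6] & 0x0F) | 0x40   # high nibble of byte 6 is the version
    b[8] = (b[8] & 0x3F) | 0x80   # top two bits of byte 8 are the variant
    return uuid.UUID(bytes=bytes(b))

def looks_like_uuid4(raw: bytes) -> bool:
    """True if 16 arbitrary bytes already carry the v4 version and variant bits."""
    return (raw[6] >> 4) == 0x4 and (raw[8] >> 6) == 0b10

u = random_bytes_to_uuid4(os.urandom(16))
print(u, u.version)

# Empirically, roughly 1 in 64 random 16-byte strings passes the check.
hits = sum(looks_like_uuid4(os.urandom(16)) for _ in range(64_000))
print(hits, "of 64000")
```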

edflsafoiewq · on Aug 26, 2024

Each method is designed so two IDs both generated with that method don't collide with each other. Reserving bits for the method avoids the problem of having to also ensure that for all pair-wise combinations of methods, two IDs generated with those methods don't collide. This is also why there is only a finite set of universally agreed upon methods.

sweca · on Aug 25, 2024

True but the appeal for most developers is it's simple to implement. Virtually every language has a UUID library that works in one line of code.

Like in Go, it's just uuid.New().String() vs using crypto/rand to read random data, convert it into Base64 or hex... which will take more lines and effort.

lopkeny12ko · on Aug 25, 2024

This is an unfair argument. Which standard library in Go gives you uuid.New().String()? Anyone can publish a third party library that condenses reading random data, creating an identifier from it, and rendering it as a string into a single line of code API.

sweca · on Aug 25, 2024

No, it's not a standard library, but my point is UUID is standard and thus present in almost every language with a one line abstraction.

bigiain · on Aug 26, 2024

> Anyone can publish a third party library

I propose leftpad.js

Vecr · on Aug 25, 2024

I suggest not using any of the MAC based versions. In theory that could be anything other than v4 and v7, but v1 is the worst. Avoid v3 as well, since MD5 is horribly broken.

tashbarg · on Aug 25, 2024

MD5 is “broken” as a cryptographic hash function. It still is perfectly fine as a non-cryptographic hash function.

Vecr · on Aug 25, 2024

Not really, it's slower than truncated blake3 for no gain and much loss.

bigiain · on Aug 26, 2024

There's some gain to be had in that I can reliably expect md5 to be available and compatible with pretty much anything back as far as Perl4 or PHP from the 90s, right up to bleeding edge versions of Rust or Clojure or the exotic language du jour.

Whether that's actually worth anything for a particular use case is a good question, and the answer will mostly be "not just no but HELL NO!"

zerodensity · on Aug 25, 2024

But is it slower than sha1? Which is the alternative if you don't roll your own in UUIDv8.

Vecr · on Aug 25, 2024

About six times faster compared to sha1. Depends on the hardware/cache environment.

slaymaker1907 · on Aug 25, 2024

Yeah, if you really need non-guessability, you should be using the version that’s completely random anyways.

ozim · on Aug 25, 2024

If you rely on non-guessability you use it as a security measure? So your sentence doesn’t invalidate previous poster.

motohagiography · on Aug 25, 2024

While I didn't know the details of ones other than 4, the one really useful one missing would be using some SHA256 data with a counter, not unlike PBKDF2. It could be a privacy preserving derived identifier, where you could loosely prove a given UUID had been derived from a given seed.

dchest · on Aug 26, 2024

If you really need it, consider UUIDv4 as encoding, and use your cryptographic algorithm to create a 122-bit output and encode it as a UUID.

Otherwise, you'd want longer outputs.

tacitusarc · on Aug 25, 2024

Like V3 but with a specifiable hash algorithm.

kyzer-davis · on Sept 4, 2024

The appendix details how one can do a sha256 style v8 algo that is basically v5 with the algorithm swapped. In theory one could do this for any hashing algorithm. https://www.rfc-editor.org/rfc/rfc9562.html#name-example-of-...
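For the curious, a minimal Python sketch of that idea, assuming the same construction as v5 (hash the namespace bytes followed by the name, keep the leftmost 128 bits, then stamp version and variant). It has not been checked against the RFC's test vector, and `uuid8_sha256` is a hypothetical name.

```python
import hashlib
import uuid

def uuid8_sha256(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Name-based UUIDv8 sketch: SHA-256(namespace || name), truncated to 128 bits."""
    digest = hashlib.sha256(namespace.bytes + name.encode("utf-8")).digest()
    b = bytearray(digest[:16])        # truncate the 256-bit hash
    b[6] = (b[6] & 0x0F) | 0x80       # version nibble = 8
    b[8] = (b[8] & 0x3F) | 0x80       # variant bits = 10
    return uuid.UUID(bytes=bytes(b))

print(uuid8_sha256(uuid.NAMESPACE_DNS, "www.example.com"))
```

Only 122 bits of the hash survive, so this buys a modern hash function but not SHA-256's full collision resistance.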

Though I am working on a way to solve this problem with UUIDs beyond 128 bits so we don't have to truncate the hash.

MrDarcy · on Aug 25, 2024

Just use v7.

Cue the security experts who say otherwise…

wongarsu · on Aug 25, 2024

Use v4 if creation date could conceivably be sensitive information or if you depend on your uuids being completely unguessable. Otherwise use v7

stavros · on Aug 25, 2024

If we want a v7, shouldn't we use a ULID instead?

wongarsu · on Aug 25, 2024

When we didn't have UUIDv7, ULID was great. But now that we have v7 it's the more widely supported alternative. And apart from v7 setting the UUID version bits and having a different default representation they are not that different.

eropple · on Aug 25, 2024

ULID's presentation format is probably better for humans, though. You can double-click-to-highlight a ULID; the standard UUID representation doesn't like this.

(You can use ULID's presentational tools with UUIDv7, though.)

beart · on Aug 25, 2024

It doesn't help in other tools, but there is a css rule to help with this.

https://developer.mozilla.org/en-US/docs/Web/CSS/user-select

notpushkin · on Aug 26, 2024

> You can use ULID's presentational tools with UUIDv7, though.

I did that, works pretty good: https://codeberg.org/prettyid/python

Some more context in a sibling thread: https://news.ycombinator.com/item?id=41355218

Lammy · on Aug 25, 2024

> You can double-click-to-highlight a ULID; the standard UUID representation doesn't like this.

You can control this behavior in CSS with `user-select`. Peep my fiddle: https://jsfiddle.net/gLyph5km/

eropple · on Aug 25, 2024

Yup, in a browser you can. In my terminal or my text editor or Slack, I can't.

Lammy · on Aug 25, 2024

You're right that there isn't a good standard way to configure this, but a lot of terminals should be able to do it. The magic phrase is “word characters”. I know Konsole, gnome-terminal, iTerm2, Terminal-dot-app at least have this setting.

eropple · on Aug 25, 2024

Yep, I know about that (and WORDCHARS in zsh, for keyboard navigation). The thing is, though, UUIDs overload the dash character. I don't want dashes to be word characters except for a UUID.

(ULID representations also are shorter because they use a wider character set, which is nice though not critical.)

wtetzner · on Aug 25, 2024

> the standard UUID representation doesn't like this.

Yeah, I've gotten in the habit of stripping hyphens from the string representation of UUIDs in a lot of the code I write for that reason.

NetOpWibby · on Aug 25, 2024

Yeah I’m sticking with ULID and Sqids for my ID/slug purposes.

stavros · on Aug 25, 2024

Oh, I didn't realize v7 was newer than ULID, thanks.

6512398 · on Aug 26, 2024

Be careful if you're using python.

The python uuid standard library doesn't have V7 yet, and there is a package called uuid7 which is unmaintained, and not in compliance with the latest standard. That's using nanosecond time precision rather than millisecond, which means the leading bits are larger than they are meant to be.

If you use that unmaintained uuid7 package and later change to the correct implementation your uuid7 will go backwards, which is a breaking change considering that monotonicity is a key property of uuidv7.

bruce511 · on Aug 26, 2024

I believe your concern, while valid, only affects systems which are currently adding records faster than 1 per nano-second on occasion.

Also while technically true - it could technically break monotonicity (records added in the same nanosecond could be out of order) they'll still be all "near the end of the file, likely in the same page" such that performance implications are negligible.

As a general rule I would avoid any program making any assumptions about a uuid. Programs should treat it as an opaque binary random value. Doing so avoids any future incompatibilities.

fiddlerwoaroof · on Aug 25, 2024

An issue I've always had with UUIDs and ULIDs is there isn't a great way to generate one deterministically, as far as I can tell: for a lot of use-cases, being able to reprocess data and generate identical IDs is really useful and there isn't a standard way that I know of to achieve this.

mmiyer · on Aug 25, 2024

That's UUID v5 (uses a sha1 hash of input data).
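In Python that is `uuid.uuid5` from the standard library. A small sketch for the reprocessing case; the namespace name and the `source:sequence` key format are assumptions for the example, not anything standardized:

```python
import uuid

# Hypothetical application namespace, itself derived deterministically.
ORDERS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "orders.example.com")

def order_id(source: str, sequence: int) -> uuid.UUID:
    """Same inputs always give the same UUID, on any machine, on any replay."""
    return uuid.uuid5(ORDERS_NAMESPACE, f"{source}:{sequence}")

first_run = order_id("kafka-topic-a", 42)
replay = order_id("kafka-topic-a", 42)
print(first_run, first_run == replay)
```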

VWWHFSfQ · on Aug 25, 2024

Are you looking for something other than just a custom seed in the RNG?

exe34 · on Aug 25, 2024

https://stackoverflow.com/a/64229385

fiddlerwoaroof · on Aug 25, 2024

Sure, there are workarounds in various languages, but it would nice to have a standardized hash-based UUID or ULID

IggleSniggle · on Aug 25, 2024

If it's a standardized sequence, then that's no different than just 0, 1, 2, 3 but with different names. If you just want a non-sequential but deterministic sequence, then that's every random number generator that accepts a seed value, and being any more standardized than that makes zero sense.

fiddlerwoaroof · on Aug 26, 2024

The problem with autoincrement in this context is you can’t reproduce the right value when replaying the input streams for your stream processing job. Hashing some combination of values and using that as a primary key solves this problem nicely and, when you’re using bitemporal data modeling, makes it easy to correct mistakes. The point of standardization is compatibility, not standardizing the sequence of keys used.

IggleSniggle · on Aug 26, 2024

I agree on all points you're making, but you can't standardize on hashing when the data being hashed will vary due to business reasons. I just can't see any way that this can be realistically standardized outside of a single business, maybe even business-unit depending on the kind of company.

Perhaps you mean something like "standardized hash of all columnar data for the table row," but then you're just reinventing elasticsearch/lucene, with all its pros and cons. The power of foreign keys for an RDBMS is that they are pointers, and as pointers, the mutability of their underlying data is what makes them powerful. I think I get what you're asking for, but I also think there can be no possible standard that is reasonable unless you have the technology to take a total snapshot of the universe, at which point, why not just measure the universe itself as your database? Perfect storage system.

1986 · on Aug 25, 2024

from the article, it sounds like this is V5?

fiddlerwoaroof · on Aug 25, 2024

I missed that because I typically am using ULIDs these days. But, yeah, some standardized format for a hash of message data is what I want.

1986 · on Aug 25, 2024

why wouldn't you use some sort of collision resistant hashing function on the data to achieve this instead?

zerodensity · on Aug 25, 2024

Some systems expect UUIDs so you don't always have that choice.

voidfunc · on Aug 26, 2024

v5... I use them all the time.

treve · on Aug 25, 2024

Why are you dismissive of security-related issues?

tonetegeatinst · on Aug 25, 2024

Because developers don't always consider the security aspect. Not saying this is what he's doing but could also just be due to how complex good software can be to write.

There is a reason cybersecurity or UI/UX or product design isn't always left to the developer. The coder writes code that fits certain criteria they are given, then someone down the line might QA check it, fuzz inputs, or security review the code. How well this is done depends on the product, market, and environment.

MrDarcy · on Aug 26, 2024

I’m not dismissive of security related issues. My comment was a snarky reference to some other HN comments on uuid7 some time ago which I’m having trouble finding now. The comment was lamenting developers using uuid7 for many things, presumably without thought, creating loads of security issues for them.

In most cases creation time is not sensitive. Therefore, for most cases uuid7 is the best trade off currently.

sgarland · on Aug 26, 2024

Because at least half if not more of the industry sector is charlatans. Similar to how devs love to work on new shiny tech, regardless of its actual applicability to their needs, security folk will often insist that X is insecure due to absurdly unlikely edge case wherein you’re already pwned anyway, while ignoring some basic problem in their org.

elric · on Aug 26, 2024

That doesn't mean that it's not a valid concern in some cases, and it's good to be aware of it. When you're building something, and need a UUID, and stop to think "v7 seems useful, but wait, wasn't there a security thing?", that's a win. Even if the answer turns out to be "not a problem in this case".

JSDevOps · on Aug 25, 2024

Interesting read. You learn something every day.

pajeets · on Aug 25, 2024

is there something shorter than UUID

i hate how long it is

something like youtube URLs but guaranteed to be without duplicates

asperous · on Aug 25, 2024

One advantage of uuids is they can be generated on several distributed systems without having to check with each other that they are unique. Only long ids make this reliable. Youtube ids are random and short, but youtube has to check they are unique when generating them.

Maybe one way is to split up a random assignment space and assign to each distributed node, but that would be more complex.

imron · on Aug 25, 2024

And then there’s uuid5 which you can use to generate identical unique identifiers across multiple systems without having to check on each other. Very very useful to have in some circumstances.
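
A minimal sketch of that property using Python's standard `uuid` module: two systems that agree on a namespace and a name string derive the same v5 UUID independently. The namespace choice and the example name below are illustrative assumptions, not anything from the thread.

```python
import uuid

# Any two systems that agree on a namespace UUID and a name string
# compute the same v5 (SHA-1 based) UUID with no coordination.
# Assumption: one of the predefined namespaces fits the use case.
NAMESPACE = uuid.NAMESPACE_URL


def id_for(name: str) -> uuid.UUID:
    """Derive a deterministic v5 UUID for a name within NAMESPACE."""
    return uuid.uuid5(NAMESPACE, name)


# Hypothetical name; imagine each line running on a different machine.
on_system_a = id_for("https://example.com/orders/1234")
on_system_b = id_for("https://example.com/orders/1234")
print(on_system_a == on_system_b, on_system_a.version)
```

The flip side is that anyone who knows the namespace and can guess the name can compute the ID too, so this is not a way to get unguessable identifiers.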

wongarsu · on Aug 25, 2024

If you are fine with creating IDs in a centralized way (as you would do in 99% of cases anyways) you can just use a normal incrementing integer primary key. Then encrypt it with XTEA (either at your API boundary or in the database) to get non-sequential unguessable 64 bit keys. [1] has example code for postgres. If the original keys don't have duplicates then the XTEA encrypted keys don't have duplicates either.

Then just encode it in a format of your choosing. Youtube uses a modified base64 encoding (no padding, and + and / are replaced by - and _). And youtube video ids seem to also be 64 bits, just like xtea output.

1: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)
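
For anyone who would rather do this at the API boundary than in the database, here is a rough Python sketch of the same idea: XTEA over a 64-bit integer, then unpadded URL-safe base64. The key is a placeholder, the function names are made up for illustration, and the byte ordering is an assumption, so the output is not expected to match the Postgres function linked above.

```python
import base64
import struct

MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9

# Placeholder key (four 32-bit words). Use a real secret in practice.
KEY = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)


def _mix(v: int) -> int:
    # Higher bits are discarded by the & MASK32 applied by the callers.
    return ((v << 4) ^ (v >> 5)) + v


def xtea_encrypt_block(block: int, key=KEY, rounds: int = 32) -> int:
    """Encrypt one 64-bit integer with XTEA."""
    v0, v1 = (block >> 32) & MASK32, block & MASK32
    total = 0
    for _ in range(rounds):
        v0 = (v0 + (_mix(v1) ^ (total + key[total & 3]))) & MASK32
        total = (total + DELTA) & MASK32
        v1 = (v1 + (_mix(v0) ^ (total + key[(total >> 11) & 3]))) & MASK32
    return (v0 << 32) | v1


def xtea_decrypt_block(block: int, key=KEY, rounds: int = 32) -> int:
    """Invert xtea_encrypt_block by running the rounds backwards."""
    v0, v1 = (block >> 32) & MASK32, block & MASK32
    total = (DELTA * rounds) & MASK32
    for _ in range(rounds):
        v1 = (v1 - (_mix(v0) ^ (total + key[(total >> 11) & 3]))) & MASK32
        total = (total - DELTA) & MASK32
        v0 = (v0 - (_mix(v1) ^ (total + key[total & 3]))) & MASK32
    return (v0 << 32) | v1


def to_public_id(row_id: int, key=KEY) -> str:
    """Sequential integer -> 11-character URL-safe token."""
    raw = struct.pack(">Q", xtea_encrypt_block(row_id, key))
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def from_public_id(token: str, key=KEY) -> int:
    """11-character token -> original sequential integer."""
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    return xtea_decrypt_block(struct.unpack(">Q", raw)[0], key)


print(to_public_id(1), to_public_id(2))
```

Because a block cipher with a fixed key is a permutation of the 64-bit space, distinct inputs always map to distinct outputs, which is why no uniqueness check is needed. Eight bytes encode to 11 base64 characters once the padding is dropped.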

quibono · on Aug 26, 2024

Are there any edge cases or things to be aware of when using this, or is it pretty much plug&play? I'm thinking of using this in one of my projects.

morepork · on Aug 26, 2024

The risk with an incrementing integer is that if your database falls over, you can lose some IDs depending on how often you take backups. After you restore you need to make sure that whatever integer you start at is greater than the highest integer issued beforehand, which may be different to the highest integer in your restored DB. Otherwise you can have clients with the same ID.

syncsynchalt · on Aug 26, 2024

If you're not distributed, use an incrementing integer.

If you're distributed, look into vector clocks[1] or snowflake[2][3]

[1] https://en.wikipedia.org/wiki/Vector_clock

[2] https://github.com/twitter-archive/snowflake/tree/snowflake-...

[3] https://en.wikipedia.org/wiki/Snowflake_ID
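
For a feel of what a snowflake-style ID looks like, here is a simplified Python sketch of the commonly described layout: a 41-bit millisecond timestamp, a 10-bit worker ID and a 12-bit per-millisecond sequence packed into one 64-bit integer. The class name is made up, how each node gets its worker ID is assumed to be solved elsewhere, and the clock-rollback handling is a simplification rather than what any production implementation does.

```python
import threading
import time

EPOCH_MS = 1288834974657  # custom epoch; any fixed instant in the past works
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def _now_ms() -> int:
    return time.time_ns() // 1_000_000


class SnowflakeGenerator:
    """Hypothetical generator: timestamp | worker id | sequence."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = max(_now_ms(), self._last_ms)  # never step backwards
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait.
                    while now <= self._last_ms:
                        now = _now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeGenerator(worker_id=5)
print(gen.next_id(), gen.next_id())
```

Uniqueness across nodes comes entirely from each node having a distinct worker ID, so that assignment is the one piece that does need coordination.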

wtetzner · on Aug 25, 2024

Even UUIDs are not guaranteed to not have duplicates. It's just extremely unlikely, largely due to their length.

bigiain · on Aug 26, 2024

Different use cases have differing requirements for uniqueness though. A lot of stuff doesn't need "you'd have to generate 1 billion v4 UUIDs per second for 85 years to have a 50% chance of a single collision." sort of guarantee.

You might think your "Uber, but for short term giraffe rental" startup needs that sort of guarantee in the investor demo prototype, but it doesn't. Just use an auto increment in Postgres or MySQL (or an integer primary key column in SQLite). If you fool those investors into pouring the money pipe over you, your first real technical/senior engineer hire is gonna throw all the code you have running on your laptop away anyway.

Maybe someone at Meta sat down once and figured "we have 4 billion users, each of whom on average has 3 cats that they take a dozen pictures of every day, so about 150 billion cat pictures per day. So if we name them using uuids we're still good for almost 50 thousand years before we have a 50% chance of displaying a pic of Fluffy when we should have displayed a pic of Mr Whiskers". Then they promptly ignored the problem (or fixed Zack's code that was using php's hash("md2", $query['catname']) ).
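
Both figures can be sanity-checked with the usual birthday-bound approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible values. The sketch below assumes 122 random bits per v4 UUID (128 minus the version and variant bits) and a perfect random source; the function name is made up.

```python
import math

RANDOM_BITS = 122  # 128 bits minus 4 version bits and 2 variant bits


def ids_for_collision_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Approximate number of random IDs needed before a collision has probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


n = ids_for_collision_probability(0.5)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_at_1e9_per_second = n / 1e9 / SECONDS_PER_YEAR
years_at_150e9_per_day = n / 150e9 / 365.25

print(f"{n:.3e} IDs for a 50% collision chance")
print(f"{years_at_1e9_per_second:.0f} years at 1 billion per second")
print(f"{years_at_150e9_per_day:.0f} years at 150 billion per day")
```

Under those assumptions the numbers come out at roughly 86 years and a bit under 50,000 years, so both back-of-envelope claims hold up.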

wtetzner · on Aug 26, 2024

It's not just a scale thing though. It also matters how problematic a collision will be.

But yes, if you're just using them as primary keys in a database you're probably fine with auto increment for most use cases.

pclmulqdq · on Aug 26, 2024

An atomic counter of some sort solves the problem of UUIDs with the cost of a synchronization step (although this can sort of be minimized). UUID and its variants are long to avoid duplicates without having to synchronize.
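
One common way to minimize that synchronization step is to have each worker lease a whole block of IDs from the shared counter and then hand them out locally. A toy in-process sketch follows; `CentralCounter` is a stand-in for whatever really holds the counter (a database sequence, for example), and both class names are invented for illustration.

```python
import threading


class CentralCounter:
    """Stand-in for the shared counter; only this class needs synchronization."""

    def __init__(self, start: int = 1):
        self._next = start
        self._lock = threading.Lock()

    def lease(self, size: int) -> range:
        """Atomically reserve `size` consecutive IDs and return them as a range."""
        with self._lock:
            first = self._next
            self._next += size
            return range(first, first + size)


class BlockAllocator:
    """Per-worker allocator that contacts the central counter once per block."""

    def __init__(self, central: CentralCounter, block_size: int = 1000):
        self._central = central
        self._block_size = block_size
        self._block = iter(())  # empty until the first lease
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            try:
                return next(self._block)
            except StopIteration:
                self._block = iter(self._central.lease(self._block_size))
                return next(self._block)


central = CentralCounter()
worker_a = BlockAllocator(central, block_size=3)
worker_b = BlockAllocator(central, block_size=3)
print([worker_a.next_id(), worker_b.next_id(), worker_a.next_id()])
```

The trade-off is that IDs are no longer globally ordered by creation time, and a worker that dies takes the unused remainder of its block with it, leaving gaps.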

jagrsw · on Aug 25, 2024

Imagine how many careers have been built on inventing and promoting something that, in the end, turned out to be a cleverly encoded output from /dev/urandom.

dchest · on Aug 26, 2024

Indeed. Also, add many other unique identifier formats (ULID, nanoid, bson ids ...) — in contrast to compatibility complaints here — and you'll see that people just love tinkering with trivial things (me too).

What was important during the times when we didn't know how to generate random numbers on computers, perhaps shouldn't be as important today?

zerodensity · on Aug 25, 2024

Imagine a world where every framework / API / database had its own incompatible UUID format. Without a standard specification that's where we would end up. Do you want to live in such a world?

bruce511 · on Aug 26, 2024

If you treat the uuid as an opaque binary random value (which is how programs -should- treat it) then variances between versions, or custom versions, have no effect.

As long as they have sufficient randomness etc, from a program perspective they are unique IDs.

There are already multiple versions in active use (4, 7 and arguably 8) so you really shouldn't be using the uuid as anything but a long-random-value.

Yes, the database engine may appreciate one version over another for performance reasons, but that's irrelevant to most developers and programs.
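
As a small illustration of the "opaque value" approach, the Python sketch below normalizes any textual UUID to its 16 raw bytes and uses those as the key, never branching on the version. The `as_key` helper and the v8-style example value are hypothetical.

```python
import uuid


def as_key(value: str) -> bytes:
    """Normalize a textual UUID to its 16 raw bytes for storage or comparison."""
    return uuid.UUID(value).bytes


v4_text = str(uuid.uuid4())
custom_text = "0190F3A2-7C1E-8ABC-9DEF-0123456789AB"  # hypothetical v8-style value

# The program only cares that keys are equal or not equal.
index = {as_key(v4_text): "row A", as_key(custom_text): "row B"}

print(index[as_key(custom_text.lower())])  # case differences do not matter
print(uuid.UUID(bytes=as_key(custom_text)).version)  # available, but not needed
```

Comparing the raw bytes also sidesteps formatting differences such as upper versus lower case hex.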

jagrsw · on Aug 26, 2024

Forget universally compatible UID formats. Frameworks, APIs, and databases only need consistency within their own ecosystem.

Want visually recognizable unique identifiers?

  JAGRSW-UID-<192bit-input-from-urandom-encoded-in-base64>

Need to shave off some bytes?

  JUID-<192bit-input-from-urandom-encoded-in-base64>

Same byte size as UUIDs, arguably more "secure." Can I become an ACM Fellow for solving this problem now?

Seriously, these UUID debates are about as sensible as arguing over XML.

benreesman · on Aug 26, 2024

TLDR: UUIDv4 is one of several reasonable choices, but if it ends up mattering: you’re rich and famous and can do arbitrary fixes.

Not getting pwned by incrementing href attacks is good, past that, web scale bro.