> Only no known details if the only document you're reading is the notoriously poorly-specified RFC
To be fair, in RFC9562 I did cite two documents that are UUID Version 2 specifications. But RFC4122 was too cryptic for my taste.
As for the historical UUID types specified by the 0-7 Variant space:
We are starting work on an informational RFC that will help folks understand those.
See https://github.com/yocto/draft-yocto-uuid if you want to add to some discussions and/or review text. We are still a bit early into the stages but I hope to have some progress soon.
> UUID Version 2 (v2) is reserved for security IDs with no known details.
I found the details in about 2 minutes: Click the link in the article to take me to the section of RFC 9562 that says it's defined as part of DCE, click the first link in that paragraph to go to the spec, ctrl-f "UUID", then jump to appendix A (deceptively named "Universal Unique Identifier") which has all the details.
Is it really too much to ask to CLICK YOUR OWN LINKS?
I mean, there’s two things that one might learn about. One is reading a specification for what a v2 uuid is. The other is learning information about the circumstances of a UUID’s generation from the uuid. Obviously the sentence was about the latter, as could be determined from the context as the other versions are described as being generated from specific things.
I agree the sentence is a bit unclear, but I don’t think it’s misleading or whatever.
I wish there were a standard for short UUIDs, like `73WakrfVbNJBaAmhQtEeDv` or `bK7nP9xM`. I mean, it wouldn't be a UUID because it could be duplicated somewhere; I just want a standard ID that combines randomness with being short enough to remember.
The closest that comes to mind is ULID[0]. It is short (26-character base32), 128 bits, and lexicographically sortable.
I think the reason there's no other popular standard is you give up something. 128 bit gives a pretty low risk of collisions in almost all uses, but as you go smaller you start having to consider the specific scenario and impact, etc, which doesn't work well for a standard.
You could use another encoding (eg base64 or base85) to get it shorter, but you start sacrificing other things (case sensitivity, url-safeness) - again, not great for a standard.
For my own project, I went with base32-encoded UUIDv7, prefixed with the type name (a la Stripe IDs). Compared to ULID, it's still lexicographically sortable, but is backed by an actual standard, so a bit more sound IMO.
UUIDv7 will only work until year 4147 (compared with ULID's 10889AD), but by then I think we'll have another UUID version we can switch to.
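A minimal Python sketch of the prefixed, base32-encoded UUIDv7 idea, under some loud assumptions: the comment does not say which base32 alphabet is used, so this picks Crockford's (lowercased) because it stays in ASCII order; the UUIDv7 is assembled by hand from the RFC 9562 field layout rather than taken from a library; and `uuid7_sketch`, `prefixed_id`, and `parse_prefixed_id` are hypothetical names.

```python
import secrets
import time
import uuid

# Crockford base32 alphabet, lowercased (an assumption; it sorts in ASCII order).
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"


def uuid7_sketch():
    """Assemble a version 7 UUID by hand from the RFC 9562 field layout."""
    ts_ms = time.time_ns() // 1_000_000      # Unix time in milliseconds
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80   # 48-bit timestamp in the top bits
    value |= 0x7 << 76                       # version nibble
    value |= secrets.randbits(12) << 64      # rand_a
    value |= 0b10 << 62                      # variant bits
    value |= secrets.randbits(62)            # rand_b
    return uuid.UUID(int=value)


def prefixed_id(type_name, u):
    """Encode the 128-bit value as 26 fixed-width base32 characters behind a type prefix."""
    n = u.int
    chars = []
    for _ in range(26):                      # 26 * 5 = 130 bits; the top 2 are always zero
        chars.append(ALPHABET[n & 31])
        n >>= 5
    return type_name + "_" + "".join(reversed(chars))


def parse_prefixed_id(text):
    """Split off the prefix and decode the base32 body back into a UUID."""
    type_name, body = text.rsplit("_", 1)
    n = 0
    for ch in body:
        n = n * 32 + ALPHABET.index(ch)
    return type_name, uuid.UUID(int=n)
```

Because the encoding is fixed-width and the alphabet is in ASCII order, IDs that share a prefix sort as strings in the same order as the underlying UUIDs.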
Just today I was wanting shorter UUIDs, so if you'd like a more compact UUID, you can convert it to URL-safe base64 like so.
It's the same UUID, just in 22-character form, and it can be converted back. It's not really a conversion, because a UUID is just a 128-bit value, so it's an alternative representation.
Thinking about it, the above is wrong because it replaces - and _ with A and B, which makes it not reversible. This is reversible. You could also come up with a solution that has no - or _ by doing a custom base64 encode with only A-Z, a-z and 0 to 9.
If you really want to not have _ or - in your short form UUIDs you could just discard the UUID when you create it if the short form includes those characters and try again.
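The snippets this comment refers to did not survive in the thread, so here is a minimal Python sketch of the reversible variant using only the standard library. The function names `uuid_to_short` and `short_to_uuid` are hypothetical.

```python
import base64
import uuid


def uuid_to_short(u):
    """Render a UUID as 22 URL-safe base64 characters."""
    # 16 bytes always encode to 24 characters ending in "==", so the padding can be dropped.
    return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")


def short_to_uuid(text):
    """Restore the padding and decode back to the same 128-bit value."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))
```

The short form may contain `-` and `_`, since those are the two extra symbols in the URL-safe alphabet. Nothing is substituted, which is what keeps it reversible.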
Sqids[1] might fit the bill for you - the IDs it produces are much shorter than UUIDs, however they're not universally unique - they're generated from an integer sequence.
A feature of a Sqid library I've used is that it can pad the value out to a minimum set of characters, so even an internal id of 1 can look substantial.
I usually generate N bits of randomness and base58 encode it. Choose N to your liking. You lose the benefits of monotonic sorting that is present in some UUID versions. Base58 is url safe and does not contain any special characters. And you can still store values as binary (eg. bytea in Postgres instead of a text column).
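A rough sketch of that approach in Python, assuming the Bitcoin-style base58 alphabet (the comment does not name one) and hand-rolling the codec, since the standard library has no base58. `random_id` and the default of 64 bits are hypothetical choices.

```python
import secrets

# Bitcoin-style base58 alphabet (an assumption): no 0, O, I or l.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data):
    """Encode bytes as base58, keeping leading zero bytes as leading '1' characters."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = BASE58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def base58_decode(text):
    """Inverse of base58_encode, so the ID can still be stored as raw bytes."""
    n = 0
    for ch in text:
        n = n * 58 + BASE58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    return b"\x00" * pad + n.to_bytes((n.bit_length() + 7) // 8, "big")


def random_id(n_bits=64):
    """Generate n_bits of randomness (rounded down to whole bytes) as a base58 string."""
    return base58_encode(secrets.token_bytes(n_bits // 8))
```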
IMO we need to be clear on the distinction between (A) the UUID bit-generation scheme versus (B) the way it is encoded for human use/reading/transcription.
They are mostly-separate problems.
For example, you could have a very secure mathematical scheme, but it gets ruined by a horrible representation where each bit is written as either a capital-I, a lowercase-l, or the number 1.
Conversely, you could have a deeply insecure scheme that uses a nice compact serialization where everything is grouped into chunks and "1Il" confusion is not possible and there's a check-digit, etc.
I've used schemes like "concatenate a shared secret, millisecond resolution times, local autoinc ID, and some sort of distributed machine identifier (like ip address or MAC address), then taken a truncated hash of that with as many bits as needed for the desired uniqueness guarantees."
I wouldn't use it for assigning bank account numbers, but for most web or app stuff it's fine.
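For illustration, a minimal sketch of such a scheme in Python. The choice of SHA-256, the `|` separator, the in-process counter, and the name `make_id` are all assumptions; the comment does not specify any of them.

```python
import hashlib
import itertools
import time

# Local auto-incrementing ID; a real system would persist this somewhere.
_counter = itertools.count(1)


def make_id(shared_secret, machine_id, bits=64):
    """Hash secret + millisecond time + local counter + machine ID, keep the top `bits` bits."""
    now_ms = time.time_ns() // 1_000_000
    seq = next(_counter)
    material = b"|".join([
        shared_secret,
        str(now_ms).encode("ascii"),
        str(seq).encode("ascii"),
        machine_id.encode("utf-8"),
    ])
    digest = hashlib.sha256(material).digest()
    truncated = int.from_bytes(digest, "big") >> (256 - bits)
    width = (bits + 3) // 4                  # hex digits needed for `bits` bits
    return format(truncated, "0{}x".format(width))
```

Calling `make_id(b"secret", "10.0.0.1")` twice yields different IDs even within the same millisecond, because the counter advances on every call.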
> UUIDs can also be generated on the client side (in cases where you can trust the client).
I'm fairly certain the first rule of websec is you never trust the client. I definitely would not trust a user's browser to directly insert a value into a DB.
> What if you don't know how many nodes you have?
Shouldn't matter; you have a centralized system that hands out chunks of IDs on-demand (and has its own mechanism to ensure no repeats). This is similar to what Vitess [0] does.
> I'm fairly certain the first rule of websec is you never trust the client.
Not every piece of information is confidential in every system. Sometimes a UUID is just that, a UUID.
> you have a centralized system that hands out chunks of IDs on-demand
I don't follow. If your system requires a central node that can reliably generate unique auto-incrementing integer IDs, why bother with UUIDs at all? Just base-64 encode the integer ID, or hash it with a salt to protect against enumeration attacks, if you want.
If you don't want the dependency on a centralised system, just use UUIDv7, which is just a timestamp plus random bits, or implement a shorter version of it. There is no need to overengineer.
> I don't follow. If your system requires a central node that can reliably generate unique auto-incrementing integer IDs, why bother with UUIDs at all?
I also don’t follow. I thought your initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for UUIDs.
Monotonic ints, or more broadly anything k-sortable, are generally optimal for RDBMS indices due to most indices being B+trees. That’s why there’s such enormous effort towards NOT using UUIDv4.
> just use UUIDv7
Indeed; this is my recommendation when devs insist they can’t possibly use integers. Personally, I maintain that most places can use ints, it’s just that they’ve hideously over-complicated things to the point that it would be far too much work.
> your initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for UUIDs.
You suggested timestamp+autoinc, and my initial assertion was that auto-incrementing integer IDs weren’t always possible, thus the need for the random part after the timestamp (a la UUIDv7). I see that we have actually been on the same page.
When the competition is a random 128 bit number, you can assume you've got a 32 bit node count (4.3 billion nodes), a 45 bit millisecond count (1100 years), and you've still got 51 bits letting each node generate 2 quadrillion IDs per millisecond.
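The arithmetic behind that bit budget can be checked in a few lines of Python. The variable names are hypothetical, and a year is taken as 365.25 days.

```python
# Split a 128-bit ID into node, time, and per-node sequence fields.
NODE_BITS = 32
TIME_BITS = 45
SEQ_BITS = 128 - NODE_BITS - TIME_BITS       # whatever is left over

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

nodes = 2 ** NODE_BITS                       # distinct node identifiers
years = 2 ** TIME_BITS / MS_PER_YEAR         # range of the millisecond counter
ids_per_ms = 2 ** SEQ_BITS                   # IDs each node can mint per millisecond

print("sequence bits:", SEQ_BITS)
print("nodes: {:.2e}".format(nodes))
print("years: {:.0f}".format(years))
print("ids per ms per node: {:.2e}".format(ids_per_ms))
```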
The real benefit of UUIDs is the 'consistency' of the one-size-fits-most approach. If you can do without IDs humans can read out, or readable plain text logs, or compressibility, or recognisable formats for different types of ID? Then UUIDs can be used for anything from customer orders to web requests to log lines.
UUID v7's timestamp is a game-changer for Databend. We're using it to quickly locate metadata files on AWS S3 by timestamp, making operations like vacuuming much faster.
Interesting. I don’t think I’d ever really consider using uuids to look something up by time (I’d probably want an actual timestamp field instead). To me, the big advantage of the uuids being sorted by time is that you get better locality. Inserting new entries into an index is probably a mostly-append operation, which could be cheaper than a random insert (but may have more contention too; worth considering a hybrid with some random bits before the timestamp to have more sorted ‘shards’). And in many places reads tend to be for more recent data, so it can be nice to have the more recent data in one place (and e.g. pulled into more caches).
It’s also great for key-value stores like DynamoDB. Much cleaner than compound keys prefixed with a timestamp or iso date (or worse: wasting a secondary index on timestamps).
I'm failing to understand the purpose of uuid2. I didn't even know that more types existed until now. I had only encountered uuid2 when asking xandr to remove my personal data from its database. (There is a discussion about noyb asking for xandr to be investigated in Europe here: https://news.ycombinator.com/item?id=40913915)
Reading the Wikipedia page, I'm failing to understand why we invented something called a universally unique identifier and then have different types of it, some of which can be traced back to the PC that generated them. Is it because mixing in MAC addresses increases the chance of the uuid2 being random, or is there a different reason? For privacy reasons, could we not just have a very long identifier with many different characters to choose from, so that there are so many combinations that we're almost guaranteed to be using non-duplicated uuids?
Their original purpose was to identify messages in Apollo's distributed computing architecture. UID and later UUIDs were a reversible way to mark an intersection point between two dimensions. Any two machines would generate the same UID/UUID for the same two inputs, and a recipient of an identified message could reverse the identifier back into the original components. They were designed as labels for ephemeral messages so the two dimensions were time and hardware ID (originally Apollo serial number, later Ethernet hwaddress etc).
I think a lot of the confusion can be traced to the very earliest AEGIS implementation where the Apollo engineers started using “canned” (their term, i.e. static or well-known) UIDs to identify filesystems. Over time the popular usage of UUID fully shifted from ephemeral identifiers where duplicates were intentional toward canned identifiers where duplicates were unwanted and the two dimensions were random-and-also-random.
The history gets even more complicated because Microsoft hired one of the top Apollo guys to do MSRPC for Windows NT, so there is also “GUID” which differs from UUID in the layout of the fields and is not mixed-endian despite what a lot of sources will tell you. In addition to ephemeral RPC message-identifying GUIDs Microsoft are also in love with canned GUIDs for identifying COM classes, media codecs, and almost anything else that would ever need a well-known identifier. See https://gix.github.io/media-types/ for example.
Apologies for linking my own repo twice in the same comment section but I started (and need to get back to) compiling the history of all this in the README of my UUID library. Apollo started in 1980 and the Leach/Salz UUID RFC draft didn't happen until 1998 so there is a huge amount unsaid by the modern standards: https://github.com/okeeblow/DistorteD/blob/NEW%E2%80%85SENSA...
Uuid 4 is just a random bytes generator that inserts hyphens in specified places. You don't need to use it, you can just generate random bytes yourself and save on space (unnecessary hyphens, version info and so on)
UUIDs are 128-bit numbers and the hyphenated-string representation is only one of many ways to represent that number, sort of like how an IPv4 address is a 32-bit number of which the “dotted-quad” is only one representation. If you are thinking of UUID as a string format then your most fundamental concept of UUID is flawed.
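To make the "number, not string" point concrete, here is a small Python sketch using the standard `uuid` and `ipaddress` modules. The helper name `representations` is hypothetical, and the sample value is the well-known DNS namespace UUID.

```python
import ipaddress
import uuid


def representations(u):
    """Return several equivalent renderings of the same 128-bit value."""
    return {
        "int": u.int,            # plain integer
        "hex": u.hex,            # 32 hex digits, no hyphens
        "bytes": u.bytes,        # 16 raw bytes
        "urn": u.urn,            # urn:uuid:... form
        "canonical": str(u),     # the familiar hyphenated string
    }


u = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
forms = representations(u)

# Every form round-trips to the same UUID object.
rebuilt = [
    uuid.UUID(int=forms["int"]),
    uuid.UUID(hex=forms["hex"]),
    uuid.UUID(bytes=forms["bytes"]),
    uuid.UUID(forms["urn"]),
]

# The IPv4 analogy: a dotted quad is just one rendering of a 32-bit integer.
ip_as_int = int(ipaddress.IPv4Address("192.0.2.1"))
```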
Even if you do just want a random identifier (not really the original point of UUID but has become their most popular form) I still think it's cool how random UUIDs have a little flag bit to tell you that it's intended to be random. Useful when one runs across a lone identifier with zero context.
UUID 4 also sets 4 bits to fixed values to indicate it's version 4. You can argue whether creating different namespaces between the different methods of creating UUIDs is useful. But your plain random number generator has only a 1/16 chance of generating a valid UUIDv4. (setting the bits correctly is however trivial if you do want to roll your own uuid generator)
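Rolling your own could look roughly like the Python sketch below. One detail worth flagging: besides the 4 version bits, the RFC layout also fixes the 2 variant bits, so 6 bits in total are pinned. Random bytes pass the version check alone 1 time in 16, and both checks together 1 time in 64. `random_uuid4` and `looks_like_uuid4` are hypothetical names.

```python
import secrets
import uuid


def random_uuid4():
    """Take 16 random bytes and overwrite the version and variant bits."""
    raw = bytearray(secrets.token_bytes(16))
    raw[6] = (raw[6] & 0x0F) | 0x40          # high nibble of byte 6 = version 4
    raw[8] = (raw[8] & 0x3F) | 0x80          # top two bits of byte 8 = variant 10
    return uuid.UUID(bytes=bytes(raw))


def looks_like_uuid4(raw):
    """Check only the fixed bits; the other 122 bits are free."""
    return (raw[6] >> 4) == 0x4 and (raw[8] >> 6) == 0b10
```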
Each method is designed so two IDs both generated with that method don't collide with each other. Reserving bits for the method avoids the problem of having to also ensure that for all pair-wise combinations of methods, two IDs generated with those methods don't collide. This is also why there is only a finite set of universally agreed upon methods.
True but the appeal for most developers is it's simple to implement. Virtually every language has a UUID library that works in one line of code.
Like in Go, it's just uuid.New().String() vs using crypto/rand to read random data, convert it into Base64 or hex... which will take more lines and effort.
This is an unfair argument. Which standard library in Go gives you uuid.New().String()? Anyone can publish a third party library that condenses reading random data, creating an identifier from it, and rendering it as a string into a single line of code API.
I suggest not using any of the MAC-based versions. In theory that could be anything other than v4 and v7, but v1 is the worst. The same goes for v3, since MD5 is horribly broken.
There's some gain to be had in that I can reliably expect md5 to be available and compatible with pretty much anything back as far as Perl4 or PHP from the 90s, right up to bleeding edge versions of Rust or Clojure or the exotic language du jour.
Whether that's actually worth anything for a particular use case is a good question, and the answer will mostly be "not just no but HELL NO!"
While I didn't know the details of ones other than 4, the one really useful one missing would be using some SHA256 data with a counter, not unlike PBKDF2. It could be a privacy preserving derived identifier, where you could loosely prove a given UUID had been derived from a given seed.
When we didn't have UUIDv7, ULID was great. But now that we have v7 it's the more widely supported alternative. And apart from v7 setting the UUID version bits and having a different default representation they are not that different.
ULID's presentation format is probably better for humans, though. You can double-click-to-highlight a ULID; the standard UUID representation doesn't like this.
(You can use ULID's presentational tools with UUIDv7, though.)
You're right that there isn't a good standard way to configure this, but a lot of terminals should be able to do it. The magic phrase is “word characters”. I know Konsole, gnome-terminal, iTerm2, Terminal-dot-app at least have this setting.
Yep, I know about that (and WORDCHARS in zsh, for keyboard navigation). The thing is, though, UUIDs overload the dash character. I don't want dashes to be word characters except for a UUID.
(ULID representations also are shorter because they use a wider character set, which is nice though not critical.)
The Python uuid standard library doesn't have V7 yet, and there is a package called uuid7 which is unmaintained and not in compliance with the latest standard. That package uses nanosecond time precision rather than millisecond, which means the leading bits are larger than they are meant to be.
If you use that unmaintained uuid7 package and later change to the correct implementation your uuid7 will go backwards, which is a breaking change considering that monotonicity is a key property of uuidv7.
I believe your concern, while valid, only affects systems which are currently adding records faster than 1 per nanosecond on occasion.
Also while technically true - it could technically break monotonicity (records added in the same nanosecond could be out of order) they'll still be all "near the end of the file, likely in the same page" such that performance implications are negligible.
As a general rule I would avoid any program making any assumptions about a uuid. Programs should treat it as an opaque binary random value. Doing so avoids any future incompatibilities.
An issue I've always had with UUIDs and ULIDs is there isn't a great way to generate one deterministically, as far as I can tell: for a lot of use-cases, being able to reprocess data and generate identical IDs is really useful and there isn't a standard way that I know of to achieve this.
If it's a standardized sequence, then that's no different than just 0, 1, 2, 3 but with different names. If you just want a non-sequential but deterministic sequence, then that's every random number generator that accepts a seed value, and being any more standardized than that makes zero sense.
The problem with autoincrement in this context is you can’t reproduce the right value when replaying the input streams for your stream processing job. Hashing some combination of values and using that as a primary key solves this problem nicely and, when you’re using bitemporal data modeling, makes it easy to correct mistakes. The point of standardization is compatibility, not standardizing the sequence of keys used.
I agree on all points you're making, but you can't standardize on hashing when the data being hashed will vary due to business reasons. I just can't see any way that this can be realistically standardized outside of a single business, maybe even business-unit depending on the kind of company.
Perhaps you mean something like "standardized hash of all columnar data for the table row," but then you're just reinventing elasticsearch/lucene, with all its pros and cons. The power of foreign keys for an RDBMS is that they are pointers, and as pointers, the mutability of their underlying data is what makes them powerful. I think I get what you're asking for, but I also think there can be no possible standard that is reasonable unless you have the technology to take a total snapshot of the universe, at which point, why not just measure the universe itself as your database? Perfect storage system.
Because developers don't always consider the security aspect. Not saying this is what he's doing but could also just be due to how complex good software can be to write.
There is a reason cybersecurity, UI/UX, or product design isn't always left to the developer. The coder writes code that fits the criteria they are given, then someone down the line might QA check it, fuzz inputs, or security review the code. How well this is done depends on the product, market, and environment.
I’m not dismissive of security related issues. My comment was a snarky reference to some other HN comments on uuid7 some time ago which I’m having trouble finding now. The comment was lamenting developers using uuid7 for many things, presumably without thought, creating loads of security issues for them.
In most cases creation time is not sensitive. Therefore, for most cases uuid7 is the best trade off currently.
Because at least half if not more of the industry sector is charlatans. Similar to how devs love to work on new shiny tech, regardless of its actual applicability to their needs, security folk will often insist that X is insecure due to absurdly unlikely edge case wherein you’re already pwned anyway, while ignoring some basic problem in their org.
That doesn't mean that it's not a valid concern in some cases, and it's good to be aware of it. When you're building something, and need a UUID, and stop to think "v7 seems useful, but wait, wasn't there a security thing?", that's a win. Even if the answer turns out to be "not a problem in this case".
One advantage of uuids is they can be generated on several distributed systems without having to check with each other that they are unique. Only long ids make this reliable. Youtube ids are random and short, but youtube has to check they are unique when generating them.
Maybe one way is to split up a random assignment space and assign to each distributed node, but that would be more complex.
And then there’s uuid5 which you can use to generate identical unique identifiers across multiple systems without having to check on each other. Very very useful to have in some circumstances.
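As a small illustration of that property in Python, using the standard `uuid.uuid5`: the namespace derived from `example.com` and the `source/key` naming convention are hypothetical choices, and `record_id` is a hypothetical helper.

```python
import uuid

# Hypothetical application namespace, itself derived deterministically.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def record_id(source, key):
    """Same namespace and name always give the same UUID, on any machine."""
    return uuid.uuid5(NAMESPACE, "{}/{}".format(source, key))
```

Two systems that agree on the namespace and on how names are built will compute identical IDs for the same record without talking to each other, which also helps when reprocessing data.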
If you are fine with creating IDs in a centralized way (as you would do in 99% of cases anyways) you can just use a normal incrementing integer primary key. Then encrypt it with XTEA (either at your API boundary or in the database) to get non-sequential unguessable 64 bit keys. [1] has example code for postgres. If the original keys don't have duplicates then the XTEA encrypted keys don't have duplicates either.
Then just encode it in a format of your choosing. Youtube uses a modified base64 encoding (no padding, and + and / are replaced by - and _). And youtube video ids seem to also be 64 bits, just like xtea output.
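A sketch of that pipeline in Python, for illustration only. The key layout (four big-endian 32-bit words), the 32-cycle count, and the names `public_id` and `row_id` are assumptions, and the code has not been checked against the Postgres example behind [1], which is not reproduced in this thread.

```python
import base64
import struct

MASK = 0xFFFFFFFF
DELTA = 0x9E3779B9


def xtea_encrypt(block, key, rounds=32):
    """Encrypt one 64-bit integer with a 16-byte key."""
    k = struct.unpack(">4I", key)
    v0, v1 = block >> 32, block & MASK
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + k[total & 3]))) & MASK
        total = (total + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + k[(total >> 11) & 3]))) & MASK
    return (v0 << 32) | v1


def xtea_decrypt(block, key, rounds=32):
    """Run the same steps in reverse order."""
    k = struct.unpack(">4I", key)
    v0, v1 = block >> 32, block & MASK
    total = (DELTA * rounds) & MASK
    for _ in range(rounds):
        v1 = (v1 - ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + k[(total >> 11) & 3]))) & MASK
        total = (total - DELTA) & MASK
        v0 = (v0 - ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + k[total & 3]))) & MASK
    return (v0 << 32) | v1


def public_id(row, key):
    """Integer primary key -> 11-character URL-safe base64 string."""
    enc = xtea_encrypt(row, key).to_bytes(8, "big")
    return base64.urlsafe_b64encode(enc).decode("ascii").rstrip("=")


def row_id(text, key):
    """Inverse of public_id."""
    enc = base64.urlsafe_b64decode(text + "=")
    return xtea_decrypt(int.from_bytes(enc, "big"), key)
```

Since encryption under a fixed key is a permutation of the 64-bit space, distinct integer keys map to distinct public IDs, which is the no-duplicates property mentioned above.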
The risk with an incrementing integer is that if your database falls over, you can lose some IDs, depending on how often you take backups. After you restore you need to make sure that whatever integer you start at is greater than the highest integer issued beforehand, which may be different to the highest integer in your restored DB. Otherwise you can have clients with the same ID.
Different use cases have differing requirements for uniqueness though. A lot of stuff doesn't need "you'd have to generate 1 billion v4 UUIDs per second for 85 years to have a 50% chance of a single collision." sort of guarantee.
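That quoted figure can be sanity-checked with the usual birthday approximation, p ≈ 1 − exp(−n² / 2N). In this Python sketch, N = 2^122 because a v4 UUID has 122 random bits; the function names and the 365.25-day year are assumptions.

```python
import math


def collision_probability(n_ids, random_bits):
    """Birthday approximation: chance of at least one collision among n_ids random values."""
    space = 2.0 ** random_bits
    return -math.expm1(-(n_ids * n_ids) / (2.0 * space))


def ids_for_probability(p, random_bits):
    """How many random IDs it takes to reach collision probability p."""
    space = 2.0 ** random_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


SECONDS_PER_YEAR = 365.25 * 24 * 3600
n = 1e9 * 85 * SECONDS_PER_YEAR              # 1 billion per second for 85 years
p_v4 = collision_probability(n, 122)         # v4 UUIDs carry 122 random bits
n_64bit = ids_for_probability(0.5, 64)       # the same question for a 64-bit random ID

print("p(collision) for v4: {:.2f}".format(p_v4))
print("IDs for 50% at 64 bits: {:.2e}".format(n_64bit))
```

Running the same function with fewer bits gives a quick way to size a shorter ID against how many you actually expect to issue.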
You might think your "Uber, but for short term giraffe rental" startup needs that sort of guarantee in the investor demo prototype, but it doesn't. Just use an auto increment in Postgres or MySQL (or an integer primary key column in SQLite). If you fool those investors into pouring the money pipe over you, your first real technical/senior engineer hire is gonna throw all the code you have running on your laptop away anyway.
Maybe someone at Meta sat down once and figured "we have 4 billion users, each of whom on average has 3 cats that they take a dozen pictures of every day, so about 150 billion cat pictures per day. So if we name them using uuids we're still good for almost 50 thousand years before we have a 50% chance of displaying a pic of Fluffy when we should have displayed a pic of Mr Whiskers". Then they promptly ignored the problem (or fixed Zack's code that was using php's hash("md2", $query['catname']) ).
An atomic counter of some sort solves the problem of UUIDs with the cost of a synchronization step (although this can sort of be minimized). UUID and its variants are long to avoid duplicates without having to synchronize.
Imagine how many careers have been built on inventing and promoting something that, in the end, turned out to be a cleverly encoded output from /dev/urandom.
Indeed. Also, add many other unique identifier formats (ULID, nanoid, bson ids ...) — in contrast to compatibility complaints here — and you'll see that people just love tinkering with trivial things (me too).
What was important during the times when we didn't know how to generate random numbers on computers, perhaps shouldn't be as important today?
Imagine a world where every framework / API / database had its own incompatible UUID format. Without a standard specification that's where we would end up. Do you want to live in such a world?
If you treat the uuid as an opaque binary random value (which is how programs -should- treat it) then variances between versions, or custom versions, have no effect.
As long as they give sufficient randomness etc, from a program perspective they are unique IDs.
There are already multiple versions in active use (4, 7 and arguably 8) so you really shouldn't be using the uuid as anything but a long-random-value.
Yes, the database engine may appreciate one version over another for performance reasons, but that's irrelevant to most developers and programs.
Only no known details if the only document you're reading is the notoriously poorly-specified RFC. Here you go: https://pubs.opengroup.org/onlinepubs/9696989899/chap5.htm#t...
There are also “version 0” UUIDs that you are very unlikely to ever come across but should be noted because they are the source of the reserved bits (via wastefully setting aside an entire octet for Address Family) that later allowed the other “versions” to be specified in a compatible way. Read my research about them here in my UUID library: https://github.com/okeeblow/DistorteD/blob/NEW%E2%80%85SENSA...
I decided to support them Because It's Cool™ but still need to figure out how to handle the date rollover of them and the even-older Apollo UIDs:
(Proper AEGIS `#to_str` not implemented yet lol)