
Word to the wise: be very careful about adding semantics to unique ids that aren't inherent to the identity of the thing being identified.

Over time conflicts between the id's primary job (uniquely identifying something) and the extra semantics can arise, and the solutions tend to get pretty messy.

Here we have a unique id that embeds a timestamp. The classic conflict here is with privacy/security. A UUIDv7 user id tells you when the user was created. A UUIDv7 of a medical record tells you when some medical event occurred.

There are things whose identity is inherently time-based and not private, so I'm not giving a blanket recommendation to not use these. Just understand what you are signing up for.

For a database, you can use bigints for primary ids but only internally. Then you also have an external random (v4) uuid... and a timestamp if you want, for that matter -- now that it's a separate column, you can expose/hide it on a case-by-case basis, depending on need. So this gets you the benefits of a uuidv7 but maintains flexibility, though at the cost of some complexity and extra bytes/record.

Other conflicts can arise too, and they can be hard to always foresee, so generally be careful about extra semantics in unique ids.
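A minimal sketch of the three-column layout described above (internal bigint, external random UUIDv4, separate timestamp), using SQLite from Python. The `users` table, its column names, and the `create_user`/`get_user` helpers are all hypothetical, chosen only for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,   -- internal 64-bit key, never sent to clients
        public_id  TEXT NOT NULL UNIQUE,  -- random UUIDv4, safe to expose
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- exposed case by case
        name       TEXT NOT NULL
    )
    """
)


def create_user(name: str) -> str:
    """Insert a user and return only the external id."""
    public_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO users (public_id, name) VALUES (?, ?)", (public_id, name)
    )
    return public_id


def get_user(public_id: str, include_created_at: bool = False) -> dict:
    """Look up by external id; the internal key is never part of the result."""
    row = conn.execute(
        "SELECT public_id, name, created_at FROM users WHERE public_id = ?",
        (public_id,),
    ).fetchone()
    if row is None:
        raise KeyError(public_id)
    result = {"id": row[0], "name": row[1]}
    if include_created_at:
        result["created_at"] = row[2]  # only revealed when the caller opts in
    return result
```

The cost mentioned above shows up directly: one extra unique index on `public_id` and a few more bytes per row.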



It looks like you're explaining this well, but I still don't understand what you're saying.


At my company we had a competitor scrape our API for various businesses. One of the fields was a BSON ObjectId that was generated when the customer entered our system. This unique identifier encodes a timestamp of its creation.

Our competitor was able to ascertain, based on that timestamp, when each customer's contract was up and was (briefly) able to poach some customers by underbidding us until we corrected this.


Wow. I'm amazed by stories like this.

I'm so naively the "take the high road" kind of guy, that I just assume everyone (or every company) should just do the right thing. Stealing customers in this way from a competitor, I have no ability to rationalize such an action.

And this kind of makes me scared, like if I were to ever own a business, I just know I'm swimming with sharks with no ability to defend.

I believe your story, but it's just crazy to me. Go earn a customer's business in a legit way, not by stealing data from a competitor.


Giving someone an offer when their existing contract is running out isn't that shady. Lots of companies ask for that. It shouldn't be hard to rationalize!


Uhh ... that's entirely what business is about. Some shady methods maybe, but also some pretty good customer acquisition methods


Businesses do not have to act unethically. Doing so limits who you can hire and collaborate with.


Honestly I didn't even think of it when I first wrote the endpoint. Our founder asked me "Is there anything we're sending over the API that might clue someone in when someone signs up? Are we sending a createdAt field or something?" and I said "No, but we do have a timestamp in one of the IDs..." -- well, we removed the field and this behavior stopped soon after.

Anyway, the arc of the universe bends toward justice: this (former) competitor got sold for parts a few years later.


Isn't the real issue here that your competitors had the authorization to see the contracts you had with other customers?


They didn't. We had something like:

    "locationId": "53c24146ef0b601b77974fcd"

They took the first four bytes (53c24146), which is a timestamp that represents 1405239622 seconds since the Unix epoch. Our website clearly stated we work off annual contracts (a norm for our industry) - it wasn't secret information. So from this timestamp they could ballpark when a customer's contract was up.
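For anyone who wants to reproduce that, here is a short sketch of the decoding. It relies only on the ObjectId layout (the first 4 of its 12 bytes are a big-endian count of seconds since the Unix epoch); the function name `objectid_timestamp` is made up for this example.

```python
from datetime import datetime, timezone


def objectid_timestamp(object_id_hex: str) -> int:
    """Return the Unix timestamp (seconds) embedded in an ObjectId hex string."""
    if len(object_id_hex) != 24:
        raise ValueError("expected 24 hex characters")
    # First 4 bytes = first 8 hex characters, big-endian seconds since epoch.
    return int(object_id_hex[:8], 16)


seconds = objectid_timestamp("53c24146ef0b601b77974fcd")
created = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(seconds)              # 1405239622
print(created.isoformat())  # 2014-07-13T08:20:22+00:00
```

Add a year to that date and you have a rough contract renewal window, with no access to anything beyond the id itself.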


How did you find out that they did?


The event timestamps embedded in the UUID can be correlated with external event streams, or even with other events within the same dataset, to de-anonymize the context of the event associated with the UUID. This is a common class of de-anonymization attack. Anything that allows temporal correlations to be inferred potentially leaks quite a lot of info about the data underlying the unique ids.


They're saying that in certain circumstances, if your API exposes the PK publicly, it may leak information you don't want leaked (the precise datetime something occurred, in the case of UUIDv7).

If that's an issue for you, you can get around this in a variety of ways, as they mention: you could use an associative table that maps the externally-exposed random ID to an internal-only ID.


Be careful about combining two pieces of information into one column. From the above examples, you may want something to uniquely identify a record in your db and you may want something that tells you when the record was created. If you combine these two things, you then have a problem if you want to give an untrusted party that unique reference without telling them when it was created.


Don't use UUIDv7 if you want to keep the creation time of the event/id/entry private.
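To make the leak concrete, here is a rough sketch of reading the creation time back out of a UUIDv7. It assumes the RFC 9562 layout, where the top 48 bits are a Unix timestamp in milliseconds. `make_uuid7` and `uuid7_unix_ms` are hypothetical helpers written for this example; the id is built by hand so the sketch does not depend on a particular library's v7 generator.

```python
import os
import uuid
from datetime import datetime, timezone


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Build a UUIDv7-layout value by hand (illustrative helper only)."""
    rand = int.from_bytes(os.urandom(10), "big")        # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand  # timestamp in top 48 bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)        # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)        # variant bits = 10
    return uuid.UUID(int=value)


def uuid7_unix_ms(u: uuid.UUID) -> int:
    """Recover the millisecond timestamp from a UUIDv7."""
    if u.version != 7:
        raise ValueError("not a UUIDv7")
    return u.int >> 80


ms = 1_700_000_000_000  # arbitrary example timestamp
u = make_uuid7(ms)
leaked = datetime.fromtimestamp(uuid7_unix_ms(u) / 1000, tz=timezone.utc)
print(u, leaked.isoformat())
```

Anyone holding the id can run the second function; no secret or database access is involved.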


> A UUIDv7 of a medical record tells you when some medical event occurred.

It tells you when the event was documented. If the event didn't contain a date time stamp itself, I would be highly surprised, because what other value is there in documenting it?

The security problem here is inherent in the practice and your choice of primary key isn't a material factor at all.

Do you imagine there's a public CRUD database with simple Rails style accessors that can drill all the way down to individual event records inside my health information? And that, somehow the leak of a primary key in a URL might give away the fact that _something_ happened to me, medically, 12 days ago?


>And that, somehow the leak of a primary key in a URL might give away the fact that _something_ happened to me, medically, 12 days ago

Just as a side note: it may also leak the location, not just the time. And that is enough, e.g., for disproving an alibi or leaking an important commercial secret (if you are in the same location as a competitor's HQ, for example).


UUIDv1 is the only one that specifies a MAC address to be included in the UUID. All others specify timestamp data, pseudo random values, or MD5/SHA1 string hashes.

I cannot rightly apprehend the scenario you are describing. A medical provider might generate a UUIDv1 and add it to a record of mine, and this will somehow destroy my ability to have an alibi in court with respect to corporate espionage?

I'm not in a Bond movie, I just need to keep track of events and have them sort in chronological order.
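For reference, a small sketch of what a UUIDv1 carries, using Python's standard `uuid` module. The MAC address below is a made-up value passed in explicitly; by default `uuid.uuid1()` tries to use the machine's real hardware address.

```python
import uuid

mac = 0x001122334455  # hypothetical MAC address, for illustration only
v1 = uuid.uuid1(node=mac)
v4 = uuid.uuid4()

# The node field of a v1 UUID is the 48-bit value passed in above.
print(v1.version, f"{v1.node:012x}")

# v1 timestamps count 100 ns intervals since 1582-10-15; convert to Unix seconds.
unix_seconds = (v1.time - 0x01B21DD213814000) / 1e7
print(unix_seconds)

# A v4 UUID has no meaningful node or time: those bits are random.
print(v4.version)
```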


I'm just pointing out that an identifier with a timestamp can, in certain situations, leak more than just the timestamp. E.g. an identifier of a medical record from a hospital whose location is known. Just this. The exact scenarios of how it can be leaked and how this information can be used are left to your imagination. The possibility of such an attack depends on what's at stake.


For good security, you don't leak internal IDs at all, sure. It is rare to find people doing that. And random primary keys (or any key really, and even more so for clustered indexes PK) really trash a db cache and locking systems.


IDs should just be IDs. You should be able to hand out your IDs on the street corner without compromising security. If knowing IDs has a negative impact on security, you've designed your system improperly.


I couldn't agree more!


No.

While it no longer ranks in the top 10 web vulnerabilities, gaining internal insight into systems is one of the first things you do when infiltrating.

But people are messy and lazy. Nowadays, you ask for GDPR data and people give you CSVs with all their real table and column names.

Sometimes when you are just a little inside, figuring out an id is like figuring out a password (particularly with uuid as opposed to a sequence). Real nice if it leaks easily.


Again, if knowing an ID allows someone to infiltrate your systems, you've designed things poorly. IDs are not keys, and they should never be treated as such (looking at you, US Social Security numbers).


An ID is like a social security number. When you give it away on the street corner, who knows what other people will do with it.

IMHO, leaking an ID always imposes a risk. It always has a negative impact if leaked, no matter how perfect your system is.


That's hyperbolic. If your table ID poses security risks like a social security number, you've designed your system horribly wrong.


It's like one small bug that lets you move sideways so you can use a bigger and better one.


You're almost always going to have to leak some sort of ID in an API, otherwise your API is going to be exceptionally hard to work with. You could choose to have a separate external ID, but provided that knowledge of an internal ID doesn't convey any information or additional privilege, it's not that big a deal.


The only reason why people can do dangerous things with an SSN is because it's used for authorization - i.e. as a secret - not just as ID. That's broken by design, but there's no reason to repeat that design.


> For good security, you don't leak internal IDs at all, sure. It is rare to find people doing that.

We must live in a different universe. I'd wager to say that over 90% of all backends leak their primary key when speaking to the front-facing client.


I think the statement was "It is rare to find people doing good security", not "It is rare to find people leaking internal IDs"


I read it as '[…] you don't leak internal IDs at all […]. It is rare to find people doing that [i.e. not leaking internal IDs]'


Yeah, sorry, that wasn't clear. Typing on an iPhone never really gets the message across as clearly as using a PC.


99.9%


> And random primary keys (or any key really, and even more so for clustered indexes PK) really trash a db cache and locking systems.

That sounds like a problem which should be solved by making database engines not assume keys have some sane ordering, not by putting timestamps in UUIDs.


This is mostly correct but primary keys have to co-exist with the existing data infrastructure that is unlikely to be replaced for decades.

It is quite possible to do cluster-style indexing on UUIDs through disk on a single server at rates of tens of millions per second, I do it every day, just not with your typical ordered-tree architectures. Many popular database engines are not designed to make this particular scenario perform well.


I'm not sure why that'd inherently be a problem? Knowledge that some record exists about a medical event at a particular time is not too problematic, compared to _who_ it happened to


Very rare diseases may be an exception, especially in known geographic locations. That can become identifiable (and is governed by HIPAA in the US).


Usually when you want this property, it is also a benefit that your events, when ordered by primary key, come out in rough time order as well, with no need for a secondary index.
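A quick sketch of that ordering property, using SQLite from Python. The ids here use a simplified UUIDv7-like layout (48-bit big-endian millisecond timestamp followed by 80 random bits, without the version and variant bits), and the `events` table and `time_prefixed_id` helper are hypothetical.

```python
import os
import sqlite3


def time_prefixed_id(unix_ms: int) -> bytes:
    """16-byte id: 48-bit big-endian ms timestamp + 80 random bits.

    Simplified UUIDv7-like layout; version/variant bits are not set.
    """
    return unix_ms.to_bytes(6, "big") + os.urandom(10)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id BLOB PRIMARY KEY, label TEXT NOT NULL)")

# Insert out of chronological order on purpose.
for label, ms in [
    ("third", 1_700_000_003_000),
    ("first", 1_700_000_001_000),
    ("second", 1_700_000_002_000),
]:
    conn.execute("INSERT INTO events VALUES (?, ?)", (time_prefixed_id(ms), label))

# Ordering by primary key alone recovers creation order; no created_at index.
labels = [row[0] for row in conn.execute("SELECT label FROM events ORDER BY id")]
print(labels)  # ['first', 'second', 'third']
```

The ordering is only rough: two ids created in the same millisecond sort by their random bits, not by which was created first.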



