A New Life for Certificate Revocation Lists (letsencrypt.org)
121 points by grappler on Sept 7, 2022 | 69 comments



I was kind of surprised that OCSP stapling didn't get any mention at all. I thought that was a major improvement in both resource cost and privacy? Since the time-stamped response is proxied by the site operator rather than going directly to the CA, the load is almost entirely shifted to the site itself, and the site itself is the only one who knows a given IP is asking for it, which is fine because obviously the site knows a given browser is connecting to it anyway. It's decentralized again. I vaguely remember at one point there was a major limitation of only supporting a single OCSP response at a time, but I thought that was dealt with via a later RFC and then entirely obviated as an issue by TLS 1.3.

Did something happen there or some other significant issue get discovered? I'm curious why the move back to CRLs (albeit improved) vs must-staple. It seemed like a reasonably elegant and straightforward solution that fit the web pretty well.


Yes, I noticed this too and asked the author about it [0].

IMO, OCSP stapling is the best overall solution until certificate lifetimes are shorter (< 7 days).

As a server developer I'm worried that the focus on independent CRLs will make it difficult to automate certificates in the face of revocation. Currently, Caddy staples OCSP for all certificates by default, caches the staples, and refreshes them halfway through their lifetime. Works great. Every server should do this. And when an OCSP response is discovered to be Revoked, Caddy automatically replaces the certificate. Works great. Every server should do this.
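
For illustration, here's roughly what that loop looks like in Python (not Caddy's actual code; fetch_ocsp and replace_certificate are hypothetical stand-ins for an OCSP client and an ACME client):

  import time
  from datetime import datetime, timezone

  def maintain_staple(cert, fetch_ocsp, replace_certificate):
      # Cache the signed OCSP response and refresh it halfway through its validity.
      staple = fetch_ocsp(cert)
      while True:
          refresh_at = staple.this_update + (staple.next_update - staple.this_update) / 2
          time.sleep(max(0, (refresh_at - datetime.now(timezone.utc)).total_seconds()))
          staple = fetch_ocsp(cert)
          if staple.status == "revoked":
              # Don't keep stapling a revoked cert: get a new certificate instead.
              cert = replace_certificate(cert)
              staple = fetch_ocsp(cert)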

If every browser is independently going to decide which certificates to distrust, now I am not sure of a good, authoritative way to determine "revoked" and then replace certificates automatically. I'm worried this will hurt the TLS ecosystem unless we answer those questions first.

Actually, let's just shorten certificate lifetimes and be done with it already.

Main blockers to short cert lifetimes:

- CA's uptime determines Web's uptime.

Main solution:

- Multiple redundant ACME CAs. If one goes down, try another. (This is what Caddy already does.)

[0]: https://twitter.com/mholt6/status/1567559325949763588


Replying to this as it's on top, and first, thanks for the reply (and the sibling replies as well). The challenges of getting servers to update and implement it being a practical roadblock makes sense; I was more surprised just to not see it mentioned at all. I'd have been fully prepared for a few sentences along the lines of "this would be ideal if doing it from scratch, but it'd be hard to get everyone to go along now, the reverse perils of decentralization". What you write is also interesting.


I too found this troubling.

Mainly because we are not talking about validation performance, and OCSP stapling is an excellent performance fix.

Not to mention all the privacy issues that OCSP stapling really fixes.


Another tool in the toolbox is ARI. https://datatracker.ietf.org/doc/draft-acme-ari/


I am lukewarm about ARI. It's nice that clients can know the most convenient time to renew a certificate to avoid overwhelming the server.

However, it is optional, so we will repeat the same mistake we made with optional OCSP stapling.

Only nice clients will implement ARI, but they are the clients that need it the least because if they go to the trouble to support ARI they probably already have friendly netizen programming.

As for revocations, ARI doesn't make much sense to me. If we know a certificate will be revoked soon, we might as well stop trusting it right now. Why continue to trust a certificate that we know is being revoked?

Maybe I'm totally missing the point of ARI.


Here's my pitch:

The worst revocations are mass ones that the site operator didn't request. ARI gives a heads-up to renew early, maybe because all certificates issued using HTTP-01 are getting revoked in two days.

Revocations aren't always about distrusting a specific certificate. There's likely nothing wrong with the certificate, but it needs replacement for ecosystem cleanliness. Regardless, 40M certs are being revoked in a 10 minute window on Saturday (oof, because that's the covenant with the BRs, not because Saturday is somehow not awful!). Reissuing 40M certs may take a dozen hours (1000/sec), even if all clients work optimally and begin immediately after OCSP changes status. During that dozen hours, those certs are all already revoked.
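
Spelling that out (the 1000/sec issuance rate is just the assumption above, not a measured figure):

  certs_to_reissue = 40_000_000
  issuance_rate = 1_000                            # certs per second, assumed
  print(certs_to_reissue / issuance_rate / 3600)   # ~11.1 hours of reissuance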

It'd be nice if clients could be told in advance: replace this certificate right away, regardless of its validity period. Get it done early before the crowd forms and replacement requires queuing up.

(Obviously using multiple CAs mitigates the downsides to waiting until revocation)


If there was "nothing wrong" with the certificate, it shouldn't be revoked. I get that there's red tape, but the policy surely exists for a reason. I'm obviously familiar with LE's prior mass revocations where, for the most part, there was no security concern -- but the thing is, the policy exists because we can't be sure. Those revocations were out of an abundance of caution because we couldn't be sure.

So yes, if we're being strict like the policy is, an early renewal signal is a red flag that a certificate can't be trusted. There might not be anything wrong with it, but we can no longer be sure.

I mean, there's also nothing wrong with a certificate 2 seconds after it expires. Probably. But we can't be sure. And because of that, we immediately distrust the certificate when it expires. (There might be something wrong with it before it expires too. But that's less likely because less time has passed, so we allow it, I guess.)

I think the vision is nice. I really do. I just think the clients that need it most won't support it.


must-staple requires additional effort from server operators, which is very challenging to roll out. This new scheme can be implemented by the browsers and CAs together, which is vastly simpler than having every webserver in the world turn on OCSP stapling.

There isn't any particular force driving OCSP stapling: browsers can't turn it on until it is ubiquitous. CAs don't want to enforce must-staple because their users will need an enormous amount of help rolling it out. Site operators don't care if browsers are fetching OCSP (especially since Chrome and Edge don't).


OCSP stapling increases how large the first response from the server is, increasing the time to first useful byte.


It's on the order of a few hundred bytes, only happens on the first connection, and is still orders of magnitude faster and more effective than the alternative (no stapling, which requires the client to make separate round-trips to a third party service which has all sorts of other problems).

OCSP stapling is temporary, anyway: just until we get certificate lifetimes short enough to make revocation irrelevant.*

* This is how it should be, but the industry seems to be going the direction of keeping cert lifetimes longer.


A few hundred bytes is a ton of data for something that a lot of people don't particularly care about, and which is still broken despite OCSP stapling. People don't have low-latency, clean internet all over the world, and adding something like this on top of an already pretty bloated negotiation doesn't help.

As for the alternative, not a single browser out there validates OCSP out of the box on the hot path (i.e. actually waits for OCSP before proceeding).


This is a browser problem, but it is configurable.


Speaking from a PKI perspective, you may head towards shorter-lived certificates in many use cases, NPE / automatable re-keying being examples of that.

However, for strong PKI on a consumer or enterprise encryption/decryption device, e.g. smartcards, there are big costs associated with shortening lifetimes. Not going to fly there.


I also love OCSP stapling but there are some limitations:

- the webserver needs to implement it

- the admin needs to enable it

- the webserver needs internet access

Maybe they saw with some telemetry that very few websites actually enable OCSP stapling and decided to implement a fix that covers all certs and can really be deployed.


Firefox Telemetry shows Beta 104 users encountered stapling on 13.95% of TLS handshakes. [1]

The stapling telemetry is no longer turned on in Release [2], and even if it were, you have to do special things to look at Release data, but some years back (~2018 maybe?) I remember Release stapling was substantially lower than the more tech-savvy Beta and Nightly populations. Which is pretty normal, as tech-oriented sites are more likely to turn on advanced features.

[1] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumula...

[2] "prerelease" https://probes.telemetry.mozilla.org/?search=stapl&view=deta...


OCSP Stapling is incredibly stupid. OCSP Stapling is either ignored because it's not present, or it's effectively just an override for the NotBefore and NotAfter fields. There's never a reason as a server operator to present an invalid NotAfter, and it represents a substantial surface area for client libraries to have to respect, in the area that's already most prone to mistakes (the recursive descent to find a certificate chain).

It's much saner to issue a new, shorter-lived cert. Reduce certificate lifetimes to match whatever you'd use OCSP stapling for. Continue to use ACME. Figure out rotation lifetimes such that I can sleep at night (i.e. they need to be at least 2 days, so that if the rotation fails I've still got time to wake up and start working). Work on building client and server tooling to make it easy to accept rotated certificates (i.e. reload your client certificates on SIGHUP).
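
A minimal sketch of the "reload on SIGHUP" part in Python (paths and port are made up; a real server would also want to reload without dropping in-flight connections):

  import http.server, signal, ssl

  CERT, KEY = "/etc/myapp/cert.pem", "/etc/myapp/key.pem"

  ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
  ctx.load_cert_chain(CERT, KEY)

  def reload_certs(signum, frame):
      # Re-read the rotated cert/key; new handshakes pick up the new chain.
      ctx.load_cert_chain(CERT, KEY)

  signal.signal(signal.SIGHUP, reload_certs)

  httpd = http.server.HTTPServer(("", 8443), http.server.SimpleHTTPRequestHandler)
  httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
  httpd.serve_forever()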

The advantage of CRLs isn't that they can be used offline, it's that it allows your security team to burn certificates rather than waiting for the CA to do so. You should always subscribe to your CA's CRL too, but you should have a CRL for your own internal use, too.


> The advantage of CRLs isn't that they can be used offline, it's that it allows your security team to burn certificates rather than waiting for the CA to do so. You should always subscribe to your CA's CRL too, but you should have a CRL for your own internal use, too.

Conventionally speaking, this is incorrect.

CRLs are signed by the Originating CA in the conventional trust model.

Most CAs won't give you a certificate with the ability to generate your own CRL sharing the same trust, because it can be weaponized to cause a denial of service.

This then brings up non-conventional trust models, and Validation Authority as co-equal to CA situations for validity, which can be significantly more challenging and require VA certificate insertion into every Relying Party, just to name a few things, not to mention all the security concerns.


I'm aware. That trust model is inherently broken, and it's not followed anyway. In practice, any CA you trust can sign a revocation for any certificate (the serial numbers are all grouped together) - and that's the way it should be. You should not wait for the issuing CA to get its act together; you should burn a certificate as soon as anyone distrusts it, and if you have a rogue CA that starts burning Google.com and other important sites, it's better that they burn them (and take you offline) than that they issue false certs (and leak all your data); it's also far more obvious.

https://www.imperialviolet.org/2014/04/19/revchecking.html

Revocation checking is useful for your security team to blacklist sites. That's the only useful use.


> If we had an incident where we needed to revoke every single one of those certificates at the same time, the resulting CRL would be over 8 gigabytes.

I don't know much about this stuff, so apologies if this is a silly question:

If you needed to revoke all the certificates, couldn't you just revoke the handful of intermediate certificates and call it a day? I assume you'd want to revoke them anyway if there's a situation severe enough that it merits revoking 200 million certificates.


Yes, it is likely we’d revoke an intermediate if a significant fraction of all issued certs had to be revoked. But we do want our revocation infrastructure to support revoking all certs if needed.

We have a set of backup intermediates that can be activated if we had to revoke the active ones for any reason, so the disruption wouldn’t be too high hopefully.

(I work at Let’s Encrypt, but this is my own opinion and not that of my employer)


It's a good question, if I read you rightly.

There is no "handful" of intermediate certificates -- there are precisely 4 (for Let's Encrypt [0]) and they are essentially on-line root certificates. And if those certificates aren't even compromised, revoking them would only harm the ecosystem.

[0]: https://letsencrypt.org/certificates/


According to the link you posted, there are intermediate CAs. Those could be revoked and effectively revoke all the end entity certificates.


When they write "essentially on-line root certificates" they don't mean that they're literally on-line root certificates, because that's prohibited.

They're essentially on-line root certificates because they serve most of the function that such roots would serve if they were allowed. There aren't a bunch more available to replace them, so this means recovery now requires a key ceremony, figure on a week to a month to arrange that.

Whereas if you're able to "just" revoke 10 million end entity certificates you can recover immediately.


> This means that they’re often very large – easily the size of a whole movie.

Couldn't they just use actual units? This says absolutely nothing.


The article does later state that Let's Encrypt's own expected worst case is 8 GB in one CRL file, in the hopefully unlikely scenario that every unexpired certificate they manage was revoked.


Merkle Hash Trees are the well-known solution for this, whenever they decide to update the protocols (they were software-patent-encumbered until 2017).


Merkle Hash Trees are not a solution for offline checks. You can very easily check membership in a tree if someone sends you a proof of membership. But to get a proof, you'd either have to generate it yourself (meaning you're storing the whole set anyway) or get it from a third party (ie an online lookup, which is what OCSP is). The forefront of CRL design is in set compression techniques like Ribbon filters https://arxiv.org/pdf/2103.02515.pdf
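
To make the distinction concrete, verifying a Merkle inclusion proof is trivial, but the proof itself has to come from someone holding the whole tree, which is exactly the online lookup being avoided. A sketch, with a simple left/right path encoding:

  import hashlib

  def h(data: bytes) -> bytes:
      return hashlib.sha256(data).digest()

  def verify_inclusion(leaf: bytes, proof, root: bytes) -> bool:
      # proof is a list of (sibling_hash, side) pairs from leaf to root,
      # where side is "L" if the sibling sits to the left of the running node.
      node = h(leaf)
      for sibling, side in proof:
          node = h(sibling + node) if side == "L" else h(node + sibling)
      return node == root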


I never said MHT are a solution for offline checks. That is a completely different problem space.

In the space domain, satellite ground systems have used a combination of guards, one way transmission, and others, although I am in favor of system-specific certificate whitelisting via running OCSP with local VAs, backed up by smart clients, with CRLs on the filesystem and in network shares.

CRL processing can cause timeouts, which is why it is always less preferred than the much more lightweight OCSP by comparison.

Furthermore, CRLs are blacklists, whereas newer OCSP is not exclusively a blacklist, and OCSP permits more flexibility with the various trust models that exist for it.


It sounds close to what the article suggests Let's Encrypt is using internally to store/generate the CRL files, though the article doesn't get into a lot of technical details to tell exactly if that is the case.


No. That isn't the same.

CRLs and OCSP are about individual certificates.

The beauty of Merkle Hash Trees for validation is that they permit statements, e.g.:

CA_X = CA_2 and 156 ≤ X < 343

The statement c_j indicates that the certificate with serial number X = 156 issued by CA2 has been revoked, while the certificates with serial numbers from X = 157 to X = 343 (both included) issued by CA2 have not been revoked.

From https://www.researchgate.net/publication/220066804_Certifica...


Wouldn't a CRL that lists all the revoked certificates tell you that exact same information?


A CRL's format is to list all the serial numbers in a blacklist, which means a CRL for a large, well-used PKI is going to be gigantic by comparison to an MHT.

A CRL is not equivalent to modern OCSP, because modern OCSP can operate as a whitelist or a blacklist. However, CRLs are generally more secure than some of the OCSP variants, because strong OCSP (nonced + authenticated/registered) is used less, and CRLs are not subject to replay attacks like vanilla OCSP can be.


All the revocation checking mechanisms have their pitfalls. It is such a complex issue. CRLs have one very important benefit, namely that they can be prefetched and updated regularly, which is important in the case of a firewall solution, for instance. Read the following article for more details.

https://dev.to/coroner/why-do-certificate-revocation-checkin...


One Olympic swimming football bus tree worth of data.

When I was reading the thread the other day about a tree's worth of oxygen from the MOXIE experiment, I couldn't help thinking: why not just use a term everyone is familiar with, litres of air per minute?

Enough oxygen to sustain an adult at rest for x minutes.

I'm beginning to suspect there's an in-joke with science / tech writers about strained analogies.

And I'm not in.


> I'm beginning to suspect there's an in-joke with science / tech writers about strained analogies.

Which has been going on for longer than I've been alive. The idea (I assume) is to take a large number that's hard to conceive and turn it into something everyone can relate to.

But inevitably, they choose things that few can actually relate to, or things that are so vague/variable as to be meaningless. It just adds more confusion all around.

It has to be an intentional joke.


Browsers and CAs still deciding for the user what's best with yet another centralized database that we have to "trust" is complete and implemented correctly. I don't see how this is so hard: just let us download the CRLs. Maybe add them on a torrent-like system so they can be shared (and validated) peer-to-peer, or at least have hundreds or thousands of mirrors that can provide that data. All this "it's too big and the user would have to download it" bullshit is just pushing us more and more into "computing as a service"; aka: we OWN your digital life.

Edit: I was partly wrong, this is a good thing because you CAN download the CRLs now (see comments below here for info), whereas previously you couldn't. Your browser still probably won't support a full CRL download, but I could be pleasantly surprised.


> I don't see how this is so hard: just let us download the CRLs

Previously, you couldn't do this, because not all CAs published CRLs.

Beginning October 1, you will be able to just download the CRLs, because Apple and Mozilla are requiring it.

It's therefore unclear what your beef is.


> Beginning October 1, you will be able to just download the CRLs

Correction: Apple and Mozilla will be able to just download the CRLs. Not me. The link in the post SPECIFICALLY says us common plebes don't get that right.


Where does the post say that?

If you think it's because the URLs will be disclosed in the CCADB, note that the contents of the CCADB are published here: https://www.ccadb.org/resources

Specifically, the CRL URLs can be found in this CSV file: http://ccadb-public.secure.force.com/ccadb/AllCertificateRec...


I was pretty sure this section meant what I said but maybe you can get them from that database without being a BigCo?:

“Our new CRL URLs will be disclosed only in CCADB, so that the Apple and Mozilla root programs can consume them without exposing them to potentially large download traffic from the rest of the internet at large.”


I assumed what they meant is that the database is publicly available but that browser implementations won't be directly pulling CRLs. Instead the browser providers pull the CRLs and create a compressed version that their browser users download.

In the same way that you can technically query the DNS root servers yourself but you don't tend to do that because your computer will query a more downstream DNS server.


Yes, that's exactly what it means.


I have a cron job that pulls that CSV file once a day. I assure you I am not a BigCo.


I was wrong, thanks for correcting me :)


"The connection has timed out. An error occurred during a connection to ccadb-public.secure.force.com."


Works for me, though the time to first byte is currently rather long.


> "They process the CRLs into a smaller format such as a Bloom filter, then push the new compressed object to all of the installed browser instances using pre-existing rapid update mechanisms. Firefox, for example, is pushing updates as quickly as every 6 hours."

Does anyone know what the rate of revocations is?

I can easily imagine a situation where it is high enough to cause a browser update every few hours. That is for every installed browser, because - as they say in the article - Browser-Summarized CRLs are "proprietary, browser-specific CRLs". Moreover there are non-browser clients which we must consider if we are to take this proposal seriously.

------

My quick back of the envelope estimation:

CRL size: 4GiB (they say in the article that it could be easily the size of a movie)

Average Cert Size: 75 bytes (first hit in Google, no idea if reliable number)

Time Span: 825 days (CRLs have only unexpired certs and ones older than 825 should all be expired)

4GiB/(75B/cert)/825days = 69141 certs/day

Sounds way too high to me. Where am I wrong?
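
Spelling out the same arithmetic (all three inputs are the guesses above, not measurements):

  crl_size  = 4 * 2**30        # 4 GiB
  per_cert  = 75               # bytes per revoked-cert entry, unverified
  time_span = 825              # days of unexpired certs

  print(crl_size / per_cert / time_span)   # ~69,000 revocations/day implied by those inputs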


In the CRLite paper [1] the bandwidth cost was estimated to be ~600 KB per day. Most crucially they do not directly store certificates but only boolean flags for hashes of virtually all once-valid certificates in existence. This is made possible because of Certificate Transparency; once the signature has been verified there can be only so many hashes to check.

[1] https://obj.umiacs.umd.edu/papers_for_stories/crlite_oakland...


> There’s still a long way to go before revocation in the Web PKI is truly fixed.

It won't be fixed until we have name constraints on CA certificates, and a way to decorate trust anchors with local policy name constraints.


Lots of this information is completely bogus.

> But because OCSP infrastructure has to be running constantly and can suffer downtime just like any other web service, most browsers treat getting no response at all as equivalent to getting a “not revoked” response. This means that attackers can prevent you from discovering that a certificate has been revoked simply by blocking all of your requests for OCSP information.

This is false. Non-nonce OCSP is inherently cachable, and replayable. That means you can have your own HA setups with HA OCSP clients talking to HA OCSP servers (repeaters & responders) backed up by caching in commercial CDNs, and local caching servers like bluecoats.
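
A sketch of what that caching amounts to: a non-nonced response is just a signed blob valid until its nextUpdate, so any layer can replay it (fetch_from_ca here is a hypothetical upstream call, not a real library API):

  from datetime import datetime, timezone

  cache = {}   # (issuer_hash, serial) -> signed OCSP response

  def cached_ocsp(issuer_hash, serial, fetch_from_ca):
      key = (issuer_hash, serial)
      resp = cache.get(key)
      if resp is None or resp.next_update <= datetime.now(timezone.utc):
          resp = fetch_from_ca(issuer_hash, serial)   # only hit the CA on miss/expiry
          cache[key] = resp
      return resp   # the same signed bytes can be replayed to every client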

Likewise, OCSP stapling helps remove much of the performance and privacy issues, pushing it to the serving webserver.

Beyond this, you can just use squid or localized HA OCSP services, and do some DNS rewriting to make it even more highly available.

Nonced OCSP is the rare beast that needs to be online, but there are HA OCSP deployments with smart OCSP clients.

> To help reduce load on a CA’s OCSP services, OCSP responses are valid and can be cached for about a week. But this means that clients don’t retrieve updates very frequently, and often continue to trust certificates for a week after they’re revoked.

Trust Stores are inherently manageable. The lag around revocation completely depends on CRL/OCSP publishing, and client update requests.

> And perhaps worst of all: because your browser makes an OCSP request for every website you visit, a malicious (or legally compelled) CA could track your browsing behavior by keeping track of what sites you request OCSP for.

This is why we advocate OCSP Stapling and use of CDNs for OCSP & CRL cache hits. Furthermore, localized OCSP mentioned above decentralizes this even further.

> So both of the existing solutions don’t really work: CRLs are so inefficient that most browsers don’t check them, and OCSP is so unreliable that most browsers don’t check it. We need something better.

CRLs & OCSP work pretty well when actually supported.

When Diginotar happened I polled every single publicly available commercial CA - strangely, a ton of them were not producing any CRL/OCSP at all, putting clients into a fail-open mode.

Lesson of the story: don't blame a protocol for lazy CAs, bad implementations, or the lack of operational excellence from many vendors.


I can't wait for some big site certs to false positive in a CRL bloom filter and cause a big outage : )


CRLite builds a cascade of Bloom filters to ensure no false positives.

For Firefox end users, a certificate only gets tested against the filter cascade if it is known to have been included in its creation (by examining the embedded SCT timestamps). If it's not definite that the certificate was used to generate the filter, then Firefox reverts to OCSP.

(I'm one of the authors of CRLite in Firefox: https://insufficient.coffee/2020/12/01/crlite-part-4-infrast... )


Ahh thanks for the link and sorry for the snark; the finite universe optimization is cool! [0]

  # Why is CRLite able to compress so much data?
  
  Bloom filters are probabilistic data structures with an error rate due to data collisions. However, if you know the whole range of data that might be tested against the filter, you can compute all the false positives and build another layer to resolve those. Then you keep going until there are no more false positives. In practice, this happens in 25 to 30 layers, which results in substantial compression.
EDIT: Is there any risk of filter blow up (think 1000's of layers) if a CA did a mass revocation (maybe some root key leak)?

[0] https://github.com/mozilla/crlite/wiki#why-is-crlite-able-to...
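
Here's a toy version of that layering, just to show the mechanics (plain hashing stands in for the real Bloom/Ribbon filters, so none of this is Mozilla's actual code, and the numbers are made up):

  class ToyFilter:
      # Deliberately lossy membership test: keeps only a few hash slots,
      # salted per layer so each layer makes different mistakes.
      def __init__(self, items, salt):
          self.salt = salt
          self.size = 8 * len(items) + 1            # roughly 1-in-8 false positives
          self.slots = {hash((salt, x)) % self.size for x in items}
      def __contains__(self, x):
          return hash((self.salt, x)) % self.size in self.slots

  def build_cascade(revoked, universe):
      # Knowing the full universe (thanks to CT), each layer's false positives
      # can be enumerated exactly and become the next layer's contents.
      layers, included, excluded = [], set(revoked), set(universe) - set(revoked)
      while included:
          f = ToyFilter(included, salt=len(layers))
          layers.append(f)
          included, excluded = {x for x in excluded if x in f}, included
      return layers

  def is_revoked(cert, layers):
      for i, f in enumerate(layers):
          if cert not in f:
              return i % 2 == 1     # dropping out of an even layer means "not revoked"
      return len(layers) % 2 == 1

  universe = range(100_000)                         # pretend serial numbers
  revoked = set(range(0, 100_000, 50))              # pretend 2% are revoked
  layers = build_cascade(revoked, universe)
  assert all(is_revoked(c, layers) == (c in revoked) for c in universe)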


Mozilla worked with some researchers to analyze a bunch of degenerate cases like that and basically the answer is 'no', that the CRLite paper's calculation of an optimal false positive rate per layer works out quite well.

What does happen in CRLite is that you can't keep shipping the tiny "stash" updates to clients, you have to mint a whole new .mlbf filter file, which is about a megabyte. [Edit:] Then you can resume the "stash" updates from there, but the ecosystem 'shock' requires a regeneration of the filter.

(There was supposed to be a blogpost on the Mozilla blog from the research teams; I don't know if it was ever written.)


J.C. has already answered for Firefox; for Apple's system (valid.apple.com), if there's a bloom filter hit, the client double checks via OCSP before failing the connection.

Source: a WWDC 2017 talk which unfortunately I can't find online anymore


A concern is: Does it fail closed? It's easy to imagine an Apple engineer finds this fails sometimes for crap-CA whose OCSP server is kinda-sorta maybe working, in Bremen at least, usually, although not on weekends, and they go "Oh, I can fix this, just ignore if it fails" and now we've got our old friend the "Seatbelt that snaps when you crash" back.

Mozilla's choice here avoids that problem coming up which means nobody needs to push back when it gets "solved" in this regressive way.


That is a very good question and I don't know the answer, but if you want to go source code spelunking, you can probably find the answer on https://opensource.apple.com/


What about if you’re calling an API? Is there compressed CRL support in libcurl or similar?


Is certificate revocation really so common that a CRL of a normal authority is gigabyte-sized?


According to [1], "On a typical Monday, we would expect to see a total of around 22,000-30,000 SSL certificates being revoked over the course of the day", i.e. ~8 million per year. [3] meanwhile says "1.8 million certificates are revoked per year".

Looking at a random CRL [2] it's 41 bytes per revoked certificate.

8 million records at 41 bytes per record would be 300+ Megabytes. And a cautious CA might keep revoked certificates in their CRL for more than a year.

So if an event like heartbleed happened again and uncommonly large numbers of certificates needed to be revoked, the gigabyte range is within the bounds of possibility.

[1] https://news.netcraft.com/archives/2014/04/11/heartbleed-cer... [2] http://crl3.digicert.com/Omniroot2025.crl [3] https://www.grc.com/revocation/crlsets.htm
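
Rough size arithmetic with those figures (all inputs are the estimates quoted above, not measurements):

  revocations_per_day = 22_000     # low end of the Netcraft figure
  bytes_per_entry     = 41         # from the DigiCert CRL linked above
  days_retained       = 365        # entries stay until the certs would have expired

  crl_bytes = revocations_per_day * bytes_per_entry * days_retained
  print(crl_bytes / 1e6)           # ~330 MB for a year's worth of revocations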


> And a cautious CA might keep revoked certificates in their CRL for more than a year.

A CA does need to keep a revoked certificate in the CRL until at least the natural expiration date on the certificate, so CAs that issue certificates with expiration dates 5 or more years out may need to keep entries in their CRLs for much longer than just a year, simply by nature of those expiration dates.


This CRL rule is about the Web PKI, for which current policy requires that new certificates expire after no more than 398 days.

The previous rule was that certificates could last up to 825 days; however, that policy changed at the end of August 2020, so there are no extant certificates under those rules which expire after this year. Before that the policy was 39 months, but the last such certificate expired in 2021. And before that the policy was 5 years, but that policy changed in 2015, so such certificates are long expired.


The problem is, you need to design for the worst-case scenario here, because when the worst case does happen, the last thing you need is for your revocation system to not work because it doesn't scale.

So, no, it's not common. But it's necessary.


No, not at all.

The bad case would be if Let's Encrypt discovers a problem (like a security flaw or implementation error in a validation method, as happened with the TLS-ALPN-01 method before) and concludes that it has to mass-revoke a very large number of affected certificates.


Or perhaps if another CloudBleed style incident ever happens.


The DoD uses x509, and CRLs, in those Common Access Cards (CAC) everyone in the org has. Since this covers most of the armed forces, that's fairly large. As of 2012[1] this was around 200 MB of CRLs and was only expected to get larger over time.

[1] https://dl.dod.cyber.mil/wp-content/uploads/pki-pke/pdf/uncl... - Pg 7, under Local Cache


As mentioned above, DoD uses smart strategies around certificate validation.

CDNs + Localized OCSP + Tactical OCSP + Smart OCSP Clients + Network caching + OCSP & CRLs on the filesystem, just to name a few (not including delta CRLs and other solutions).

The DoD OCSP Responders are configured to share hash sets with downstream OCSP Responders & Repeaters, which makes promulgation particularly easy.


Shitty title for an article considering recent events



