Thanks for taking the time to read through this and write some feedback for me. I sincerely appreciate it!
I wrote this post late last night, so pardon the delay with responding. Sleep happens.
> It seems to me that all you're doing is providing encryption-at-rest-as-a-service. Why shouldn't your clients simply skip the middle-man and encrypt the data at rest themselves (entirely avoiding the traffic and costs incurred with using your services)?
There is nothing stopping clients from making that call for themselves. At my previous employers, I built similar systems multiple times. In those cases, we always checked first for open source solutions; at the time, none of them fit the bill, so we ended up building it in-house.
Which leads into your second point about "avoiding traffic and costs". We're making this open source and something that clients can self-host themselves precisely for that reason. Other players in the "Tokenization" market aren't open source or even generally self-hostable. That's one of the key differentiators of what we're building.
> Moreover, why should clients trust you with their sensitive customer content, encryption not withstanding?
Well, they don't have to. It's open source, so they can check out the code themselves. And, with the way we've designed the system, there is no single point of failure that results in leaking all of the data.
> What are your encryption-at-rest practices and how can you guarantee they are future-proof?
The encryption at rest uses AES-256-GCM, as implemented by Amazon S3's server-side encryption. So that part of the puzzle is well solved.
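For a concrete picture of the write path, here's a minimal sketch of enabling S3 server-side encryption on upload with the AWS SDK v3. The bucket name, region, and function are placeholders for illustration, not code from our repo:

```typescript
// Sketch: uploading with S3 server-side encryption (SSE-S3, AES-256) enabled.
// Bucket, region, and key names are placeholders, not from the LunaSec codebase.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-west-2" });

export async function putEncryptedObject(key: string, body: Buffer): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "example-tokenizer-bucket", // placeholder bucket name
      Key: key,
      Body: body,
      ServerSideEncryption: "AES256", // S3 handles the at-rest encryption
    })
  );
}
```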
The rest of our system uses off-the-shelf cryptographic hashing (SHA-3). For the key derivation algorithm, we've implemented NIST SP 800-108 [0]. The key derivation is basically a cryptographically secure random number generator seeded with the output of a SHA-3 hash, and we use it to generate multiple random values. I'll expand on this in the docs soon (and you'll be able to read the source code).
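In the meantime, here's a rough sketch of the counter-mode construction that SP 800-108 describes, using HMAC with SHA3-256 as the PRF. This is only meant to show the shape of the construction; the labels, encodings, and placeholder inputs are made up for the example and aren't our exact implementation:

```typescript
// Sketch of a NIST SP 800-108 counter-mode KDF with HMAC-SHA3-256 as the PRF.
// Illustrative only: labels, encodings, and inputs are placeholders.
import { createHmac } from "crypto";

function kdfCounterMode(secret: Buffer, label: string, context: Buffer, lengthBytes: number): Buffer {
  const prfOutputBytes = 32; // SHA3-256 output size
  const numBlocks = Math.ceil(lengthBytes / prfOutputBytes);
  const blocks: Buffer[] = [];

  const lengthBits = Buffer.alloc(4);
  lengthBits.writeUInt32BE(lengthBytes * 8, 0);

  for (let i = 1; i <= numBlocks; i++) {
    const counter = Buffer.alloc(4);
    counter.writeUInt32BE(i, 0);

    // K(i) := PRF(secret, [i] || Label || 0x00 || Context || [L])
    blocks.push(
      createHmac("sha3-256", secret)
        .update(counter)
        .update(Buffer.from(label, "utf8"))
        .update(Buffer.from([0x00]))
        .update(context)
        .update(lengthBits)
        .digest()
    );
  }

  return Buffer.concat(blocks).subarray(0, lengthBytes);
}

// Deriving multiple independent values from the same secret by varying the label
// (the labels and inputs below are hypothetical).
const tokenizerSecret = Buffer.from("example-tokenizer-secret");
const token = Buffer.from("example-token");
const encryptionKey = kdfCounterMode(tokenizerSecret, "encryption-key", token, 32);
const lookupValue = kdfCounterMode(tokenizerSecret, "lookup-value", token, 32);
```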
We're intentionally not attempting anything novel with the actual crypto math. We're just using existing, basic primitives and chaining them together (again, in accordance with the NIST publication linked above).
> And finally - your API is going to be a major single-point-of-failure for your clients. If you're down, they're down. How do you intend to mitigate that?
Well, it's open source and self-hosted. That's one of the primary goals of the system, precisely to _avoid_ that failure mode. At my previous employers, when we evaluated vendor solutions, being closed source and not self-hostable were both blockers to adoption. Being beholden to a questionable vendor is a crappy situation to be in when you have 5+ 9s to maintain.
A common approach to adding "Tokenization" to apps (used by companies like VeryGoodSecurity) is to introduce an HTTP proxy with request rewriting. They rewrite requests to perform the tokenization/detokenization for you. It's simple to onboard with, but it has a ton of caveats (like them going down and tanking your app).
We've also designed this to "gracefully degrade". The "Secure Components" that live in the browser are individual fields, so if LunaSec goes down, only those inputs break. That could still break sign-ups, which is also crappy, but at least not _everything_ breaks all at once.
Finally, we've also designed the backend "Tokenizer" service to be effectively stateless. The only "upstream" service it depends on is Amazon S3, and the same is true of the front-end components. By default, Amazon S3 offers 99.99% availability. We have plans to add geo-replication support that would push that past 6 9s of availability by replicating data across regions.
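The back-of-the-envelope math behind that claim, assuming the replicas fail independently (which real regions never do perfectly, so treat it as an upper bound):

```typescript
// Availability with geo-replicated, independently failing replicas.
// Independence is an optimistic assumption, so treat this as an upper bound.
const singleRegionAvailability = 0.9999; // the 99.99% S3 figure quoted above

function replicatedAvailability(regions: number): number {
  return 1 - Math.pow(1 - singleRegionAvailability, regions);
}

console.log(replicatedAvailability(1)); // 0.9999     -> 4 nines
console.log(replicatedAvailability(2)); // 0.99999999 -> 8 nines in theory, comfortably past 6
```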
> What if an attacker figures out how the decryption key is "deterministically" derived?
This is a real attack scenario, and something we've designed around. I'll make sure to write some docs to elaborate on this soon.
TL;DR though: if an attacker manages to leak the "Tokenizer Secret" that is used to "deterministically derive" the encryption key and lookup values, they will _also_ need a copy of every "Token" for that to be valuable. And, in addition, they need access to read the encrypted data too. By itself, being able to derive keys is not enough; you still need the other two pieces (the token and the ciphertext).
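To make the "three pieces" point concrete, here's a hypothetical detokenize path. It uses Node's built-in HKDF for brevity instead of the SP 800-108 construction sketched earlier, and the record layout is invented for the example, so don't read it as our actual wire format:

```typescript
// Sketch: recovering plaintext requires the tokenizer secret, the token, AND the ciphertext.
// HKDF stands in for the SP 800-108 KDF here; the record shape is hypothetical.
import { hkdfSync, createDecipheriv } from "crypto";

interface EncryptedRecord {
  iv: Buffer;         // nonce stored alongside the ciphertext
  authTag: Buffer;    // AES-GCM authentication tag
  ciphertext: Buffer; // what an attacker might exfiltrate from storage
}

function detokenize(tokenizerSecret: Buffer, token: string, record: EncryptedRecord): Buffer {
  // 1. The key only exists if you hold BOTH the tokenizer secret AND the token.
  const key = Buffer.from(
    hkdfSync("sha3-256", tokenizerSecret, Buffer.alloc(0), Buffer.from(token, "utf8"), 32)
  );

  // 2. Even with the key, you still need the ciphertext itself.
  const decipher = createDecipheriv("aes-256-gcm", key, record.iv);
  decipher.setAuthTag(record.authTag);
  return Buffer.concat([decipher.update(record.ciphertext), decipher.final()]);
}
```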
> You would need to re-encrypt the original customer content AND somehow fix the mappings between the old tokens your client keeps in their database, and the new ones you'd have to generate post changing the algorithm. This is an attack that brings down your whole concept.
You're right that this is a painful part of the design. The only way to perform a full rotation onto a new "key derivation algorithm" is to decrypt everything with the old keys and re-encrypt it with the new ones.
That's the nature of security. There is always going to be some form of tradeoff made.
Fortunately, there is a way to mitigate this: we can use public-key cryptography to encrypt a copy of the token (or the encryption keys, or all of the above) in a way the running system can't reverse. In the event of a "full system compromise", you can use the offline private key to decrypt all of the data (and then re-encrypt it without rotating the tokens in upstream applications).
For that to work, you need to keep the private key somewhere safe. In reality, you'd probably want to use something like Shamir's Secret Sharing to require multiple parties to collaborate in order to reconstruct the key. And you'd want to keep the shares in a safe deposit box, probably.
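As a rough illustration of that escrow idea (not our actual implementation; the key handling and names below are assumptions for the example): encrypt a copy of each token under an offline RSA public key at write time, so only whoever reassembles the private key can recover it later.

```typescript
// Sketch: write-time escrow of tokens under an offline RSA public key (RSA-OAEP).
// Illustrative only; padding choice, key handling, and storage layout are assumptions.
import { publicEncrypt, privateDecrypt, constants } from "crypto";

// The public key ships with the tokenizer; the private key stays offline
// (ideally split among several people, e.g. with Shamir's Secret Sharing).
export function escrowToken(backupPublicKeyPem: string, token: string): Buffer {
  return publicEncrypt(
    { key: backupPublicKeyPem, padding: constants.RSA_PKCS1_OAEP_PADDING },
    Buffer.from(token, "utf8")
  );
}

// Disaster-recovery path: only runs once the offline private key is reassembled.
export function recoverToken(backupPrivateKeyPem: string, escrowed: Buffer): string {
  return privateDecrypt(
    { key: backupPrivateKeyPem, padding: constants.RSA_PKCS1_OAEP_PADDING },
    escrowed
  ).toString("utf8");
}
```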
> Then, there's issues like idempotency. Imagine a user accessing a control panel where they can set their "Display Name" to whatever they like. With your current design, it looks like you'll be generating new records for each such change. Isn't that wasteful? What happens to the old data?
We intentionally chose to make this immutable, because allowing mutable values opens up an entirely separate can of worms. Distributing the system becomes a much harder problem, for example, because of possible race conditions and dirty reads. Forcing the system to be immutable creates "waste", but it enables scalability. Pick your poison!
For old data, the approach we're using is to "mark" records for deletion and to later run a "garbage collection" job that actually performs the delete. If a customer updated their "Display Name", for example, the flow would be to generate a new token and then mark the old one for deletion. (We use a "write-ahead log" to ensure the process is fault-tolerant.)
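A rough sketch of that ordering (the store interface and record shapes here are hypothetical, just to show where the write-ahead log entry and the GC job fit):

```typescript
// Sketch of the "mark then garbage-collect" flow with a write-ahead log.
// Store names and record shapes are hypothetical, purely to show the ordering.
interface WalEntry {
  op: "mark-for-deletion" | "deleted";
  tokenId: string;
  at: number;
}

interface Stores {
  appendWal(entry: WalEntry): Promise<void>;
  createToken(plaintext: string): Promise<string>;  // returns the new token id
  markDeleted(tokenId: string): Promise<void>;
  listMarkedBefore(cutoff: number): Promise<string[]>;
  deleteCiphertext(tokenId: string): Promise<void>;  // e.g. an S3 DeleteObject call
}

// Update path: issue a fresh token, then mark the old one. The WAL entry is written
// before the mark, so a crash mid-way still leaves a record the GC job can act on.
export async function rotateDisplayName(stores: Stores, oldTokenId: string, newValue: string): Promise<string> {
  const newTokenId = await stores.createToken(newValue);
  await stores.appendWal({ op: "mark-for-deletion", tokenId: oldTokenId, at: Date.now() });
  await stores.markDeleted(oldTokenId);
  return newTokenId;
}

// GC job: actually removes ciphertext for anything marked longer ago than the grace period.
export async function garbageCollect(stores: Stores, gracePeriodMs: number): Promise<void> {
  const cutoff = Date.now() - gracePeriodMs;
  for (const tokenId of await stores.listMarkedBefore(cutoff)) {
    await stores.deleteCiphertext(tokenId);
    await stores.appendWal({ op: "deleted", tokenId, at: Date.now() });
  }
}
```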
> Also, what happens if your clients lose their tokens somehow?
This is another security tradeoff. By removing the tokens from the Tokenizer entirely, you gain security at the expense of additional complexity (or reduced usability). You make it harder for an attacker to steal your data, because they also need to get their hands on the tokens, but you also commit to never losing access to your own tokens if you want to read that data. It becomes very important to take backups of your databases and to ensure those backups can't easily be deleted by an attacker.
This is mitigated with the "token backup vault using public-key" strategy I outlined above. But if you somehow lost those keys, then you'd be in a bad spot. That's the tradeoff of security.
> Does the data stay in your possession forever?
It's self-hosted by default. (Well, technically Amazon S3 stores the data.)
We may eventually have a "SaaS" version of the software, but not right away. When we do get there, we'll likely continue relying on S3 for data storage (and we can easily configure that to be a client-owned S3 bucket).
> I suggest you guys to get a serious security audit done as early as possible (by a reputable company) before proceeding with building this product.
It's on the roadmap to get an independent security review. At this point, we're relying on our shared expertise as security engineers to make design decisions. We spent many months arguing about exactly how to build a secure system before we even started writing code. Of course, we can still make mistakes.
We have a page on "Vulnerabilities and Mitigations" in the current docs[1], but we need to do a better job of explaining this. That's where feedback like yours really helps us -- it's impossible for us to improve otherwise!
> Some of this just reads like nonsense at the moment.
That's on me to get better at. Writing docs is hard!
Thanks again for taking the time to read the docs and for the very in-depth feedback. I hope this comment helps answer some of the questions.
We've spent a ton of time trying to address possible problems with the system. The hardest part for us is conveying that properly in the docs and building trust with users like you. But that's just going to take time and effort. There is no magic bullet except to keep iterating. :)
Cheers!
0: https://csrc.nist.gov/publications/detail/sp/800-108/final
1: https://www.lunasec.io/docs/pages/overview/security/vulns-an...