Hi all, Arash from Dropbox here. We understand the concern that the government could try to guess whether a particular file has been uploaded to Dropbox based on processing times and then request that Dropbox identify a user who has access to that file. However, to seek user content information, the government needs to comply with the provisions of the Electronic Communications Privacy Act by obtaining a warrant supported by probable cause (or in some cases a court order from a judge). Those safeguards protect user privacy. De-duplication does not make users any more vulnerable to intrusive government actions. Today, a government agency could ask any online service to provide the names of all users who have a particular file, whether or not the service employs de-duplication. And in that case, the government would also need to support its request with a warrant or court order. The rules that provide a check against unwarranted government snooping apply to online services equally, regardless of their back-end architecture.
Granted, but the point of the article is that other services which do not have this ability are not vulnerable to said orders; they cannot do what they do not have the ability to do.
Still, this lets the government probe (as an ordinary user) to know that at least one prior user has a contraband file, without any warrant. That could be the thread of probable cause they use to take the next steps.
If you don't mind my asking, what is the percentage savings achieved by de-duplication across all of Dropbox? Some others here have wondered if it was premature optimization.
It wasn't a premature optimization. It was both a better experience for the user (saves bw/reuploads for the user) and was simpler to implement (can keep things in one global bucket) given we didn't want things like renames to trigger reuploads and had to use checksums as a result.
But you could prevent reuploads with per-user de-duplication, while avoiding the privacy issue of cross-user de-duplication.
I could see why this would be more work to implement (you have to key on user+contenthash), but it would still be interesting to know how much Dropbox and its users actually benefit from cross-user de-duplication.
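For anyone who wants to see the difference concretely, here is a minimal sketch of the two keying choices. It is a toy in-memory store, not Dropbox's implementation; the class name `DedupStore` and the `per_user` flag are made up for illustration.

```python
import hashlib


class DedupStore:
    """Toy blob store; `per_user` switches the dedup key (hypothetical design)."""

    def __init__(self, per_user: bool):
        self.per_user = per_user
        self.blobs = {}          # dedup key -> stored bytes
        self.bytes_uploaded = 0  # bytes that actually had to be transferred

    def _key(self, user: str, content_hash: str):
        # Per-user: key on (user, hash). Cross-user: key on the hash alone.
        return (user, content_hash) if self.per_user else content_hash

    def upload(self, user: str, data: bytes) -> bool:
        """Return True if the bytes had to be transferred, False if deduplicated."""
        key = self._key(user, hashlib.sha256(data).hexdigest())
        if key in self.blobs:
            return False
        self.blobs[key] = data
        self.bytes_uploaded += len(data)
        return True
```

With `per_user=True`, a second user uploading the same bytes still transfers them, so upload time reveals nothing about other accounts; with `per_user=False`, the second upload is skipped, which is exactly the signal the article describes.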
I have a hard time understanding this line of argument.
Per-user deduplication will mean, I need not upload the same file twice into my own account? What's the use of this?
I keep some of my 'paid for' software installables backed up in my Dropbox, and they tot up to ~1.5 GB (the Humble Indie Bundle games). When I started the upload however, it took maybe 5 seconds because of cross-user deduplication, and I am super grateful to them for this feature.
I imagine this feature saves users tons of bandwidth, as most of the people I know use Dropbox for backing up important software, rare music and videos.
> I have a hard time understanding this line of argument.
It's not a line of argument. It's a line of inquiry. You've given anecdotal evidence that cross-user deduplication benefits you and people you know, but what about some actual numbers from Dropbox?
Producing actual numbers -- "eg cross-user deduplication saves our users 30% of their upload time and bandwidth, on average" -- seems like a great way for Dropbox to counter this issue.
> I imagine this feature saves users tons of bandwidth, as most of the people I know use Dropbox for backing up important software, rare music and videos.
We don't have to imagine! Let there be numbers!
Also -- "rare" music and videos that everyone's uploading duplicates of? ;)
To the point that it does not make users any more vulnerable to intrusive government actions, as you put it -- in practice, that's not going to be true.
Let's say right now Agent Bob or RIAA lawyer Cindy decides they want to know everyone in the country who has a copy of a file. There's no practical way for them to do that. But now, they can upload target.iso to dropbox, see that it uploads instantly, and all they have to do is get a court order to compel Dropbox to tell them all the other users who have that file.
Every user that has that file is now exposed. Dropbox is now a single point of failure for every user's privacy, and vulnerable to attacks from any legal court order -- and we've seen what can happen with the abuse of prosecutorial and judicial power. Copyright infringement's the obvious case, but when you start to consider how this ability to fish for files could be abused, and how tempting it'll be for people to try to abuse it, it seems more serious than I think you're considering.
Has your organisation considered offering "no de-duplication" to paying subscribers?
Personally I have no qualms with it, but I know some might, and they could be willing to pay for it.
I think most of us, if we were developing Dropbox, would have made the same decisions. De-duping at the cost of complete privacy, even in the face of law, is a sound technical and business decision for a service such as Dropbox.
When I think Dropbox, I think sharing, I think convenience, I don't think backup and security. For backup I need more space, for security I need to use my own private key (not a password that one can change/recover). Neither of these things is offered by Dropbox. And this is the reason why I never confused Dropbox with, say, CrashPlan. One is a way to share and collaborate, the other is a place to send my private key encrypted bits to.
My individual privacy is not compromised by somebody being able to say if a certain file is stored by the entirety of Dropbox user base. The other claim, that, given a court order, Dropbox can be forced to turn over your files or tell the court if you store a certain file _may_ be true, but I don't think Dropbox, the company, has ever promised that level of security.
"What this means, is that from the comfort of their desks, law enforcement agencies or copyright trolls can upload contraband files to Dropbox, watch the amount of bandwidth consumed, and then obtain a court order if the amount of data transferred is smaller than the size of the file."
This, I think is significant, especially if Dropbox is advertising security, privacy and encryption. As the author mentions, the ToS are being updated to reflect the above possibility ("if Dropbox receives a warrant, it has the ability to remove its own encryption to provide data to law enforcement").
On the face of it, the security of DropBox encryption seems comparable to that of a padlocked room - where the key to the padlock is kept hidden in a safe place.
Except the key is actually kept under the doormat, because people keep going in and out and keeping it elsewhere is just too inconvenient.
If you told someone a room like that had military grade security, you would be called a liar. No matter how fancy the padlock. Without knowing the details of DropBox's setup I'm going to refrain from calling them liars, but this all seems pretty fishy to me.
The analogy you mention is not very good because the hole discovered in their security model is related to duplicate content identified via the same hash value. If you have unique content that's different than anything else, even by 1 bit, then you're secure until someone uploads exactly the same content (this is due to the way in which hash functions work); meanwhile, a key under the doormat would imply a totally different security threat model.
The downvotes I'm getting seem to indicate others agree I'm off the mark here. Perhaps I was too inflammatory, perhaps people just don't understand my analogy. Let me try again without the hyperbole. :-)
My point is, DropBox advertise proudly on their website that they use military grade encryption to protect their users' data. However, it has now been independently shown that the keys to this data are in DropBox's direct possession and are in routine, daily use, decrypting one person's data so another can access it (this is what happens when deduping allows you to download something you never actually uploaded yourself).
To me, this implies that their claims of "military grade security" may be unjustified and just yet another example of security theater in the cloud.
Without knowing the exact architecture of their system it's hard to say for sure, of course. But think about what the encryption they claim to use is probably supposed to accomplish. Then think about whether it actually does that if a large proportion of DropBox's servers and employees have access to the decryption keys.
[edit: Amazon store data on S3, so it is in fact important that they encrypt it (even with relatively relaxed key management) as they have no direct control over the infrastructure. I still don't think this meets the bar of "military grade security", but I guess that's marketing for you.]
Modern deduplication breaks files into pieces, and a list of hashes is computed for each piece. So a single bit change wouldn't necessarily throw it off completely.
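A minimal sketch of the per-piece hashing idea, assuming fixed-size blocks and SHA-256 (the block size and function names here are illustrative assumptions, not anything Dropbox has documented):

```python
import hashlib

BLOCK_SIZE = 4  # tiny, for illustration; a real system would use far larger blocks


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of `data` separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes) -> list[int]:
    """Indices of blocks in `new` whose hash differs from the same block in `old`."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or h != old_h[i]]
```

Flipping one byte in place changes only the hash of the block that contains it, so the other blocks still deduplicate. An insertion is a different story: with fixed-size blocks it shifts every later block, which is the problem content-defined chunking is meant to address.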
Bottom line - you should assume that you cannot trust ANY cloud-based service to keep what you upload safe from hackers (if they're determined enough), and especially from governments. I'm not sure why anyone would be under the illusion that this is the case. Even assuming that hosting such a service would be legally possible today (not sure if it is, IANAL), it could be illegal tomorrow and the service may be compelled to hand over any data.
If you want something to stay secret you MUST either:
a. Not put it on the internet/cloud.
b. Encrypt it yourself before uploading. Yes, there are trust issues with commodity encryption software as well, but these may be mitigated somewhat.
This really isn't an issue for legitimate users backing up or syncing original data. If you are paranoid about privacy, then you are probably doing something wrong. If for some reason some other user produces the same file that I have, then what's the big deal? They already arrived at that information themselves, so there is still no compromise of data. Of course the government can seize your data whether it is online or not. Don't sync copyrighted material. How many different ways do they have to tell you it's illegal?
"If you are paranoid about privacy, then you are probably doing something wrong."
That's a pretty slippery slope to presume that a desire for privacy is tantamount to an admission of guilt. Why would you presume that simply because I have information I want to keep secret, I must be doing something wrong?
Conversely, is it ok if we install webcams throughout your house, since apparently you have nothing you'd like to keep private for non-criminal reasons?
In this instance we are talking about data that is already encrypted. Now tell me why anyone would worry about their encrypted data being identifiable unless it's not their data. It's not something I have to worry about, and I suspect the majority of users don't need to worry about it either.
For the same reason lots of folks may not want their Netflix queue or Amazon purchase list put out in the public. Or their library lending list. All of those are lists of legal, publicly-available items, but that doesn't mean folks might not want and expect privacy when it comes to others knowing what content they are consuming / collecting, regardless of its legality.
> Don't sync copyrighted material. How many different ways do they have to tell you it's illegal?
I don't think it is illegal though, is it? If I want to sync a bunch of music files that I purchased because I want to back them up, or access them from multiple computers, but I don't make those publicly available, is that illegal?
Genuinely curious, as I've been thinking about doing exactly this recently.
It's not clear to me why Dropbox would need your keys to de-dupe. He says so in the article, but doesn't say why.
Why not compute the file hash on your local machine before encryption, and check that hash against a master dupe list (hash, dupe_count) of all hashes from all users' pre-encrypted local files?
Secondly, I cannot see how this requires there to be an index of users hashes. Surely one could store hashes with reference count, increment when a user adds, decrement when a user deletes. The user ID isn't necessary for a reference counter.
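To make the reference-count idea concrete, here is a minimal sketch, assuming SHA-256 content hashes; the class `RefCountedIndex` is hypothetical and only models the counter, not how a user's own file listing would be stored.

```python
import hashlib
from collections import Counter


class RefCountedIndex:
    """Hash -> reference count, with no user IDs attached (illustrative only)."""

    def __init__(self):
        self.refs = Counter()

    def add(self, data: bytes) -> bool:
        """Record one more reference; return True if this is the first copy."""
        h = hashlib.sha256(data).hexdigest()
        self.refs[h] += 1
        return self.refs[h] == 1

    def remove(self, data: bytes) -> bool:
        """Drop one reference; return True if the blob can now be deleted."""
        h = hashlib.sha256(data).hexdigest()
        if self.refs[h] <= 0:
            raise KeyError("no references recorded for this content")
        self.refs[h] -= 1
        if self.refs[h] == 0:
            del self.refs[h]
            return True
        return False
```

This only shows that the counter itself can work without user IDs. Whatever lets each user list and fetch their own files is a separate piece of state, and that is where the identifying information would live.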
Not saying Dropbox isn't doing what he says. But he says de-duping proves they can decrypt and proves they have a list of who has the same files. I don't see it from de-dupe alone.
The proof is indeed in the deduplication. If Dropbox can skip the upload process of some large file because another user has already uploaded it, they must also be able to decrypt that file in order to sync it with your other machines.
Or in order for you to download it through the web interface unencrypted.
I suspect that dropbox works the way you think it does, but your argument actually has a flaw. It could work like this:
- A hash computed locally (on the clients with the large unencrypted file) and sent along to be used by dropbox to detect dupes.
- The key used to encrypt the large file is some function of the file, but not of the hash. The important point is that it's not encrypted with a client specific key, but rather a file specific one. Thus if you have the file, you can compute it.
- When a dupe is detected, the server requests that the uploading client send it a copy of the key, PGP-encrypted so that only the other intended clients can decrypt it.
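The first two steps above can be sketched as follows. This is close to what is usually called convergent encryption. Everything here is an assumption for illustration: the domain-separation prefixes, the function names, and especially the toy keystream, which stands in for a real cipher and is NOT a vetted construction. The PGP key-transfer step is not shown.

```python
import hashlib


def dedup_id(plaintext: bytes) -> str:
    """Identifier sent to the server to detect duplicates."""
    return hashlib.sha256(b"dedup-id:" + plaintext).hexdigest()


def file_key(plaintext: bytes) -> bytes:
    """File-specific key: a function of the file, distinct from the dedup id."""
    return hashlib.sha256(b"file-key:" + plaintext).digest()


def _keystream(key: bytes, n: int) -> bytes:
    # Toy stream (SHA-256 in counter mode). Illustration only, not a real cipher.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def encrypt(plaintext: bytes) -> bytes:
    """Encrypt under the file-specific key; same file gives same ciphertext."""
    ks = _keystream(file_key(plaintext), len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))


def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """Anyone holding the file-specific key can recover the plaintext."""
    ks = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, ks))
```

Two clients holding the same file produce the same ciphertext and the same dedup id without the server ever seeing the key. The flip side is that anyone who already has a file can compute both values and test whether it has been stored.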
> - The key used to encrypt the large file is some function of the file, but not of the hash. The important point is that it's not encrypted with a client specific key, but rather a file specific one. Thus if you have the file, you can compute it.
That's actually an extremely interesting idea. I wonder if using some function of the data, f, to determine encryption keys, leaks information about the encrypted data? My armchair guess would be yes. Of course, the leaked amount (well, the slightly non-random distribution caused by the key being a function of the data) might be small for large files/good function f, which would mean it's probably okay?
I've never heard of any research of a cryptosystem that works like that though, so I'd appreciate if anyone could provide some expert input (or maybe we should just email bruce schneier)
Interesting idea.. if one were to build the system as you described, there would still need to be a common link for the PGP encryption of the file-specific key.
What do you reckon this is? IIRC, based on the other Dropbox article this week, the only relevant information stored in local Dropbox preferences is your account ID and a unique API token for the specific machine.
There was no mention of a plain-text password being stored locally.
Well, dropbox allows you to access all of your data with only a password, so the PGP pair would have to be a function of the password. I'm not sure I follow about storing the password locally. One additional thing that you'd need to do that I didn't mention is keep the file encryption key PGP encrypted on the server. That way, if you know the password, you can authorize access to the file to someone else without having to transfer the file back to whatever machine you're on.
This is not cryptographically sound. Practically, it's still perfectly possible to check for the presence of a known file and find all associated accounts; formally, the standard definitions don't require that e.g. AES(K, K) is secure ("circular encryption").
More generally, cross-account deduplication always makes it possible to tell if anyone has uploaded a file before - that's the entire point!
Sure, this doesn't address the original attack under discussion, but this subthread was about whether deduplication implies that the server has access to the unencrypted file.
AES might not be suitable for this purpose, but I don't see why the scheme I proposed is necessarily weak. Do you? Anyway, I agree it's not particularly practical - if you're really concerned about data security, then encrypt before you upload.
As to your scheme - circular encryption probably works fine in practice, but there's no theoretical guarantee.
The server cannot read files, but can determine who has uploaded any one cipher- or plaintext. ("Encrypted or unencrypted file".) I think that's still bad.
I wonder if it's possible to fix that too -- that is, could the server receive and store encrypted data from a bunch of people without ever knowing who contributed which data?
> If Dropbox can skip the upload process of some large file because another user has already uploaded it, they must also be able to decrypt that file in order to sync it with your other machines.
Not necessarily. The client could send an encrypted version with only (plaintext) hashes of the pieces. EDIT: no, I'm wrong.
> Or in order for you to download it through the web interface unencrypted.
This one I will give you, unless they're doing something really weird like client side decryption through Javascript, which I'm not sure is even possible. However, they could in theory not store the key until you actually use the web interface (and you don't have to, so they wouldn't have it), and also not store the key when you do.
> Why not compute the file hash on your local machine before encryption, and check that hash against a master dupe list (hash, dupe_count) of all hashes from all users' pre-encrypted local files?
You could do this, but it would still be possible to determine which users have a copy of a particular file (or a piece of a file).
> Secondly, I cannot see how this requires there to be an index of users hashes. Surely one could store hashes with reference count, increment when a user adds, decrement when a user deletes. The user ID isn't necessary for a reference counter.
On the surface, it looks like this would discredit the first claim that I've just made. I think though that in reality it could be detected. For example, the Government could require them to wait and watch until a user downloads a file (or piece of a file) keyed by the hash of the piece whose owners need to be identified. Given that this is feasible, I don't think that there is any point implementing this measure, and it would help to maintain data integrity by not doing it.
Without knowing the internals of how Dropbox operates, my empirical observations are that they employ block-level deduplication, i.e. when you change bits in the middle of the file, the whole thing doesn't get re-uploaded. Which means they keep pointers and have an algorithm that's similar to LBFS (and Rabin fingerprints)
This means it's theoretically possible for parts of the file to come from different sources, which means contraband files are 'built' from parts of otherwise legal files.
With pure file encryption where the user's password serves as the key you lose...password recovery features, public links for files, shared folders, web access, mobile access (unless you want your phone doing the decryption)
All other syncing services do things a pretty similar way
This is wrong in several ways (as evidenced by other comments in the thread):
> public links for files
easily done by creating an unencrypted copy
> shared folders
you can do this by copying, encrypting with new keys for each shared user, and then sending notification of those keys to the user (via some side channel), and then deleting them on your end after they are accessed (you can re-encrypt with the accessor's keys at this point). You can do a bit better if you leave "half open" asymmetric channels (so you can store encrypted messages/keys only the recipient can decrypt), but that might be overkill.
> web access
you probably will need to use java (or NaCl) to do this, as javascript tends to be too slow to do asymmetric encryption (the needed bignum support just isn't there). If you're willing to wait, it can be done in javascript in ~5 seconds on a desktop PC.
> mobile access
Uh, all modern smart phones can do encryption fast enough (well under a second).
> password recovery features
This is the only salient point. You can kind-of counter it by using the security questions to encrypt the actual encryption key a second time, so that the data can be decrypted if you answer the security questions. But that's obviously less secure.
Rather than avoid deduplication (which is technically sound and benefits everyone), perhaps the solution to this is to make it impossible for DropBox itself to know who owns which files.
E.g. right now I assume a dropbox user owns a list of file ids with some metadata (e.g. that user's name for those files). It follows that if the government decides file XYZ is illegal then anyone with XYZ in their list is in trouble.
The user account could keep track of the total size of all the user's files and use arithmetic to keep it up-to-date, but not actually store the size of individual files except when they are "looked at".
So then the user's password (say) which is not itself stored is used to unlock stuff in the user's file table on a per request basis -- i.e. the actual file ids are only computed as needed. The actual mechanism doesn't need to be terribly secure, it just needs to be deniable. In other words without the user's password we simply cannot unambiguously determine which files are his or hers.
Of course Dropbox knows the keys. If they didn’t, you wouldn’t be able to access your files on so many platforms (web, desktop, iPhone) and you wouldn’t be able to easily share folders with others.
Even though it isn’t spelled out, I’ve always suspected that many actual backup services such as Backblaze don’t know the key if I decide to encrypt my backups.
I don't think the "many clients" argument holds. If done properly, Dropbox would (deterministically) generate the keys from your username and password on the client every time you log in and encrypt/decrypt stuff there.
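A minimal sketch of deterministic client-side key derivation, assuming PBKDF2-HMAC-SHA256 from Python's standard library. The iteration count, the salt construction from the username, and the function name `master_key` are all assumptions for illustration, not anything Dropbox does.

```python
import hashlib

ITERATIONS = 100_000  # assumed work factor; tune for the target hardware


def master_key(username: str, password: str) -> bytes:
    """Derive the same 32-byte key on any client from username + password."""
    # Using the username as salt input keeps derivation deterministic per account.
    salt = hashlib.sha256(b"master-key-salt:" + username.encode("utf-8")).digest()
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )
```

Because nothing but the username and password goes in, every client that logs in can regenerate the key without the server storing it. The obvious cost is that a forgotten password means the key is gone too.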
You're right that sharing folders between different accounts would require having a key shared between the clients and thus stored on the Dropbox servers as well.
However, they could generate the key for the shared folder, give it to you and your buddies and yet store it encrypted using your master key generated from your username and password. Then it would be accessible to you, but Dropbox would not be able to decrypt it without getting your password.
It looks like they're not doing that, but hypothetically, they could. And the examples you cite would not be impossible to deal with.
Of course, this would require much more work, would be tricky to get right and they'd have Thomas Ptacek on their back for using JavaScript crypto in the browser.
You are absolutely right in terms of what is technically possible. My point was more that it’s not really realistic to use crypto like that with a Dropbox-like service.
Why doesn't DropBox just stop doing de-duplication? They must have the money for the storage? The bandwidth savings for users isn't that big of a deal in most cases. I expect that if I have a 2GB file that I'm uploading a 2GB file. I don't cross my fingers that you already have big chunks of it.
This just seems like the type of thing that someone much smarter than I can and will exploit in the future.
Dropbox provides lots of storage for free. Therefore they have a strong incentive to provide this free storage at the lowest possible cost, and deduplication definitely makes sense for themselves if not the users.
This makes me worried about hash collisions as well. The article implies that a file whose hash matches something they already have will never even reach their servers - so presumably I just have to keep my fingers crossed that the file they're synchronising to all my machines is the one I uploaded, and not some other user's completely different file that happens to have the same hash?
SHA256, which Dropbox uses, has around 10^77 possible hashes. That's roughly 100 quadrillion quadrillion quadrillion quadrillion quadrillion possible values. So I wouldn't worry about hash collisions if I were you. If a hash collision happened easily in SHA256, that would be very big news for the security community, and much more serious services than Dropbox would be affected.
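The arithmetic is easy to check. Here is a small sketch that computes the size of a 256-bit hash space and the standard birthday-bound approximation n^2 / 2^(bits+1) for the chance that any two of n distinct files collide, assuming the hash behaves like an ideal random function (the function name is made up for illustration).

```python
from math import log10

HASH_BITS = 256
space = 2 ** HASH_BITS  # about 1.16e77 possible values


def collision_probability(n_files: int, bits: int = HASH_BITS) -> float:
    """Birthday-bound approximation; only meaningful while the result is tiny."""
    return n_files ** 2 / 2 ** (bits + 1)


print(f"log10 of hash space: {log10(space):.2f}")
print(f"P(collision) for 10^12 files: {collision_probability(10 ** 12):.1e}")
```

Even with a trillion stored files, the accidental-collision probability under this model is far below 1e-50. That says nothing about deliberately constructed collisions, which depend on the hash function not being broken.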
This is a common sentiment, but not really sensible. If you want to store (for example) another 64 bits worth of information, you would always be better off with 64 more bits of some strong hash than 64 bits of filesize.
...and if you took the same number of bits you were using to store the filesize, and instead stored that many bits of some independent secure hash, it'd be harder still.
Of course every extra bit that has to be matched makes collisions 'harder' but length bits are much weaker than other options, except insofar as they may already be available for other reasons.
And that was written before the md5 collisions were discovered. And no collision has yet been discovered for md5 for files of the same length, they are all extension attacks...
Absolutely false. Some MD5 collision-generators specifically find pairs of equal-length inputs with the same hash. See for example hit #2 for [MD5 collisions]:
'Extension attacks' are something else, which let you turn one collision into more, or create valid hashes for combinations of unknown text plus a chosen extension – not find an initial collision. See:
The 'length extension' property can be helpful, once you find a collision based on 'random' nonsense, in extending that into two documents that are each meaningful-but-different and still colliding, as was done in this 2005 MD5 collision demonstration:
But still possible. Whenever the number of bits in a file is more than the number of bits in a hash, there are collisions. They could use the Chinese Remainder Theorem, but that would only go so far (maybe far enough to remove substantial doubt? The link below seems to suggest so.)
At first, that was what I thought the flaw would be -- providing a file that has a hash that collides with another file, gets you that file.
But it seems to me you would need to know the exact contents of the file in question to get that to happen, making the point moot. Perhaps I'm wrong on that.
> What are the odds of a hash collision + identical filesize?
If implemented correctly, the additional constraint on filesize being the same is irrelevant. Given one particular hash value, the probability that a second file hashes to the same value is 1/(range of hash function) if the hash function is modeled as an ideal hash function.
Given that they do chunks of updates, rather than the whole file, presumably they check multiple hashes to ensure this doesn't happen. I could be wrong, though.
Early optimization is the root of all evil, so I understand that an up-and-coming company might do this, but Dropbox has the traction and userbase to make this a very relevant concern.
Popularity is also proportional to chance of being targeted by hackers and approached by government or corporation representing intellectual property owners.
If you knew a target file's hash, it might be possible to modify the dropbox client to report that file as added, then dropbox would download that file onto your computer. Of course, 10^77 possible hashes makes it unlikely.
Could be worse. Last year when I tried ZumoDrive (a similar service), I noticed it marks the web-browser login cookies as safe for HTTP, and defaults to open session pages via HTTP. All it takes is checking your ZumoDrive once from an unsecured WiFi and your account may be compromised.
At least Dropbox gets the endpoint-to-server encryption right.
It's not obvious to me this is a price based decision for Dropbox (although the benefit there is obvious).
Arguably the best feature of Dropbox for me is binary diffs. If you encrypt the Dropbox this goes out the window, or at least becomes significantly harder to pull off. Am I wrong?
tl;dr
Dropbox use their own encryption keys to encrypt your data rather than encrypting each user's data using a user provided key. This helps them dedup files and save space/money. This implies that a court could ask to analyse your data. Dropbox will update their privacy policy to say this clearly.
I've been traveling for most of the last six months, mostly dependent on slow & unreliable hotel wifi. Dropbox's implementation has saved me a ton of time backing up files that would have taken forever to upload in their entirety.
Here's a startup idea: A background service that runs on a PC/IPad/Phone, checks for new media files (pictures/mp3s/videos) and automatically re-encodes them, such that the quality, etc. is preserved but the file hash changes and cloud services can no longer deduplicate it.
It will be quite valuable for users of services like Dropbox, Amazon Cloud Player, etc.
Here's another idea: a service that identifies hash values for popular-but-copyright-encumbered files, which you can feed into your (patched copy of) Dropbox client and pretend you're about to upload it. Bam, it appears in your Dropbox account!
And sneakier still, a remote hosted bit torrent client that automates that process for you in a way that can provably show you never downloaded the copyrighted file, you just had a number on your drive, which if interpreted as a SHA256 hash key happened to identify a duplicate of the copyright file on Dropbox... (I'd be spectacularly impressed if someone managed to make _that_ trick fly in court in a precedent setting manner!)
That's wonderfully devious. And probably quite straightforward to implement.
I hope it doesn't happen, because that'd be a bit of a nightmare for Dropbox. They could blacklist known pirate rips of movies and the like, but what do you do when you receive a takedown notice for DRM-free content that some users are legitimately storing, but Dropbox is now (accidentally) illegally distributing? Could be the end of global de-duping.
If this really became a problem for Dropbox, one response would be to have the server issue random challenges to clients that purport to have the file.
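A minimal sketch of what such a challenge could look like, assuming the server can read its stored copy of the file: it sends a fresh random nonce, and the client must return an HMAC of the whole file keyed by that nonce. The function names are hypothetical, and this is one possible construction rather than a description of anything Dropbox does.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Server side: a fresh random nonce for each claimed upload."""
    return secrets.token_bytes(32)


def prove_possession(nonce: bytes, data: bytes) -> bytes:
    """Client side: requires the actual file bytes, not just a known hash."""
    return hmac.new(nonce, data, hashlib.sha256).digest()


def verify(nonce: bytes, stored: bytes, response: bytes) -> bool:
    """Server side: recompute over the stored copy and compare in constant time."""
    expected = hmac.new(nonce, stored, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A patched client that only knows the file's SHA-256 cannot answer, because the response depends on both the unpredictable nonce and the full contents. The honest client still avoids the upload, at the cost of one extra pass over the file.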