One of the most incredible parts is that they've already run feature detection on all 100M images/videos and extracted 50TB of:
"SIFT, GIST, Auto Color Correlogram, Gabor Features, CEDD, Color Layout, Edge Histogram, FCTH, Fuzzy Opponent Histogram, Joint Histogram, Kaldi Features, MFCC, SACC_Pitch, and Tonality"
The good part about this for researchers is not only that this saves dozens of CPU-years of computation (back of the envelope, it would take 15 years for my laptop to extract those SIFT features alone), but that any differences in learning/recognition performance on the dataset can be attributed to the algorithms in question, uncomplicated by which researcher engineered the best features for the dataset. On the other hand, it's a challenging dataset to work with because you can't just download it and process it locally as has been traditionally done. I'll be interested to see how many take advantage of it.
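For what it's worth, that back-of-envelope figure checks out. A rough sanity check, assuming a (hypothetical) ~5 seconds of single-threaded SIFT extraction per image:

```python
# Back-of-envelope check of the "15 years" claim. The per-image cost
# is an assumed figure, plausible for a laptop on full-size photos.
SECONDS_PER_IMAGE = 5
NUM_IMAGES = 100_000_000
SECONDS_PER_YEAR = 3600 * 24 * 365

years = NUM_IMAGES * SECONDS_PER_IMAGE / SECONDS_PER_YEAR
print(f"{years:.1f} laptop-years")  # just under 16
```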
You wouldn't download the lot, methinks. Not unless you have a big ole cluster to handle it.
The index is only 12GB and contains enough metadata that you can whittle it down to a subset, pull the comments and filter based on those, and ultimately produce a list of photo IDs to grab from the collection. That's a couple of days' work for a grad student; it's not even Big Data.
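A sketch of that whittling-down step, assuming a tab-separated index where (hypothetically) the first column is the photo ID and the second holds user tags — the real column layout is in the dataset documentation:

```python
import csv

# Filter a metadata index down to a list of photo IDs.
# Column layout here is hypothetical: photo ID first, tags second.
def select_photo_ids(index_lines, keyword):
    ids = []
    for row in csv.reader(index_lines, delimiter="\t"):
        photo_id, tags = row[0], row[1]
        if keyword in tags:        # crude keyword filter on the tags
            ids.append(photo_id)
    return ids
```

Stream the 12GB file through this once and you have your download list; no cluster required.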
Why the heck would you do MFCC on images? Mel filters try to replicate the perception of human ears on audio. This looks like buzzword soup to me (SACC_Pitch, Tonality? What the heck?!? These seem like audio features - where are the formulas!).
I also don't know about your other conclusion; there is no reason you couldn't download this dataset given enough time/bandwidth/storage to process locally. Most people who will work on this could reasonably store a large chunk locally, if not all (~10 TB). This also assumes that you can't reduce/compress the info any further than what flickr provides and that you require access to the entire dataset - if any of the images are 1024x1024 or larger, most feature extractions do not need that kind of fidelity. Heck, you could probably make use of grayscale only to reduce the size by a factor of 3 - ~17 TB is feasible (though still pretty insane) to store locally.
ImageNet (~1.2 TB) only took me 45 days on a residential connection (<20 MB), and I would assume that this dataset would be downloaded by entities with much higher download b/w. I would also assume that many algorithms, like the type that attack CIFAR10 et al., would also be willing to reduce the dimensionality and recompress, further reducing storage overhead. How big is each image?
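The arithmetic behind such estimates is simple enough to write down (the rates and sizes here are illustrative):

```python
# Download-time arithmetic: dataset size over sustained link rate.
def days_to_download(size_tb, rate_mbit_per_s):
    total_bytes = size_tb * 1e12               # decimal TB
    bytes_per_s = rate_mbit_per_s * 1e6 / 8    # Mbit/s -> bytes/s
    return total_bytes / bytes_per_s / 86400   # seconds -> days

# ~1.2 TB at a sustained ~2.5 Mbit/s comes out to roughly 45 days,
# consistent with the ImageNet anecdote above.
```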
Also, where are the hyperparameters they used to calculate all of these features? Extracted features aren't really that useful without context/reproducibility.
All that said, I think most of these features are decent and the dataset is amazing, but I would rather see them release the raw dataset and its PCA/ZCA/other transform - maybe Gabor filtered etc. as well. Lower-level preprocessing is more useful for doing representation learning IMO - these higher-level features are not that useful for ML algorithm developers. SIFT is patented, for heaven's sake! How are we supposed to build algorithms on top of things like that...
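For reference, the kind of lower-level transform being asked for is only a few lines. A minimal ZCA-whitening sketch, assuming rows of X are flattened images and eps is a small regularizer:

```python
import numpy as np

# ZCA whitening: decorrelate features while staying close to the
# original pixel space (unlike plain PCA whitening, which rotates).
def zca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)            # center each feature
    cov = X.T @ X / X.shape[0]        # empirical covariance
    U, S, _ = np.linalg.svd(cov)      # eigendecomposition of cov
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W                      # whitened data
```

Releasing the raw pixels plus a transform like this would serve representation-learning work far better than hand-picked descriptors.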
I am excited about the dataset but feel that there could be more done to truly enable researchers. This feels like a "look how much data we have/look how awesome and used flickr is" thing to me.
Frankly, both papers you linked give no justification for using the mel filterbank instead of any other filtering/frequency reduction before doing the IDCT to get the cepstrum. This is what I mean! Why take a filterbank designed for audio and apply it to images except to use the buzzwords? In any case, at least I know it is used by researchers (even if I don't agree with the why). Thanks!
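To make the objection concrete: the skeleton of the cepstral pipeline is power spectrum, then some filterbank reduction, then log, then DCT — and the mel filterbank is just one choice for the middle step, which is exactly the choice that goes unjustified. A sketch (the filterbank is passed in, so any reduction plugs in; this is illustrative, not a reference implementation):

```python
import numpy as np
from scipy.fft import dct

# Generic cepstrum: power spectrum -> filterbank -> log -> DCT.
# Pass a mel filterbank and you get MFCCs; pass any other filterbank
# and the rest of the pipeline is unchanged.
def cepstrum(frame, filterbank, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2    # power spectrum
    energies = filterbank @ spectrum              # filterbank reduction
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_coeffs]
```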
SIFT is free to use for research purposes. Tons of people, papers and algorithms in computer vision use SIFT. It's a very neat detector and the de-facto standard which is no doubt why it's in there.
It is good to provide for research purposes - however, why not provide SURF as well if this (research comparison) is the case? SURF is equally popular and arguably more effective. Way more popular than "Tonality" and the other features at the tail end of the list... I have my hater hat on here for a moment.
All the better then that Yahoo acquired a licence somehow, processed the data and made the results available. You don't need a license to compare keypoints I hope?
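And comparing precomputed descriptors really is just linear algebra, with no SIFT implementation (patented or otherwise) required. A sketch of nearest-neighbour matching with Lowe's ratio test, assuming the usual 128-dimensional descriptors, one per row:

```python
import numpy as np

# Match descriptors in desc_a against desc_b with Lowe's ratio test:
# accept a match only when the nearest neighbour is clearly closer
# than the second nearest, which filters out ambiguous matches.
def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]       # two nearest neighbours
        if dists[j] < ratio * dists[k]:    # unambiguous match only
            matches.append((i, int(j)))
    return matches
```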
That is the point - if SIFT is patented (though it is free for research, it is not clear that Yahoo got commercial licensing), generating its keypoints is prohibited, even if it is only to compare them to your own! Just because they don't prosecute doesn't mean it is OK. I guess in this case, Yahoo might have taken the heat for you. Hopefully they got a proper license agreement, as that would be great for everyone.
It seems like Yahoo is a little bit worried about possible exploitation. From the Terms of Use:
2.3. You may derive and publish summaries, analyses and interpretations of the Data, but only in a manner where it is impossible to reconstruct the Data from the publication. Small excerpts of the Data may be displayed to others or published in a scientific or technical context, solely for the purpose of describing your research and related issues and not for any commercial or anti-competitive purpose. Unless Yahoo! expressly requests no attribution, all publications resulting from research carried out using the Data must display an attribution to Yahoo!. This attribution must reference "Yahoo! Webscope,” the web address http://webscope.sandbox.yahoo.com, and the name of the specific dataset used, including version number, if applicable. This attribution should preferably appear among the bibliographic citations in the publication. If Yahoo! expressly requests no attribution, you agree not to mention Yahoo! in connection with the Data. Yahoo! invites you to provide a copy your publication to Yahoo!.
This[0] seems fairly restrictive, considering that I can just crawl flickr and get all that data and more, were I so inclined. Also kinda interesting, in this passage and the rest of the TOU: they repeatedly use curly quotation marks (”) interchangeably with straight ones ("), suggesting that nobody at Yahoo has proofread their own live TOU. Still, the dataset seems really cool.
[0] ...and other parts of the agreement, but I don't want to spoil it for you, nor post its entirety as a comment.
If you can crawl flickr and get 50TB of data, do it... it is more than a "were I so inclined" situation. I have had a very hard time crawling and indexing large datasets like this - companies tend to protect their data!
Because of the inevitable photolocationfinder dot com that will immediately come into existence if they ever succeed.
Then everyone who hates someone or likes someone way too much, will only need a couple of photos from twitter or wherever, to know where to look for that person in real life.
It's not likely that a computer vision system is going to be that much better than humans at the task. Maybe it will be able to guess your latitude by the color of the sky or something crazy like that, but not give you an exact address.
It is not unreasonable to believe that an algorithm could key in on architectural peculiarities of a given region. On top of that, if there are any people in the photo who share their address on facebook, twitter, foursquare et al., it is game over.
Roughly map the topology of the ground in the background of a photo, then use clever algorithms to match that topology with real world maps of terrain.
Automatically match iconic buildings to those in a database.
Have a list of features that differ from nation to nation or region to region (electrical outlets, road signs, road markings, fire hydrant colors) and match them automatically.
Have a list of different geology. Light coloured rock in this region, reddish soil in that region, very flat terrain, very undulating terrain.
This is what comes to mind in one minute of thought. None of these would be effective on their own, but combined you could probably get surprisingly close.
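A toy sketch of how those weak cues could be fused: if each detector reports, per candidate region, how well the photo fits that region, summing log-scores (naive-Bayes style) lets individually weak cues concentrate mass quickly. Cue names and numbers are invented for illustration:

```python
import math

# Fuse several weak location cues. Each cue is a dict mapping
# region -> P(observed evidence | region); missing regions get a
# small floor probability. Summing logs multiplies the evidence.
def fuse_cues(cue_scores):
    regions = set().union(*cue_scores)
    totals = {r: sum(math.log(c.get(r, 1e-6)) for c in cue_scores)
              for r in regions}
    return max(totals, key=totals.get)   # most likely region
```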
If you don't want photos online, don't take photos with a smartphone. Pretty sure digital cameras don't autoupload (at least the cheap ones I have had). That said, we have a signal and noise situation - if everyone's photos are online it becomes difficult to target any one person without a very good reason.
"From the old world of unprocessed rolls of C-41 sitting in a fridge 20 years ago"
Hey I still do that! :(
I wonder if my (or anyone's) film photos on Flickr are completely useless metadata-wise. Because they are all scanned so they just say "NORITSU KOKI EZ Controller". There seems to be a large portion of people (on Flickr) shooting film still but I wonder if it's only a small percentage overall.