One of the most incredible parts is that they've already run feature detection on all 100M images/videos and extracted 50TB of:
"SIFT, GIST, Auto Color Correlogram, Gabor Features, CEDD, Color Layout, Edge Histogram, FCTH, Fuzzy Opponent Histogram, Joint Histogram, Kaldi Features, MFCC, SACC_Pitch, and Tonality"
The good part about this for researchers is not only that this saves dozens of CPU-years of computation (back of the envelope, it would take 15 years for my laptop to extract those SIFT features alone), but that any differences in learning/recognition performance on the dataset can be attributed to the algorithms in question, uncomplicated by which researcher engineered the best features for the dataset. On the other hand, it's a challenging dataset to work with because you can't just download it and process it locally as has been traditionally done. I'll be interested to see how many take advantage of it.
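For what it's worth, that back-of-envelope figure checks out. A rough sanity check, assuming a (hypothetical) ~5 seconds of single-threaded SIFT extraction per image:

```python
# Back-of-envelope check of the "15 years" claim. The per-image cost
# is an assumed figure, plausible for a laptop on full-size photos.
SECONDS_PER_IMAGE = 5
NUM_IMAGES = 100_000_000
SECONDS_PER_YEAR = 3600 * 24 * 365

years = NUM_IMAGES * SECONDS_PER_IMAGE / SECONDS_PER_YEAR
print(f"{years:.1f} laptop-years")  # just under 16
```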
You wouldn't download the lot, methinks. Not unless you have a big ole cluster to handle it.
The index is only 12GB and contains enough metadata that you can whittle it down to a subset, pull the comments and filter based on those, and ultimately produce a list of photo IDs to grab from the collection. That's a couple of days' work for a grad student; it's not even Big Data.
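A sketch of that whittling-down step, assuming a tab-separated index where (hypothetically) the first column is the photo ID and the second holds user tags — the real column layout is in the dataset documentation:

```python
import csv

# Filter a metadata index down to a list of photo IDs.
# Column layout here is hypothetical: photo ID first, tags second.
def select_photo_ids(index_lines, keyword):
    ids = []
    for row in csv.reader(index_lines, delimiter="\t"):
        photo_id, tags = row[0], row[1]
        if keyword in tags:        # crude keyword filter on the tags
            ids.append(photo_id)
    return ids
```

Stream the 12GB file through this once and you have your download list; no cluster required.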
Why the heck would you do MFCC on images? Mel filters try to replicate the perception of human ears on audio. This looks like buzzword soup to me (SACC_Pitch, Tonality? What the heck?!? These seem like audio features - where are the formulas!).
I also don't know about your other conclusion; there is no reason you couldn't download this dataset given enough time/bandwidth/storage to process locally. Most people who will work on this could reasonably store a large chunk locally, if not all (~10 TB). This also assumes that you can't reduce/compress the info any further than what flickr provides and that you require access to the entire dataset - if any of the images are 1024x1024 or larger, most feature extractions do not need that kind of fidelity. Heck, you could probably make use of grayscale only to reduce the size by a factor of 3 - ~17 TB is feasible (though still pretty insane) to store locally.
ImageNet (~1.2 TB) only took me 45 days on a residential connection (<20 MB), and I would assume that this dataset would be downloaded by entities with much higher download b/w. I would also assume that many algorithms, like the type that attack CIFAR10 et al., would also be willing to reduce the dimensionality and recompress, further reducing storage overhead. How big is each image?
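The arithmetic behind such estimates is simple enough to write down (the rates and sizes here are illustrative):

```python
# Download-time arithmetic: dataset size over sustained link rate.
def days_to_download(size_tb, rate_mbit_per_s):
    total_bytes = size_tb * 1e12               # decimal TB
    bytes_per_s = rate_mbit_per_s * 1e6 / 8    # Mbit/s -> bytes/s
    return total_bytes / bytes_per_s / 86400   # seconds -> days

# ~1.2 TB at a sustained ~2.5 Mbit/s comes out to roughly 45 days,
# consistent with the ImageNet anecdote above.
```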
Also, where are the hyperparameters they used to calculate all of these features? Extracted features aren't really that useful without context/reproducibility.
All that said, I think most of these features are decent and the dataset is amazing, but I would rather see them release the raw dataset and its PCA/ZCA/other transform - maybe Gabor filtered etc. as well. Lower-level preprocessing is more useful for doing representation learning IMO - these higher-level features are not that useful for ML algorithm developers. SIFT is patented, for heaven's sake! How are we supposed to build algorithms on top of things like that...
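For reference, the kind of lower-level transform being asked for is only a few lines. A minimal ZCA-whitening sketch, assuming rows of X are flattened images and eps is a small regularizer:

```python
import numpy as np

# ZCA whitening: decorrelate features while staying close to the
# original pixel space (unlike plain PCA whitening, which rotates).
def zca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)            # center each feature
    cov = X.T @ X / X.shape[0]        # empirical covariance
    U, S, _ = np.linalg.svd(cov)      # eigendecomposition of cov
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W                      # whitened data
```

Releasing the raw pixels plus a transform like this would serve representation-learning work far better than hand-picked descriptors.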
I am excited about the dataset but feel that there could be more done to truly enable researchers. This feels like a "look how much data we have/look how awesome and used flickr is" thing to me.
Frankly, both papers you linked give no justification for using the mel filterbank instead of any other filtering/frequency reduction before doing the IDCT to get the cepstrum. This is what I mean! Why take a filterbank designed for audio and apply it to images except to use the buzzwords? In any case, at least I know it is used by researchers (even if I don't agree with the why). Thanks!
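To make the objection concrete: the skeleton of the cepstral pipeline is power spectrum, then some filterbank reduction, then log, then DCT — and the mel filterbank is just one choice for the middle step, which is exactly the choice that goes unjustified. A sketch (the filterbank is passed in, so any reduction plugs in; this is illustrative, not a reference implementation):

```python
import numpy as np
from scipy.fft import dct

# Generic cepstrum: power spectrum -> filterbank -> log -> DCT.
# Pass a mel filterbank and you get MFCCs; pass any other filterbank
# and the rest of the pipeline is unchanged.
def cepstrum(frame, filterbank, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2    # power spectrum
    energies = filterbank @ spectrum              # filterbank reduction
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_coeffs]
```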
SIFT is free to use for research purposes. Tons of people, papers and algorithms in computer vision use SIFT. It's a very neat detector and the de-facto standard which is no doubt why it's in there.
It is good to provide for research purposes - however, why not provide SURF as well if this (research comparison) is the case? SURF is equally popular and arguably more effective. Way more popular than "Tonality" and the other features at the tail end of the list... I have my hater hat on here for a moment.
All the better then that Yahoo acquired a licence somehow, processed the data and made the results available. You don't need a license to compare keypoints I hope?
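And comparing precomputed descriptors really is just linear algebra, with no SIFT implementation (patented or otherwise) required. A sketch of nearest-neighbour matching with Lowe's ratio test, assuming the usual 128-dimensional descriptors, one per row:

```python
import numpy as np

# Match descriptors in desc_a against desc_b with Lowe's ratio test:
# accept a match only when the nearest neighbour is clearly closer
# than the second nearest, which filters out ambiguous matches.
def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]       # two nearest neighbours
        if dists[j] < ratio * dists[k]:    # unambiguous match only
            matches.append((i, int(j)))
    return matches
```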
That is the point - if SIFT is patented (though it is free for research, it is not clear that Yahoo got commercial licensing), generating its keypoints is prohibited, even if it is only to compare them to your own! Just because they don't prosecute doesn't mean it is OK. I guess in this case, Yahoo might have taken the heat for you. Hopefully they got a proper license agreement, as that would be great for everyone.
It seems like Yahoo is a little bit worried about possible exploitation. From the Terms of Use:
2.3. You may derive and publish summaries, analyses and interpretations of the Data, but only in a manner where it is impossible to reconstruct the Data from the publication. Small excerpts of the Data may be displayed to others or published in a scientific or technical context, solely for the purpose of describing your research and related issues and not for any commercial or anti-competitive purpose. Unless Yahoo! expressly requests no attribution, all publications resulting from research carried out using the Data must display an attribution to Yahoo!. This attribution must reference "Yahoo! Webscope,” the web address http://webscope.sandbox.yahoo.com, and the name of the specific dataset used, including version number, if applicable. This attribution should preferably appear among the bibliographic citations in the publication. If Yahoo! expressly requests no attribution, you agree not to mention Yahoo! in connection with the Data. Yahoo! invites you to provide a copy your publication to Yahoo!.
This[0] seems fairly restrictive, considering that I can just crawl flickr and get all that data and more, were I so inclined. Also kinda interesting, in this passage and the rest of the TOU: they repeatedly use curly quotation marks (”) interchangeably with straight ones ("), suggesting that nobody at Yahoo has proofread their own live TOU. Still, the dataset seems really cool.
[0] ...and other parts of the agreement, but I don't want to spoil it for you, nor post its entirety as a comment.
If you can crawl flickr and get 50TB of data, do it... it is more than a "were I so inclined" situation. I have had a very hard time crawling and indexing large datasets like this - companies tend to protect their data!
Because of the inevitable photolocationfinder dot com that will immediately come into existence if they ever succeed.
Then everyone who hates someone or likes someone way too much, will only need a couple of photos from twitter or wherever, to know where to look for that person in real life.
It's not likely that a computer vision system is going to be that much better than humans at the task. Maybe it will be able to guess your latitude by the color of the sky or something crazy like that, but not give you an exact address.
It is not unreasonable to believe that an algorithm could key in on architectural peculiarities of a given region. On top of that, if there are any people in the photo who share their address on facebook, twitter, foursquare et al., it is game over.
Roughly map the topology of the ground in the background of a photo, then use clever algorithms to match that topology with real world maps of terrain.
Automatically match iconic buildings to those in a database.
Have a list of features that differ from nation to nation or region to region (electrical outlets, road signs, road markings, fire hydrant colors) and match them automatically.
Have a list of different geology. Light coloured rock in this region, reddish soil in that region, very flat terrain, very undulating terrain.
This is what comes to mind in one minute of thought. None of these would be effective on their own, but combined you could probably get surprisingly close.
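A toy sketch of how those weak cues could be fused: if each detector reports, per candidate region, how well the photo fits that region, summing log-scores (naive-Bayes style) lets individually weak cues concentrate mass quickly. Cue names and numbers are invented for illustration:

```python
import math

# Fuse several weak location cues. Each cue is a dict mapping
# region -> P(observed evidence | region); missing regions get a
# small floor probability. Summing logs multiplies the evidence.
def fuse_cues(cue_scores):
    regions = set().union(*cue_scores)
    totals = {r: sum(math.log(c.get(r, 1e-6)) for c in cue_scores)
              for r in regions}
    return max(totals, key=totals.get)   # most likely region
```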
If you don't want photos online, don't take photos with a smartphone. Pretty sure digital cameras don't autoupload (at least the cheap ones I have had). That said, we have a signal and noise situation - if everyone's photos are online it becomes difficult to target any one person without a very good reason.
"From the old world of unprocessed rolls of C-41 sitting in a fridge 20 years ago"
Hey I still do that! :(
I wonder if my (or anyone's) film photos on Flickr are completely useless metadata-wise. Because they are all scanned so they just say "NORITSU KOKI EZ Controller". There seems to be a large portion of people (on Flickr) shooting film still but I wonder if it's only a small percentage overall.