The YFCC100M is the largest publicly and freely usable multimedia collection, containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses.
The original images and videos indexed in the YFCC100M may be found in the data/ directory on the multimedia-commons S3 data store on AWS Public Data Sets. This directory has 99,171,688 image files and 787,479 video files. The videos add up to around 8,081 hours, with an average video length of 37s and a median length of 28s.
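As a quick sanity check, the average video length follows from the total duration and the number of video files. The sketch below uses only the figures quoted in this paragraph. Because the total of 8,081 hours is itself rounded, the result is approximate.

```python
# Figures quoted in the paragraph above.
NUM_IMAGES = 99_171_688
NUM_VIDEOS = 787_479
TOTAL_VIDEO_HOURS = 8_081  # approximate, as stated in the text


def average_video_seconds(total_hours, num_videos):
    """Return the mean video length in seconds."""
    return total_hours * 3600 / num_videos


total_files = NUM_IMAGES + NUM_VIDEOS
avg_seconds = average_video_seconds(TOTAL_VIDEO_HOURS, NUM_VIDEOS)

print(f"files in the data/ directory: {total_files:,}")
print(f"average video length: {avg_seconds:.1f}s")
```

The computed average rounds to the 37s stated above. The median cannot be derived from these totals, since it depends on the distribution of individual video lengths.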
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already; once logged in, you will find it straightforward to submit the request for the YFCC100M. Webscope will ask what you plan to do with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty member at an accredited university, so you will be automatically approved. An email with access instructions should reach you within the hour. These emails are occasionally marked as spam, so check your spam folder first, and if you still have received nothing, email Webscope directly.
Accessing the YFCC100M: The dataset is hosted by Amazon in an S3 data bucket. It is around 15 GB in size and is stored as a single Bzip2-compressed file named yfcc100m_dataset.bz2. At the moment you will need an AWS account to download the file from the bucket, although Webscope is working on a solution so you can get the dataset without one. In particular, a credit card is required to create an AWS account (even though it will never be charged when you just download data from a bucket), and in many countries it is not common to have such a card.
Using the YFCC100M: Once you have downloaded the dataset, you can directly read its contents using command line tools such as bzcat. In many situations, however, it would be much easier and faster to launch an EC2 instance and install a Hadoop cluster to efficiently access and process the dataset directly from the S3 bucket. To avoid the hassle of installing and maintaining Hadoop yourself, you can also launch an EMR cluster. Naturally, using an EC2 instance or an EMR cluster will not be free, but you will get convenience in return.
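The bzcat approach can also be reproduced in a script. The sketch below streams the compressed file line by line with Python's standard bz2 module, so the archive never has to be decompressed to disk. It assumes that each line holds one record with tab-separated fields. The field layout is not described here, so the code only splits on tabs and does not name any column. The helper names iter_records and preview are hypothetical.

```python
import bz2
from itertools import islice


def iter_records(path, delimiter="\t"):
    """Yield each line of a bzip2-compressed text file as a list of fields.

    Assumption: one record per line, fields separated by `delimiter`.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split(delimiter)


def preview(path, count=5):
    """Return the first `count` records without reading the whole file."""
    return list(islice(iter_records(path), count))
```

Assuming the file was downloaded to the working directory, calling preview("yfcc100m_dataset.bz2") would return the first five records. Because iter_records is a generator, memory use stays small even though the decompressed data is far larger than the 15 GB archive.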
License: Use of the dataset is subject to the relevant Webscope License Agreement, which you need to agree to before being granted access to the dataset.
Related Articles and Documents
YFCC100M: The New Data in Multimedia Research; 2016; Thomee, B. et al; Communications of the ACM
A Poodle or a Dog? Evaluating Automatic Image Annotation Using Human Descriptions at Different Levels of Granularity; 2014; Wang, J. et al; Department of Computer Science, University of Sheffield, UK, and Centre for Vision, Speech and Signal Processing, University of Surrey, UK
Deep Features YFCC100M-HNfc6 – This site publishes deep features extracted from various relevant datasets, among them the YFCC100M.
The deep features were extracted using the Caffe framework. In particular, they are the activations of the neurons in the fc6 layer of the Hybrid-CNN, whose model and weights are publicly available in the Caffe Model Zoo. The Hybrid-CNN was trained on 1,183 categories (205 scene categories from the Places Database and 978 object categories from the training data of ILSVRC2012 (ImageNet)) with ~3.6 million images. The architecture is the same as the Caffe reference network. More information can be found on the Places-CNN model web page at MIT.