Dataset for Cross-Media Retrieval


Welcome to PKU XMediaNet, a large-scale dataset of texts, images, videos, audios and 3D models designed for cross-media retrieval. PKU XMediaNet dataset is the first large-scale cross-media dataset which consists of five media types, with more than 100,000 media instances. Because "X" looks like a cross line, here PKU XMediaNet stands for cross-media retrieval among all the different media types. We have also released PKU XMedia dataset, which is the first cross-media dataset with five media types, containing 12,000 media instances.
The details of PKU XMediaNet dataset and PKU XMedia dataset are shown as below.
Click benchmark to see the benchmark we build up for cross-media retrieval, or click the navigation menu on the left side.

PKU XMediaNet:

Now we have constructed a new dataset named PKU XMediaNet, which consists of 5 media types (text, image, video, audio and 3D model). We select 200 category nodes from WordNet to construct this dataset to ensure the semantic hierarchy structure. These categories can be divided into two main parts: animals and artifacts. There are 48 kinds of animal such as elephant, owl, bee and frog as well as 152 kinds of artifact such as violin, airplane, shotgun, and camera. The total number of media instances will exceed 100,000, and here is some information on media instances in this new dataset:

The dataset is randomly split into a training set of 81,600 media objects and a test set of 20,400 media objects. The random split is performed on each media respectively, the ratio of training set and test set is 4:1. We summarize the split of each media type in Table 1.

Table 1: Split of each media type of PKU XMediaNet dataset

Media Text Image Video Audio 3D
Training 32,000 32,000 8,000 8,000 1,600
Testing 8,000 8,000 2,000 2,000 400

We randomly select several examples of different media types of PKU XMediaNet dataset, which are shown in Figure 1.

All technical papers, documents and reports which use the PKU XMediaNet dataset must cite the corresponding papers as follows:

Visio-PubMed Example 2

Figure 1: Examples of PKU XMediaNet dataset.

PKU XMedia:

PKU XMedia dataset consists of 5,000 texts, 5,000 images, 500 videos, 1,000 audio clips and 500 3D models, all of which are crawled from the famous websites on the Internet as following:

Each media instance has its corresponding category label. The dataset is evenly split into 20 categories, which are insect, bird, wind, dog, tiger, explosion, elephant, flute, airplane, drum, train, laughter, wolf, thunder, horse, autobike, gun, stream, piano and violin. So there are 600 media instances with each category.

Each text is a paragraph of an article about the category in Wikipedia and most of the texts are less than 200 words. The images are pictures with high resolution which contain the object of each category. Long videos in YouTube are segmented into short clips which exactly represent the category, and the video clips in this dataset are mostly less than one minute. The collected audio clips are mostly shorter than one minute, which can stand for the category like wolf howl. The 3D models are objects standing for the 20 semantic categories, such as easily recognizable models dog and tiger. The file formats for five media types are txt, jpg, avi, wav and obj separately, which can be processed by common approaches or tools.

The dataset is randomly split into a training set of 9,600 media objects and a test set of 2,400 media objects. The random split is performed on each media respectively, the ratio of training set and test set is 4:1. We summarize the split of each media type in Table 2.

Table 2: Split of each media type of PKU XMedia dataset

Media Text Image Video Audio 3D
Training 4,000 4,000 400 800 400
Test 1,000 1,000 100 200 100

We randomly select several examples of different media types of PKU XMedia dataset, which are shown in Figure 2.

All technical papers, documents and reports which use the PKU XMedia dataset must cite the corresponding papers as follows:

Visio-PubMed Example 1

Figure 2: Examples of PKU XMedia dataset.

Dataset Download:

PKU XMedia dataset is available with the feature files (text: 10-dimensional LDA, 3,000-dimensional BOW; image: 128-dimensional BoVW, 4,096-dimensional CNN; video: 128-dimensional BoVW, 4,096-dimensional CNN; audio: 29-dimensional MFCC; 3D: 4,700-dimensional LightField).

PKU XMediaNet dataset is available with the source URLs and features of the media instances.

Please download the Release Agreement, read it carefully, and complete it appropriately. Note that the agreement needs a handwritten signature by a full-time staff member (that is, student is not acceptable). Then, please scan the signed agreement and send it to SiBo Yin (2401112164@stu.pku.edu.cn). If you are from the mainland of China, please sign the agreement in Chinese rather than English. Then we will verify your request and contact you on how to download the data.