A New Dataset and Benchmark for Fine-grained Cross-media Retrieval

We construct a new dataset and benchmark for fine-grained cross-media retrieval. The new dataset, named PKU FG-XMedia, consists of 200 fine-grained bird subcategories and contains 4 media types: image, text, video, and audio. Its taxonomy is the same as that of CUB-200-2011 [1]. The total number of media instances exceeds 50,000; the data sources for text and audio are listed in Table 1.

Table 1: Data sources for text and audio.

Media  Data Sources
Text   (1) www.wikipedia.org (2) www.allaboutbirds.org (3) www.audubon.org (4) birdsna.org (5) birds.fandom.com (6) nhpbs.org (7) ebird.org (8) mnbirdatlas.org (9) sites.psu.edu (10) www.birdwatchersdigest.com (11) folksread.com (12) neotropical.birds.cornell.edu
Audio  (1) www.xeno-canto.org (2) www.bird-sounds.net (3) www.findsounds.com (4) freesound.org (5) www.macaulaylibrary.org (6) avibase.bsc-eoc.org (7) soundcloud.com

For text, the training and testing sets each contain 4,000 texts.

For audio, the training and testing sets each contain 6,000 audio clips.

For image and video, we follow the original train/test splits of the source datasets. For image, the training set contains 5,994 images and the testing set contains 5,794 images. For video, the training set contains 12,666 videos and the testing set contains 5,684 videos.

We summarize the split of each media type in Table 2.

Table 2: Split of each media type.

Media Text Audio Image Video
Training 4,000 6,000 5,994 12,666
Testing 4,000 6,000 5,794 5,684

Randomly selected examples of the different media types are shown in Figure 1.


Figure 1: Examples of each media type.

All technical papers, documents, and reports that use the dataset and benchmark must cite the corresponding papers: the benchmark paper [10] and, for the image and video data, the original CUB-200-2011 [1] and YouTube Birds [2] papers.

Dataset Download:

For text, we provide the original data. For audio, we provide spectrogram images obtained with the short-time Fourier transform (STFT). Please download the Release Agreement, read it carefully, and have it signed by hand by a full-time staff member (students are not acceptable). Then scan the signed agreement and send it to SiBo Yin (2401112164@stu.pku.edu.cn). If you are from mainland China, please sign the agreement in Chinese rather than English. We will then verify your request and contact you about how to download the data.
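
For reference, the audio-to-spectrogram conversion can be done as in the following Python sketch. It is a minimal example using librosa; the function name and the STFT parameters (n_fft, hop_length) are illustrative assumptions, not the settings actually used for this release.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_spectrogram_image(wav_path, out_path, n_fft=1024, hop_length=512):
    """Convert one audio clip into a log-magnitude STFT spectrogram image."""
    y, sr = librosa.load(wav_path, sr=None)                   # keep native sample rate
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    s_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # magnitude in dB
    fig, ax = plt.subplots()
    librosa.display.specshow(s_db, sr=sr, hop_length=hop_length,
                             x_axis="time", y_axis="hz", ax=ax)
    ax.set_axis_off()                                          # save the image only
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)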

The image and video data can be downloaded directly from their original sources: CUB-200-2011 [1] and YouTube Birds [2].

Experimental Results:

Two types of retrieval tasks are conducted for fine-grained cross-media retrieval: bi-modality retrieval, where a query of one media type retrieves instances of another single media type (Table 3), and multi-modality retrieval, where a query of one media type retrieves instances of all media types (Table 4); a minimal ranking sketch is given below.
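
How the ranking is produced is up to each method; a common formulation embeds all media types into a shared feature space and ranks a gallery by cosine similarity to the query. The NumPy sketch below assumes such features are already extracted (feature extraction is not shown, and the function names are illustrative):

import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_gallery(query_feat, gallery_feats):
    """Indices of gallery items sorted by descending cosine similarity."""
    sims = l2_normalize(gallery_feats) @ l2_normalize(query_feat)
    return np.argsort(-sims)

# Bi-modality (e.g., I->T): the gallery holds features of one other media type.
# Multi-modality (e.g., I->All): pool the features of all media types into one
# gallery before ranking, e.g.:
# gallery_feats = np.concatenate([image_feats, text_feats, audio_feats, video_feats])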

The compared methods are MHTN [3], ACMR [4], JRL [5], GSPH [6], CMDN [7], SCAN [8], GXN [9], and our approach [10].

We evaluate the retrieval results on the new fine-grained cross-media retrieval dataset with the mean average precision (MAP) score, which is widely used in information retrieval. The results of the compared methods are reported in Table 3 and Table 4 below, where I, T, A, and V denote image, text, audio, and video, respectively (e.g., I->T means retrieving text instances with an image query).
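
As a reference, the following NumPy sketch computes the average precision (AP) for one query given the relevance of the gallery items in ranked order (relevant = same fine-grained subcategory as the query); MAP is the mean of AP over all queries. This is the standard formulation; any benchmark-specific evaluation details follow [10].

import numpy as np

def average_precision(ranked_relevance):
    """AP for one query; input is a boolean relevance vector in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(rel.size) + 1)  # precision@k for each k
    return float(prec_at_k[rel].mean())                     # averaged at relevant ranks

# MAP over one task:
# map_score = np.mean([average_precision(r) for r in per_query_relevance])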

Table 3: The MAP scores of bi-modality fine-grained cross-media retrieval.

Methods I->T I->A I->V T->I T->A T->V A->I A->T A->V V->I V->T V->A Average
Ours [10] 0.210 0.526 0.606 0.255 0.181 0.208 0.553 0.159 0.443 0.629 0.193 0.437 0.366
MHTN [3] 0.116 0.195 0.281 0.124 0.138 0.185 0.196 0.127 0.290 0.306 0.186 0.306 0.204
ACMR [4] 0.162 0.119 0.477 0.075 0.015 0.081 0.128 0.028 0.068 0.536 0.138 0.111 0.162
JRL [5] 0.160 0.085 0.435 0.190 0.028 0.095 0.115 0.035 0.065 0.517 0.126 0.068 0.160
GSPH [6] 0.140 0.098 0.413 0.179 0.024 0.109 0.129 0.024 0.073 0.512 0.126 0.086 0.159
CMDN [7] 0.099 0.009 0.377 0.123 0.007 0.078 0.017 0.008 0.010 0.446 0.081 0.009 0.105
SCAN [8] 0.050 - - 0.050 - - - - - - - - 0.050
GXN [9] 0.023 - - 0.035 - - - - - - - - 0.029

Table 4: The MAP scores of multi-modality fine-grained cross-media retrieval.

Methods I->All T->All A->All V->All Average
Ours [10] 0.549 0.196 0.416 0.485 0.412
MHTN [3] 0.208 0.142 0.237 0.341 0.232
GSPH [6] 0.387 0.103 0.075 0.312 0.219
JRL [5] 0.344 0.080 0.069 0.275 0.192
CMDN [7] 0.321 0.071 0.016 0.229 0.159
ACMR [4] 0.245 0.039 0.041 0.279 0.151

New results: please send your results and the corresponding publication to 2401112164@stu.pku.edu.cn, and we will add them to Table 3 and Table 4.

Source Code:

The source code has been released on our GitHub homepage.

References:

[1] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology, Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[2] Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. Fine-grained video categorization with redundancy reduction attention. In European Conference on Computer Vision (ECCV), pages 139–155, 2018.
[3] Xin Huang, Yuxin Peng, and Mingkuan Yuan. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval. IEEE Transactions on Cybernetics (TCYB), 2018.
[4] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM), pages 154–162, 2017.
[5] Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 24(6):965–978, 2014.
[6] Devraj Mandal, Kunal N. Chaudhury, and Soma Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4076–4084, 2017.
[7] Yuxin Peng, Xin Huang, and Jinwei Qi. Cross-media shared representation by hierarchical learning with multiple deep networks. In 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 3846–3853, 2016.
[8] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
[9] Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7181–7189, 2018.
[10] Xiangteng He, Yuxin Peng, and Liu Xie. A new benchmark and approach for fine-grained cross-media retrieval. In 27th ACM Multimedia Conference (ACM MM), pages 1740–1748, 2019.

Contact:

Questions and comments can be sent to: 2401112164@stu.pku.edu.cn