A New Dataset and Benchmark for Fine-grained Cross-media Retrieval
We construct a new dataset and benchmark for fine-grained cross-media retrieval. The new dataset, named PKU FG-XMedia, consists of 200 fine-grained bird subcategories and covers 4 media types: image, text, video, and audio. The taxonomy is the same as that of CUB-200-2011 [1]. The total number of media instances exceeds 50,000. The data sources of the text and audio instances are summarized in Table 1.
Table 1: Data sources for text and audio.
Data | Data Sources |
Text | (1) www.wikipedia.org (2) www.allaboutbirds.org (3) www.audubon.org (4) birdsna.org (5) birds.fandom.com (6) nhpbs.org (7) ebird.org (8) mnbirdatlas.org (9) sites.psu.edu (10) www.birdwatchersdigest.com (11) folksread.com (12) neotropical.birds.cornell.edu |
Audio | (1) www.xeno-canto.org (2) www.bird-sounds.net (3) www.findsounds.com (4) freesound.org (5) www.macaulaylibrary.org (6) avibase.bsc-eoc.org (7) soundcloud.com |
For text, the training and testing sets each contain 4,000 texts. For audio, the training and testing sets each contain 6,000 audio clips. For image and video, we follow the splits of the original datasets: the image training and testing sets contain 5,994 and 5,794 images, and the video training and testing sets contain 12,666 and 5,684 videos, respectively. The split of each media type is summarized in Table 2; a minimal sketch for checking these counts is given after the table.
Table 2: Split of each media type.
Media | Text | Audio | Image | Video |
Training | 4,000 | 6,000 | 5,994 | 12,666 |
Testing | 4,000 | 6,000 | 5,794 | 5,684 |
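As a quick illustration of these splits, the snippet below sketches how the per-media training/testing counts could be organized and sanity-checked when loading the dataset. The directory layout and file naming used here are hypothetical and only serve to illustrate the expected instance counts.

```python
# A minimal sketch of the PKU FG-XMedia splits; the <root>/<media>/<split>/
# directory layout below is hypothetical and only illustrates the counts.
from pathlib import Path

EXPECTED_SPLITS = {
    # media: (training, testing)
    "text":  (4000, 4000),
    "audio": (6000, 6000),
    "image": (5994, 5794),
    "video": (12666, 5684),
}

def count_instances(root: str, media: str, split: str) -> int:
    """Count files under <root>/<media>/<split>/ (hypothetical layout)."""
    return sum(1 for p in Path(root, media, split).rglob("*") if p.is_file())

def check_splits(root: str) -> None:
    for media, (n_train, n_test) in EXPECTED_SPLITS.items():
        assert count_instances(root, media, "train") == n_train, media
        assert count_instances(root, media, "test") == n_test, media

# check_splits("FG-XMedia")  # run after downloading and extracting the data
```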
We randomly select several examples of different media types, which are shown in Figure 1.
Figure 1: Examples of each media type.
All technical papers, documents, and reports that use the dataset and benchmark must cite the corresponding papers: the benchmark paper [10], as well as CUB-200-2011 [1] and YouTube Birds [2] for the image and video data.
For text, we provide the original data. For audio, we provide spectrogram images obtained with the Short-Time Fourier Transform (STFT); a sketch of this preprocessing is given below. To obtain the text and audio data, please download the Release Agreement, read it carefully, and have it signed by hand by a full-time staff member (students are not acceptable). Then scan the signed agreement and send it to SiBo Yin (2401112164@stu.pku.edu.cn). If you are from the mainland of China, please sign the agreement in Chinese rather than English. We will then verify your request and contact you about how to download the data. The image and video data can be downloaded directly from their original sources: CUB-200-2011 and YouTube Birds.
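For reference, the snippet below sketches one way to convert an audio clip into a spectrogram image with the STFT. The window length, hop size, and log scaling are assumptions made for illustration and need not match the parameters used to produce the released spectrogram images.

```python
# A minimal STFT-spectrogram sketch; the parameters are illustrative
# assumptions, not necessarily those used for the released spectrograms.
import numpy as np
import soundfile as sf
from scipy.signal import stft
import matplotlib.pyplot as plt

def audio_to_spectrogram(wav_path: str, png_path: str) -> None:
    audio, sr = sf.read(wav_path)          # load the waveform
    if audio.ndim > 1:
        audio = audio.mean(axis=1)         # mix down to mono
    f, t, Z = stft(audio, fs=sr, nperseg=1024, noverlap=512)
    log_mag = np.log1p(np.abs(Z))          # log-magnitude for better contrast
    plt.figure(figsize=(4, 4))
    plt.pcolormesh(t, f, log_mag, shading="auto")
    plt.axis("off")
    plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close()

# audio_to_spectrogram("bird_call.wav", "bird_call.png")
```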
Two types of retrieval tasks are conducted for fine-grained cross-media retrieval: (1) bi-modality retrieval, where an instance of one media type serves as the query to retrieve relevant instances of another media type (e.g., I->T); and (2) multi-modality retrieval, where an instance of one media type serves as the query to retrieve relevant instances of all other media types (e.g., I->All). A minimal retrieval sketch is given below.
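Both tasks can be carried out by ranking gallery instances by their similarity to the query in a learned common representation space. The cosine-similarity ranking below is a generic sketch of that setup with placeholder features and dimensionality; it is not the retrieval procedure of any specific compared method.

```python
# A minimal sketch of cross-media retrieval by cosine similarity in a common
# space; all features here are random placeholders, not real embeddings.
import numpy as np

def rank_gallery(query_feat: np.ndarray, gallery_feats: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted from most to least similar to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(g @ q))            # cosine similarity, best match first

# Bi-modality (I->T): rank the text gallery for an image query.
image_query = np.random.randn(256)               # hypothetical 256-d embedding
text_gallery = np.random.randn(4000, 256)
ranking_i2t = rank_gallery(image_query, text_gallery)

# Multi-modality (I->All): pool the text, audio, and video galleries, then rank.
all_gallery = np.concatenate([text_gallery,
                              np.random.randn(6000, 256),    # audio
                              np.random.randn(5684, 256)])   # video
ranking_i2all = rank_gallery(image_query, all_gallery)
```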
The compared methods are our approach [10] and the following existing methods: MHTN [3], ACMR [4], JRL [5], GSPH [6], CMDN [7], SCAN [8], and GXN [9].
We evaluate the retrieval results on the new fine-grained cross-media retrieval dataset with the mean average precision (MAP) score, which is widely used in information retrieval. The results of the compared methods are reported in Table 3 and Table 4 below; a sketch of the MAP computation is given after Table 4.
Table 3: The MAP scores of bi-modality fine-grained cross-media retrieval.
Methods | I->T | I->A | I->V | T->I | T->A | T->V | A->I | A->T | A->V | V->I | V->T | V->A | Average |
Ours [10] | 0.210 | 0.526 | 0.606 | 0.255 | 0.181 | 0.208 | 0.553 | 0.159 | 0.443 | 0.629 | 0.193 | 0.437 | 0.366 |
MHTN [3] | 0.116 | 0.195 | 0.281 | 0.124 | 0.138 | 0.185 | 0.196 | 0.127 | 0.290 | 0.306 | 0.186 | 0.306 | 0.204 |
ACMR [4] | 0.162 | 0.119 | 0.477 | 0.075 | 0.015 | 0.081 | 0.128 | 0.028 | 0.068 | 0.536 | 0.138 | 0.111 | 0.162 |
JRL [5] | 0.160 | 0.085 | 0.435 | 0.190 | 0.028 | 0.095 | 0.115 | 0.035 | 0.065 | 0.517 | 0.126 | 0.068 | 0.160 |
GSPH [6] | 0.140 | 0.098 | 0.413 | 0.179 | 0.024 | 0.109 | 0.129 | 0.024 | 0.073 | 0.512 | 0.126 | 0.086 | 0.159 |
CMDN [7] | 0.099 | 0.009 | 0.377 | 0.123 | 0.007 | 0.078 | 0.017 | 0.008 | 0.010 | 0.446 | 0.081 | 0.009 | 0.105 |
SCAN [8] | 0.050 | - | - | 0.050 | - | - | - | - | - | - | - | - | 0.050 |
GXN [9] | 0.023 | - | - | 0.035 | - | - | - | - | - | - | - | - | 0.029 |
Table 4: The MAP scores of multi-modality fine-grained cross-media retrieval.
Methods | I->All | T->All | A->All | V->All | Average |
Ours [10] | 0.549 | 0.196 | 0.416 | 0.485 | 0.412 |
MHTN [3] | 0.208 | 0.142 | 0.237 | 0.341 | 0.232 |
GSPH [6] | 0.387 | 0.103 | 0.075 | 0.312 | 0.219 |
JRL [5] | 0.344 | 0.080 | 0.069 | 0.275 | 0.192 |
CMDN [7] | 0.321 | 0.071 | 0.016 | 0.229 | 0.159 |
ACMR [4] | 0.245 | 0.039 | 0.041 | 0.279 | 0.151 |
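For completeness, the snippet below sketches how a MAP score can be computed for retrieval, treating gallery items that share the query's fine-grained subcategory as relevant. It is a generic MAP implementation, not the benchmark's official evaluation script.

```python
# A minimal MAP sketch: gallery items sharing the query's subcategory label
# count as relevant. Generic implementation, not the official evaluation code.
import numpy as np

def average_precision(ranked_labels: np.ndarray, query_label: int) -> float:
    """AP for one query, given the gallery labels in ranked order."""
    relevant = (ranked_labels == query_label).astype(float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(sim: np.ndarray,
                           query_labels: np.ndarray,
                           gallery_labels: np.ndarray) -> float:
    """sim[i, j] is the similarity between query i and gallery item j."""
    aps = []
    for i, q_label in enumerate(query_labels):
        order = np.argsort(-sim[i])              # best match first
        aps.append(average_precision(gallery_labels[order], q_label))
    return float(np.mean(aps))
```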
New results: to have your results included in Table 3 and Table 4, please send them together with the corresponding publication to 2401112164@stu.pku.edu.cn.
The source code has been released on our GitHub homepage.
[1]Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[2]Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. Fine-grained video categorization with redundancy reduction attention. In European Conference on Computer Vision (ECCV), pages 139–155, 2018.
[3]Xin Huang, Yuxin Peng, and Mingkuan Yuan. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval. IEEE Transactions on Cybernetics (TCYB), 2018.
[4]Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM), pages 154–162, 2017.
[5]Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 24(6):965–978, 2014.
[6]Devraj Mandal, Kunal N Chaudhury, and Soma Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4076–4084, 2017.
[7]Yuxin Peng, Xin Huang, and Jinwei Qi. Cross-media shared representation by hierarchical learning with multiple deep networks. In 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 3846–3853, 2016.
[8]Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
[9]Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7181–7189, 2018.
[10]Xiangteng He, Yuxin Peng, and Liu Xie. A new benchmark and approach for fine-grained cross-media retrieval. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), pages 1740–1748, 2019.
Questions and comments can be sent to: 2401112164@stu.pku.edu.cn