A New Dataset and Benchmark for Fine-grained Cross-media Retrieval
We construct a new dataset and benchmark for fine-grained cross-media retrieval. The new dataset, named PKU FG-XMedia, consists of 200 fine-grained bird subcategories and covers 4 media types: image, text, video, and audio. The taxonomy is the same as that of CUB-200-2011 [1]. The total number of media instances exceeds 50,000. The data sources of the text and audio instances are summarized in Table 1.
Table 1: Data sources for text and audio.
Data | Data Sources |
Text | (1) www.wikipedia.org (2) www.allaboutbirds.org (3) www.audubon.org (4) birdsna.org (5) birds.fandom.com (6) nhpbs.org (7) ebird.org (8) mnbirdatlas.org (9) sites.psu.edu (10) www.birdwatchersdigest.com (11) folksread.com (12) neotropical.birds.cornell.edu |
Audio | (1) www.xeno-canto.org (2) www.bird-sounds.net (3) www.findsounds.com (4) freesound.org (5) www.macaulaylibrary.org (6) avibase.bsc-eoc.org (7) soundcloud.com |
For text, the training and testing sets each contain 4,000 texts. For audio, the training and testing sets each contain 6,000 audio clips. For image and video, we follow the splits of the original datasets: the image training and testing sets contain 5,994 and 5,794 images, and the video training and testing sets contain 12,666 and 5,684 videos, respectively. The split of each media type is summarized in Table 2; a minimal sketch for checking these counts is given after the table.
Table 2: Split of each media type.
Media | Text | Audio | Image | Video |
Training | 4,000 | 6,000 | 5,994 | 12,666 |
Testing | 4,000 | 6,000 | 5,794 | 5,684 |
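As a quick illustration of these splits, the snippet below sketches how the per-media training/testing counts could be organized and sanity-checked when loading the dataset. The directory layout and file naming used here are hypothetical and only serve to illustrate the expected instance counts.

```python
# A minimal sketch of the PKU FG-XMedia splits; the <root>/<media>/<split>/
# directory layout below is hypothetical and only illustrates the counts.
from pathlib import Path

EXPECTED_SPLITS = {
    # media: (training, testing)
    "text":  (4000, 4000),
    "audio": (6000, 6000),
    "image": (5994, 5794),
    "video": (12666, 5684),
}

def count_instances(root: str, media: str, split: str) -> int:
    """Count files under <root>/<media>/<split>/ (hypothetical layout)."""
    return sum(1 for p in Path(root, media, split).rglob("*") if p.is_file())

def check_splits(root: str) -> None:
    for media, (n_train, n_test) in EXPECTED_SPLITS.items():
        assert count_instances(root, media, "train") == n_train, media
        assert count_instances(root, media, "test") == n_test, media

# check_splits("FG-XMedia")  # run after downloading and extracting the data
```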
We randomly select several examples of different media types, which are shown in Figure 1.
Figure 1: Examples of each media type.
All technical papers, documents, and reports that use the dataset and benchmark must cite the corresponding papers: the benchmark paper [10], as well as CUB-200-2011 [1] and YouTube Birds [2] for the image and video data.
For text, we provide the original data. For audio, we provide spectrogram images obtained with the Short-Time Fourier Transform (STFT); a sketch of this preprocessing is given below. To obtain the text and audio data, please download the Release Agreement, read it carefully, and have it signed by hand by a full-time staff member (students are not acceptable). Then scan the signed agreement and send it to SiBo Yin (2401112164@stu.pku.edu.cn). If you are from the mainland of China, please sign the agreement in Chinese rather than English. We will then verify your request and contact you about how to download the data. The image and video data can be downloaded directly from their original sources: CUB-200-2011 and YouTube Birds.
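For reference, the snippet below sketches one way to convert an audio clip into a spectrogram image with the STFT. The window length, hop size, and log scaling are assumptions made for illustration and need not match the parameters used to produce the released spectrogram images.

```python
# A minimal STFT-spectrogram sketch; the parameters are illustrative
# assumptions, not necessarily those used for the released spectrograms.
import numpy as np
import soundfile as sf
from scipy.signal import stft
import matplotlib.pyplot as plt

def audio_to_spectrogram(wav_path: str, png_path: str) -> None:
    audio, sr = sf.read(wav_path)          # load the waveform
    if audio.ndim > 1:
        audio = audio.mean(axis=1)         # mix down to mono
    f, t, Z = stft(audio, fs=sr, nperseg=1024, noverlap=512)
    log_mag = np.log1p(np.abs(Z))          # log-magnitude for better contrast
    plt.figure(figsize=(4, 4))
    plt.pcolormesh(t, f, log_mag, shading="auto")
    plt.axis("off")
    plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close()

# audio_to_spectrogram("bird_call.wav", "bird_call.png")
```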
Two types of retrieval tasks are conducted for fine-grained cross-media retrieval: (1) bi-modality retrieval, where an instance of one media type serves as the query to retrieve relevant instances of another media type (e.g., I->T); and (2) multi-modality retrieval, where an instance of one media type serves as the query to retrieve relevant instances of all other media types (e.g., I->All). A minimal retrieval sketch is given below.
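Both tasks can be carried out by ranking gallery instances by their similarity to the query in a learned common representation space. The cosine-similarity ranking below is a generic sketch of that setup with placeholder features and dimensionality; it is not the retrieval procedure of any specific compared method.

```python
# A minimal sketch of cross-media retrieval by cosine similarity in a common
# space; all features here are random placeholders, not real embeddings.
import numpy as np

def rank_gallery(query_feat: np.ndarray, gallery_feats: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted from most to least similar to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(g @ q))            # cosine similarity, best match first

# Bi-modality (I->T): rank the text gallery for an image query.
image_query = np.random.randn(256)               # hypothetical 256-d embedding
text_gallery = np.random.randn(4000, 256)
ranking_i2t = rank_gallery(image_query, text_gallery)

# Multi-modality (I->All): pool the text, audio, and video galleries, then rank.
all_gallery = np.concatenate([text_gallery,
                              np.random.randn(6000, 256),    # audio
                              np.random.randn(5684, 256)])   # video
ranking_i2all = rank_gallery(image_query, all_gallery)
```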
The compared methods are our approach [10] and the following existing methods: MHTN [3], ACMR [4], JRL [5], GSPH [6], CMDN [7], SCAN [8], and GXN [9].
We evaluate the retrieval results on the new fine-grained cross-media retrieval dataset with the mean average precision (MAP) score, which is widely used in information retrieval. The results of the compared methods are reported in Table 3 and Table 4 below; a sketch of the MAP computation is given after Table 4.
Table 3: The MAP scores of bi-modality fine-grained cross-media retrieval.
Methods | I->T | I->A | I->V | T->I | T->A | T->V | A->I | A->T | A->V | V->I | V->T | V->A | Average |
Ours [10] | 0.210 | 0.526 | 0.606 | 0.255 | 0.181 | 0.208 | 0.553 | 0.159 | 0.443 | 0.629 | 0.193 | 0.437 | 0.366 |
MHTN [3] | 0.116 | 0.195 | 0.281 | 0.124 | 0.138 | 0.185 | 0.196 | 0.127 | 0.290 | 0.306 | 0.186 | 0.306 | 0.204 |
ACMR [4] | 0.162 | 0.119 | 0.477 | 0.075 | 0.015 | 0.081 | 0.128 | 0.028 | 0.068 | 0.536 | 0.138 | 0.111 | 0.162 |
JRL [5] | 0.160 | 0.085 | 0.435 | 0.190 | 0.028 | 0.095 | 0.115 | 0.035 | 0.065 | 0.517 | 0.126 | 0.068 | 0.160 |
GSPH [6] | 0.140 | 0.098 | 0.413 | 0.179 | 0.024 | 0.109 | 0.129 | 0.024 | 0.073 | 0.512 | 0.126 | 0.086 | 0.159 |
CMDN [7] | 0.099 | 0.009 | 0.377 | 0.123 | 0.007 | 0.078 | 0.017 | 0.008 | 0.010 | 0.446 | 0.081 | 0.009 | 0.105 |
SCAN [8] | 0.050 | - | - | 0.050 | - | - | - | - | - | - | - | - | 0.050 |
GXN [9] | 0.023 | - | - | 0.035 | - | - | - | - | - | - | - | - | 0.029 |
Table 4: The MAP scores of multi-modality fine-grained cross-media retrieval.
Methods | I->All | T->All | A->All | V->All | Average |
Ours [10] | 0.549 | 0.196 | 0.416 | 0.485 | 0.412 |
MHTN [3] | 0.208 | 0.142 | 0.237 | 0.341 | 0.232 |
GSPH [6] | 0.387 | 0.103 | 0.075 | 0.312 | 0.219 |
JRL [5] | 0.344 | 0.080 | 0.069 | 0.275 | 0.192 |
CMDN [7] | 0.321 | 0.071 | 0.016 | 0.229 | 0.159 |
ACMR [4] | 0.245 | 0.039 | 0.041 | 0.279 | 0.151 |
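For completeness, the snippet below sketches how a MAP score can be computed for retrieval, treating gallery items that share the query's fine-grained subcategory as relevant. It is a generic MAP implementation, not the benchmark's official evaluation script.

```python
# A minimal MAP sketch: gallery items sharing the query's subcategory label
# count as relevant. Generic implementation, not the official evaluation code.
import numpy as np

def average_precision(ranked_labels: np.ndarray, query_label: int) -> float:
    """AP for one query, given the gallery labels in ranked order."""
    relevant = (ranked_labels == query_label).astype(float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(sim: np.ndarray,
                           query_labels: np.ndarray,
                           gallery_labels: np.ndarray) -> float:
    """sim[i, j] is the similarity between query i and gallery item j."""
    aps = []
    for i, q_label in enumerate(query_labels):
        order = np.argsort(-sim[i])              # best match first
        aps.append(average_precision(gallery_labels[order], q_label))
    return float(np.mean(aps))
```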
New results: to have your results included in Table 3 and Table 4, please send them together with the corresponding publication to 2401112164@stu.pku.edu.cn.
The source code has been released on our GitHub homepage.
[1]Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[2]Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. Fine-grained video categorization with redundancy reduction attention. In European Conference on Computer Vision (ECCV), pages 139–155, 2018.
[3]Xin Huang, Yuxin Peng, and Mingkuan Yuan. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval. IEEE Transactions on Cybernetics (TCYB), 2018.
[4]Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM), pages 154–162, 2017.
[5]Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 24(6):965–978, 2014.
[6]Devraj Mandal, Kunal N Chaudhury, and Soma Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4076–4084, 2017.
[7]Yuxin Peng, Xin Huang, and Jinwei Qi. Cross-media shared representation by hierarchical learning with multiple deep networks. In 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 3846–3853, 2016.
[8]Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
[9]Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7181–7189, 2018.
[10]Xiangteng He, Yuxin Peng, and Liu Xie. A new benchmark and approach for fine-grained cross-media retrieval. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), pages 1740–1748, 2019.
Questions and comments can be sent to: 2401112164@stu.pku.edu.cn