Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism

doi:10.1007/s12204-022-2465-y

Abstract

Abstract: Existing cross-modal entity resolution methods tend to overlook the high-level semantic correlations between cross-modal data. To address this problem, we propose a cross-modal entity resolution method for image and text that integrates global and fine-grained joint attention mechanisms (IGFJAM). First, a feature extraction network maps the cross-modal data into a common embedding space. Then, a global joint attention mechanism and a local fine-grained joint attention mechanism are combined so that the model learns both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data. This lets the model exploit cross-modal semantic correlation more fully and improves its entity resolution performance. Experiments on the Flickr-30K and MS-COCO public datasets show that, compared with five state-of-the-art methods, IGFJAM improves overall R@sum performance by 4.30% and 4.54%, respectively.

Key words: cross-modal entity resolution, joint attention mechanism, deep learning, feature extraction, semantic correlation
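The abstract reports gains in R@sum without defining the metric. In image-text matching work, R@sum is commonly taken to be the sum of Recall@1, Recall@5 and Recall@10 over both retrieval directions (image-to-text and text-to-image). The sketch below assumes that definition applies here. It also assumes, for simplicity, that query i has exactly one correct match, candidate i. The function names `recall_at_k` and `r_sum` are hypothetical and are not taken from the paper.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match is ranked in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(sim, ks=(1, 5, 10)):
    """Sum of Recall@k over both retrieval directions (assumed definition)."""
    # Transposing the matrix swaps the roles of queries and candidates.
    transposed = [list(col) for col in zip(*sim)]
    return sum(recall_at_k(sim, k) + recall_at_k(transposed, k) for k in ks)
```

Under this definition, six recall values are added together, so R@sum ranges from 0 to 600. Datasets such as Flickr-30K pair each image with several captions, so a real evaluation would need a many-to-one ground truth rather than the diagonal assumed above.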

CLC number: TP311

ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭). Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism[J]. J Shanghai Jiaotong Univ Sci, 2023, 28(6): 728-737.
