J Shanghai Jiaotong Univ Sci, 2023, Vol. 28, Issue 6: 728-737. doi: 10.1007/s12204-022-2465-y
ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭)
Accepted: 2021-03-29
Online: 2023-11-28
Published: 2023-12-04
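Abstract: To address the problem that existing cross-modal entity resolution methods tend to ignore the high-dimensional semantic information in the data, an image-text cross-modal entity resolution method based on a fine-grained joint attention mechanism is proposed (Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-grained Joint Attention Mechanism, IGFJAM). First, feature extraction networks map the cross-modal data into a common embedding space. Then, a global joint attention mechanism and a local fine-grained joint attention mechanism are combined to learn both the global semantic correlations and the fine-grained joint semantic correlations between cross-modal data, which effectively improves the model's cross-modal entity resolution performance. In tests on the public Flickr-30K and MS-COCO datasets, IGFJAM improves R@sum over existing methods by 4.30% and 4.54%, respectively.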
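The abstract reports results as R@sum without defining it. The sketch below shows one common way this metric is computed in image-text matching work: Recall@K is the percentage of queries whose true match appears among the top K retrieved items, and R@sum adds up R@1, R@5 and R@10 over both retrieval directions. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `recall_at_k` and `r_sum` are hypothetical, the similarity matrix is assumed to come from comparing image and text embeddings in the common space, and each image is assumed to have exactly one matching text at the same index (benchmark datasets typically pair each image with several captions).

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries (rows) whose true match ranks in the top k.

    Assumption: sim[i, j] is the similarity between query i and candidate j,
    and the true match of query i is candidate i.
    """
    n = sim.shape[0]
    # Score of the true match for every query, shaped (n, 1) for broadcasting.
    true_scores = sim[np.arange(n), np.arange(n)][:, None]
    # Rank of the true match = number of candidates scoring strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return 100.0 * float((ranks < k).mean())

def r_sum(sim: np.ndarray, ks=(1, 5, 10)) -> float:
    """Sum of Recall@K over both directions (image->text and text->image)."""
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim.T, k) for k in ks)
    return image_to_text + text_to_image

if __name__ == "__main__":
    # Toy 2x2 similarity matrix: rows are images, columns are texts.
    toy = np.array([[0.9, 0.1],
                    [0.8, 0.2]])
    print(recall_at_k(toy, 1), r_sum(toy, ks=(1,)))
```

With three values of K and two directions, a perfect model scores 600 under this definition, so differences in R@sum aggregate gains across all six recall figures.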
ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭). Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism [J]. J Shanghai Jiaotong Univ Sci, 2023, 28(6): 728-737.