J Shanghai Jiaotong Univ Sci ›› 2023, Vol. 28 ›› Issue (6): 728-737. doi: 10.1007/s12204-022-2465-y


Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism

ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭)   

  1. (The Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China)
  • Accepted: 2021-03-29  Online: 2023-11-28  Published: 2023-12-04

Abstract: To address the problem that existing cross-modal entity resolution methods tend to ignore the high-level semantic correlations between cross-modal data, we propose a cross-modal entity resolution method for image and text that integrates a global and a fine-grained joint attention mechanism (IGFJAM). First, we map the cross-modal data into a common embedding space using a feature extraction network. Then, we combine a global joint attention mechanism with a fine-grained joint attention mechanism, enabling the model to learn both the global semantic correlations and the local fine-grained semantic correlations of the cross-modal data, which fully exploits the cross-modal semantic correlation and improves entity resolution performance. Experiments on the Flickr-30K and MS-COCO datasets show that IGFJAM improves the overall R@sum metric by 4.30% and 4.54%, respectively, compared with five state-of-the-art methods, demonstrating the superiority of the proposed method.
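A minimal PyTorch-style sketch of the idea described above — combining a global similarity with a fine-grained cross-attention similarity between image regions and caption words — is given below. It is illustrative only: the tensor shapes, the mean pooling, the function name match_score, and the fusion weight alpha are assumptions made for exposition, not the authors' IGFJAM implementation.

import torch
import torch.nn.functional as F

def match_score(img_regions: torch.Tensor, txt_words: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # img_regions: (n_regions, d) region features in the common embedding space.
    # txt_words:   (n_words, d)   word features in the same space.
    v = F.normalize(img_regions, dim=-1)
    t = F.normalize(txt_words, dim=-1)

    # Global similarity: cosine between mean-pooled image and text embeddings.
    global_sim = F.cosine_similarity(v.mean(0, keepdim=True), t.mean(0, keepdim=True)).squeeze()

    # Fine-grained joint attention: each word attends over image regions and is
    # compared with its attended visual context.
    attn = torch.softmax(t @ v.T, dim=-1)    # (n_words, n_regions)
    attended = attn @ v                      # (n_words, d)
    local_sim = F.cosine_similarity(t, attended, dim=-1).mean()

    # Fuse the global and the fine-grained evidence into one matching score.
    return alpha * global_sim + (1 - alpha) * local_sim

# Usage: score a caption against an image (random features stand in for real ones).
score = match_score(torch.randn(36, 512), torch.randn(12, 512))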

Key words: cross-modal entity resolution, joint attention mechanism, deep learning, feature extraction, semantic correlation
