J Shanghai Jiaotong Univ Sci, 2023, Vol. 28, Issue 6: 728-737. doi: 10.1007/s12204-022-2465-y
Computing & Computer Technologies
ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭)
Accepted: 2021-03-29
Online: 2023-11-28
Published: 2023-12-04
ZENG Zhixian (曾志贤), CAO Jianjun* (曹建军), WENG Nianfeng (翁年凤), YUAN Zhen (袁震), YU Xu (余旭). Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism [J]. J Shanghai Jiaotong Univ Sci, 2023, 28(6): 728-737.