[1] |
PENG Y X, HUANG X, ZHAO Y Z. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(9): 2372-2385.
|
[2] |
RASIWASIA N, PEREIRA J C, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval [C]//18th ACM International Conference on Multimedia. Firenze, Italy: ACM, 2010: 251-260.
|
[3] |
HOTELLING H. Relations between two sets of variates [M]//Breakthroughs in statistics. New York: Springer, 1992: 162-190.
|
[4] |
JIANG B, YANG J C, LV Z H, et al. Internet cross-media retrieval based on deep learning [J]. Journal of Visual Communication and Image Representation, 2017, 48: 356-366.
|
[5] |
FROME A, CORRADO G S, SHLENS J, et al. DeViSE: A deep visual-semantic embedding model [M]//Advances in neural information processing systems 26. Red Hook, NY: Curran Associates Inc., 2013: 2121-2129.
|
[6] |
PENG Y X, QI J W. CM-GANs: Cross-modal generative adversarial networks for common representation learning [J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(1): 22.
|
[7] |
ANDREW G, ARORA R, BILMES J, et al. Deep canonical correlation analysis [C]//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: PMLR, 2013: 1247-1255.
|
[8] |
LU Y H, YU J, LIU Y B, et al. Fine-grained correlation learning with stacked co-attention networks for cross-modal information retrieval [M]//Knowledge science, engineering and management. Cham: Springer, 2018: 213-225.
|
[9] |
LV G, CAO J, ZHENG Q, et al. Cross-modal entity resolution based on co-attentional generative adversarial network [C]//2019 4th International Conference on Multimedia Systems and Signal Processing. Guangzhou, China: ACM, 2019: 42-46.
|
[10] |
QI J W, PENG Y X, YUAN Y X. Cross-media multi-level alignment with relation attention network [C]//27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: IEEE, 2018: 892-898.
|
[11] |
PENG Y, QI J, ZHUO Y. MAVA: Multi-level adaptive visual-textual alignment by cross-media bi-attention mechanism [J]. IEEE Transactions on Image Processing, 2019, 29: 2728-2741.
|
[12] |
HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.
|
[13] |
DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database [C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 248-255.
|
[14] |
REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
|
[15] |
ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 6077-6086.
|
[16] |
CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [C]//2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACM, 2014: 1724-1734.
|
[17] |
HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
|
[18] |
LIN X, PARIKH D. Leveraging visual question answering for image-caption ranking [M]//Computer vision-ECCV 2016. Cham: Springer, 2016: 261-277.
|
[19] |
YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions [J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78.
|
[20] |
LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [M]//Computer vision-ECCV 2014. Cham: Springer, 2014: 740-755.
|
[21] |
LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching [M]//Computer vision-ECCV 2018. Cham: Springer, 2018: 212-218.
|
[22] |
HUANG Y, WU Q, SONG C F, et al. Learning semantic concepts and order for image and sentence matching [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 6163-6171.
|
[23] |
FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives [EB/OL]. (2018-07-29). https://arxiv.org/abs/1707.05612.
|
[24] |
ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss [J]. ACM Transactions on Multimedia Computing Communications and Applications, 2020, 16(2): 1-23.
|