Data Splitting Method of Distance Metric Learning Based on Gaussian Mixture Model
Received date: 2020-03-24
Online published: 2021-03-03
Funding
Solutions for a New Cross-Domain Collaborative Medical Imaging Service Model (2017YFC0112900); Development of an Evaluation Database and Service Platform for Artificial Intelligence Medical Software (2019YFC0118803)
To address the instability and bias that tend to arise when a model is trained repeatedly on a limited number of samples, this paper proposes a data splitting method based on distance metric learning and the Gaussian mixture model, which tackles the problem by dividing the dataset more reasonably. Distance metric learning relies on the strong feature extraction capability of deep neural networks to embed the features extracted from the original data into a new metric space. In this new metric space, a Gaussian mixture model is then applied to the deep features for cluster analysis and estimation of the sample distribution. Finally, stratified sampling based on the characteristics of the sample distribution is used to divide the data. The research shows that the proposed method captures the characteristics of the data distribution better and yields a more reasonable data split, thereby improving the accuracy and generalization of the model.
Keywords: artificial intelligence training; dataset splitting; deep neural network; Gaussian mixture model
ZHENG Dezhong, YANG Yuanyuan, XIE Zhe, NI Yangfan, LI Wentao. Data Splitting Method of Distance Metric Learning Based on Gaussian Mixture Model[J]. Journal of Shanghai Jiao Tong University, 2021, 55(2): 131-140. DOI: 10.16183/j.cnki.jsjtu.2020.082
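The last two stages described in the abstract (Gaussian mixture clustering in the embedded space, then stratified sampling over the resulting clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gmm_cluster_labels` and `stratified_split`, the number of mixture components, the test fraction, and the use of hard cluster assignments are all assumptions made for illustration, and `embeddings` stands in for the output of a trained metric-learning network, which is not reproduced here. The clustering step assumes scikit-learn is installed.

```python
import random
from collections import defaultdict


def gmm_cluster_labels(embeddings, n_components=3, seed=0):
    """Cluster embedded samples with a Gaussian mixture model.

    `embeddings` is assumed to be an (n_samples, n_features) array-like
    holding features already mapped into the learned metric space.
    """
    # Imported lazily so the splitting helper below has no third-party
    # dependency (assumption: scikit-learn is available when this is called).
    from sklearn.mixture import GaussianMixture

    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    # fit_predict returns the most probable component for each sample.
    return gmm.fit_predict(embeddings).tolist()


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so that every cluster contributes roughly
    the same fraction of its members to the test set."""
    rng = random.Random(seed)

    # Group sample indices by cluster label.
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        if len(members) > 1:
            # Illustrative choice: keep at least one sample of each
            # cluster on both sides of the split.
            n_test = round(len(members) * test_fraction)
            n_test = min(max(n_test, 1), len(members) - 1)
        else:
            # A singleton cluster stays in the training set.
            n_test = 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

A typical call would be `labels = gmm_cluster_labels(embeddings)` followed by `train_idx, test_idx = stratified_split(labels)`. Sampling within each cluster, rather than over the whole dataset at once, is what keeps the train and test subsets close to the estimated sample distribution.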