Journal of Shanghai Jiao Tong University ›› 2021, Vol. 55 ›› Issue (5): 607-614. doi: 10.16183/j.cnki.jsjtu.2020.120



Video Anomaly Detection Combining FCN and LSTM

WU Guangli1,2, GUO Zhenzhou1, LI Leiting1, WANG Chengxiang1

  1. School of Cyber Security, Gansu University of Political Science and Law, Lanzhou 730070, China
  2. Key Laboratory of China’s Ethnic Languages and Information Technology of the Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
  • Received: 2020-04-26  Online: 2021-05-28  Published: 2021-06-01
  • About the author: WU Guangli (1981-), male, a native of Weifang, Shandong Province, is a professor whose research focuses on information content security and artificial intelligence. Tel.: 0931-7601406; E-mail: 272956638@qq.com.
  • Funding: Natural Science Foundation of Gansu Province (17JR5RA161); Youth Science and Technology Fund Program of Gansu Province (18JR3RA193); Scientific Research Project of Higher Education Institutions of Gansu Province (2017A-068)

Abstract:

In view of the shortcomings of traditional video anomaly detection models, a network structure combining a fully convolutional network (FCN) and a long short-term memory (LSTM) network is proposed. The network performs pixel-level prediction and can accurately locate abnormal regions. It first uses a convolutional neural network to extract image features at different depths from the video frames. These features are then fed into memory networks to analyze the semantic information of the time series, and the image features and semantic information are fused through a residual structure. At the same time, a skip structure integrates the fused multi-modal features, and upsampling is performed to obtain a prediction map of the same size as the original video frame. The proposed model is tested on the Ped2 subset of the University of California, San Diego (UCSD) anomaly detection dataset and on the University of Minnesota (UMN) crowd activity dataset, and achieves good results on both. On the UCSD dataset, the equal error rate is as low as 6.6%, the area under the curve reaches 98.2%, and the F1 score reaches 94.96%. On the UMN dataset, the equal error rate is as low as 7.1%, the area under the curve reaches 93.7%, and the F1 score reaches 94.46%.
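To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of one way such an FCN + LSTM structure could be assembled: convolutional features at two depths, a convolutional LSTM cell per depth for temporal semantics, residual fusion of features with the memory output, a skip connection across depths, and upsampling back to the input resolution for a pixel-level prediction map. The module names, layer sizes, two-depth encoder, and the convolutional LSTM cell are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A basic convolutional LSTM cell: all gates computed by one 3x3 convolution."""

    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class FCNLSTMSketch(nn.Module):
    """Two-depth encoder -> per-depth ConvLSTM -> residual fusion -> skip + upsampling."""

    def __init__(self, in_channels=1, c1=32, c2=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, c1, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(c1, c2, 3, stride=2, padding=1), nn.ReLU())
        self.lstm1, self.lstm2 = ConvLSTMCell(c1), ConvLSTMCell(c2)
        self.up2 = nn.ConvTranspose2d(c2, c1, 4, stride=2, padding=1)  # depth-2 -> depth-1 size
        self.up1 = nn.ConvTranspose2d(c1, c1, 4, stride=2, padding=1)  # back to input size
        self.head = nn.Conv2d(c1, 1, kernel_size=1)                    # pixel-level score map

    def forward(self, clip):
        # clip: (batch, time, channels, height, width); height and width divisible by 4
        state1 = state2 = None
        for t in range(clip.size(1)):
            f1 = self.enc1(clip[:, t])          # shallow spatial features
            f2 = self.enc2(f1)                  # deeper spatial features
            if state1 is None:
                state1 = (torch.zeros_like(f1), torch.zeros_like(f1))
                state2 = (torch.zeros_like(f2), torch.zeros_like(f2))
            state1 = self.lstm1(f1, state1)     # temporal semantics at each depth
            state2 = self.lstm2(f2, state2)
        r1 = f1 + state1[0]                     # residual fusion: features + memory output
        r2 = f2 + state2[0]
        x = self.up2(r2) + r1                   # skip connection across depths
        return torch.sigmoid(self.head(self.up1(x)))


model = FCNLSTMSketch()
scores = model(torch.randn(2, 4, 1, 64, 64))    # toy clip: 2 sequences of 4 grayscale 64x64 frames
print(scores.shape)                              # torch.Size([2, 1, 64, 64])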

Key words: computer vision, video anomaly detection, pixel-level prediction, fully convolutional network (FCN), long short-term memory (LSTM) network
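The equal error rate, area under the curve, and F1 score quoted in the abstract can be illustrated with a short, hedged sketch of how such frame-level metrics are typically computed from anomaly scores and ground-truth labels. This is a generic scikit-learn example, not the paper's evaluation code; the function name and the 0.5 decision threshold are assumptions.

import numpy as np
from sklearn.metrics import roc_curve, auc, f1_score


def evaluate(scores, labels, threshold=0.5):
    """scores: per-frame anomaly scores in [0, 1]; labels: 0/1 ground truth."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)                          # area under the ROC curve
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]       # equal error rate: point where FPR ~ FNR
    f1 = f1_score(labels, scores >= threshold)       # F1 at a fixed decision threshold
    return eer, roc_auc, f1


eer, roc_auc, f1 = evaluate(np.random.rand(200), np.random.randint(0, 2, 200))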

CLC number: