Journal of Shanghai Jiaotong University, 2024, 58(5): 776-782. doi: 10.16183/j.cnki.jsjtu.2022.224

New Type Power System and Integrated Energy


Road Recognition Method of Photovoltaic Plant Based on Improved DeepLabv3+

LI Cuiming, WANG Hua, XU Longer, WANG Long

School of Mechanical and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050, China

Responsible editor: SUN Wei


Foundation items: the Natural Science Foundation of Gansu Province (18JR3RA139), the National Natural Science Foundation of China (51765031)

Received: 2022-06-17   Revised: 2022-07-30   Accepted: 2022-10-17  

About authors

LI Cuiming (1976-), associate professor; her research interests include scene understanding and navigation of mobile robots. E-mail: li_goddess@163.com.


Abstract

Aiming at the problem that a mobile cleaning robot needs to recognize roads accurately and quickly when operating in photovoltaic plants, an improved DeepLabv3+ recognition model is proposed to identify the roads within photovoltaic plants. First, the backbone network of the original DeepLabv3+ model is replaced with an optimized MobileNetv2 network to reduce model complexity. Then, a strategy that combines the fusion of different receptive fields with atrous depthwise separable convolution is employed to enhance the atrous spatial pyramid pooling (ASPP) structure, improving the information utilization of ASPP and the training efficiency of the model. Finally, an attention mechanism is introduced to improve the segmentation accuracy of the model. The results show that the mean pixel accuracy of the improved model is 98.06% and the mean intersection over union is 95.92%, which are 1.79 and 2.44 percentage points higher than those of the DeepLabv3+ basic model, respectively, and also higher than those of the SegNet and UNet models. Furthermore, the improved model has fewer parameters and good real-time performance, and thus better supports road recognition for mobile cleaning robots in photovoltaic plants.

Keywords: photovoltaic plants; road recognition; DeepLabv3+ model; attention mechanism; MobileNetv2


Cite this article as:


LI Cuiming, WANG Hua, XU Longer, WANG Long. Road Recognition Method of Photovoltaic Plant Based on Improved DeepLabv3+[J]. Journal of Shanghai Jiaotong University, 2024, 58(5): 776-782 doi:10.16183/j.cnki.jsjtu.2022.224

As the main facilities for harvesting solar energy, photovoltaic (PV) plants have a generation efficiency that is closely tied to the performance of the whole PV industry. PV modules are the core components of a solar PV system, but in the windy and sandy environment of northwest China, dust readily accumulates on their surfaces and reduces generation efficiency, so mobile cleaning robots are needed to clean them in time. Accurate recognition of the road regions of a PV plant is the prerequisite for a cleaning robot to carry out cleaning operations there. The roads of PV plants in northwest China are mostly flanked by weeds and gravel, the road surfaces are uneven, and PV modules pose obstructions, all of which make road recognition very difficult for the cleaning robot. Research on recognizing the completely unstructured roads of PV plants is therefore particularly important.

The essence of road recognition is segmenting the road region of an image. Traditional unstructured road recognition methods mainly exploit the feature information of road images to achieve road segmentation. In Ref. [1], texture orientations were extracted with Gabor filters and a local soft-voting strategy was used to estimate the vanishing point and obtain the road region. In Ref. [2], color features were used to over-segment the image globally, and the over-segmented regions were then merged to complete road segmentation. In Ref. [3], road images were processed with an RGB-entropy method and the road region was extracted by an improved region-growing method. These methods can extract road regions in specific scenes, but they are easily affected by external environmental factors: when road and weather conditions change considerably, the images become difficult to segment, so such methods cannot satisfy the high robustness required for road recognition by a mobile cleaning robot.

The segmentation evaluation of the above methods focuses mostly on pixel-level accuracy, whereas a cleaning robot requires not only high recognition accuracy but also a small number of model parameters and good real-time performance. Therefore, a road recognition method for PV plants based on the DeepLabv3+ basic model is proposed. An optimized lightweight MobileNetv2 network replaces Xception, the backbone of the original DeepLabv3+ model, to reduce the number of parameters and speed up inference. A strategy combining the fusion of different receptive fields with atrous depthwise separable convolution is used to improve the atrous spatial pyramid pooling (ASPP) structure, increasing the correlation between information from different receptive fields, the information utilization of the dilated convolution layers, and the training efficiency of the model. A convolutional block attention module (CBAM) is introduced into the encoder to retain more useful image edge features and improve the accuracy of feature extraction, thereby achieving segmentation of PV plant roads. In addition, the improved DeepLabv3+ model segments the completely unstructured roads of PV plants well while balancing segmentation accuracy and real-time performance, so it can also be applied to image recognition of other unstructured and structured road scenes, such as field roads and highways, and provide technical support for them.

1 DeepLabv3+ basic model

The DeepLabv3+ model adds a decoder to DeepLabv3 to build an encoder-decoder network, with Xception as the backbone. In the encoder, the original image is first fed into the backbone to extract features; the low-level features are passed to the decoder, while the high-level features are fed into the ASPP structure. The feature maps obtained from a 1×1 convolution (Conv), dilated convolutions with different dilation rates, and average pooling are concatenated and fused, and a 1×1 convolution then reduces the number of feature channels. In the decoder, the high-level features from the encoder are upsampled by a factor of 4 and fused with the low-level features extracted by the backbone; after a 3×3 convolution and another 4× upsampling, the segmented image is output. The network structure of the DeepLabv3+ basic model is shown in Fig. 1, where r denotes the spacing between elements of the dilated convolution (the dilation rate).

Fig. 1   Network structure of the DeepLabv3+ basic model
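The encoder-decoder wiring described above can be summarized in a short sketch. This is a minimal illustration using the tf.keras functional API rather than the authors' code: the branch width of 256, the dilation rates (6, 12, 18), bilinear upsampling, and a fixed (static) input size are common DeepLabv3+ conventions assumed here, and batch normalization is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, rates=(6, 12, 18), filters=256):
    """ASPP: a 1x1 conv, parallel dilated 3x3 convs, and image-level pooling,
    concatenated and fused by a final 1x1 conv (as in Fig. 1)."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=r, activation="relu")(x))
    # image-level pooling branch, resized back to the feature-map size
    pool = layers.Lambda(lambda t: tf.reduce_mean(t, axis=[1, 2], keepdims=True))(x)
    pool = layers.Conv2D(filters, 1, activation="relu")(pool)
    pool = layers.UpSampling2D(size=(int(x.shape[1]), int(x.shape[2])),
                               interpolation="bilinear")(pool)
    y = layers.Concatenate()(branches + [pool])
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(y)

def decoder(high, low, num_classes=2):
    """Fuse the 4x-upsampled encoder output with low-level backbone features,
    refine with a 3x3 conv, and upsample 4x to the input resolution."""
    high = layers.UpSampling2D(4, interpolation="bilinear")(high)
    low = layers.Conv2D(48, 1, padding="same", activation="relu")(low)
    y = layers.Concatenate()([high, low])
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(y)
    logits = layers.Conv2D(num_classes, 1, padding="same")(y)
    return layers.UpSampling2D(4, interpolation="bilinear")(logits)
```

With an output stride of 16, the 4× upsampled ASPP output matches the stride-4 low-level features, which is exactly the fusion drawn in Fig. 1.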


2 Improved DeepLabv3+ model

The DeepLabv3+ basic model is improved in the following three aspects: ① the original Xception backbone is replaced with an optimized MobileNetv2 network to reduce the number of parameters; ② the ASPP structure is improved by combining the fusion of different receptive fields with atrous depthwise separable convolution, raising its information utilization and the training efficiency of the model; ③ the CBAM attention mechanism is introduced to improve recognition accuracy. The network structure of the improved DeepLabv3+ model is shown in Fig. 2, and the training parameters and performance are given in Appendix A.

Fig. 2   Network structure of the improved DeepLabv3+ model


2.1 Optimization of the MobileNetv2 network

MobileNetv2 is a lightweight convolutional neural network proposed by Sandler et al.[13] Building on MobileNetv1[14], it introduces inverted residuals and linear bottleneck layers, which resolve the bottleneck caused by the fixed number of input-layer convolution kernels in the depthwise separable convolutions of MobileNetv1, while keeping the model small and efficient.

Based on the original MobileNetv2 network structure, only the first eight layers of MobileNetv2 are used, which reduces the consumption of computing resources. In addition, the 7th and 8th layers use dilated convolutions to extract features, and the stride of the 7th layer is set to 1, which improves the segmentation accuracy of the MobileNetv2 network. The structure of the optimized MobileNetv2 network is listed in Table 1, where t is the channel expansion factor, c the number of output channels, n the number of repetitions, s the stride, and r the dilation rate.

Tab. 1   Network structure of the optimized MobileNetv2

Input | Layer | Output stride | t | c | n | s | r
224×224×3 | conv2d | 2 | - | 32 | 1 | 2 | 1
112×112×32 | bottleneck | 2 | 1 | 16 | 1 | 1 | 1
112×112×16 | bottleneck | 4 | 6 | 24 | 2 | 2 | 1
56×56×24 | bottleneck | 8 | 6 | 32 | 3 | 2 | 1
28×28×32 | bottleneck | 16 | 6 | 64 | 4 | 2 | 1
28×28×64 | bottleneck | 16 | 6 | 96 | 3 | 1 | 1
14×14×96 | bottleneck | 16 | 6 | 160 | 3 | 1 | 2
7×7×160 | bottleneck | 16 | 6 | 320 | 1 | 1 | 4

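As a concrete reading of Table 1, the sketch below builds the truncated backbone: only the first eight layers are kept, the 7th layer keeps stride 1, and the 7th and 8th layers use dilation rates of 2 and 4. It assumes the tf.keras API, omits batch normalization for brevity, and the tap point for the decoder's low-level features is an assumption; it illustrates the table rather than reproducing the authors' implementation.

```python
from tensorflow.keras import layers

def bottleneck(x, t, c, s, r=1):
    """MobileNetv2 inverted residual: 1x1 expansion, 3x3 (possibly dilated)
    depthwise convolution, 1x1 linear projection, with a residual skip when
    the stride is 1 and the channel count is unchanged."""
    in_c = int(x.shape[-1])
    y = x
    if t > 1:                                   # t = 1 means no expansion
        y = layers.Conv2D(t * in_c, 1, padding="same", use_bias=False)(y)
        y = layers.ReLU(6.0)(y)
    y = layers.DepthwiseConv2D(3, strides=s, dilation_rate=r,
                               padding="same", use_bias=False)(y)
    y = layers.ReLU(6.0)(y)
    y = layers.Conv2D(c, 1, padding="same", use_bias=False)(y)   # linear bottleneck
    if s == 1 and in_c == c:
        y = layers.Add()([x, y])
    return y

def truncated_mobilenetv2(inputs):
    """First eight layers of Table 1: layer 7 keeps stride 1, and layers 7-8
    use dilation rates 2 and 4, so the overall output stride stays at 16."""
    x = layers.Conv2D(32, 3, strides=2, padding="same", use_bias=False)(inputs)
    x = layers.ReLU(6.0)(x)                                       # layer 1
    x = bottleneck(x, 1, 16, 1)                                   # layer 2
    x = bottleneck(x, 6, 24, 2); x = bottleneck(x, 6, 24, 1)      # layer 3
    low_level = x        # assumed tap point for the decoder's low-level features
    x = bottleneck(x, 6, 32, 2)                                   # layer 4
    for _ in range(2):
        x = bottleneck(x, 6, 32, 1)
    x = bottleneck(x, 6, 64, 2)                                   # layer 5
    for _ in range(3):
        x = bottleneck(x, 6, 64, 1)
    for _ in range(3):
        x = bottleneck(x, 6, 96, 1)                               # layer 6
    for _ in range(3):
        x = bottleneck(x, 6, 160, 1, r=2)                         # layer 7: stride 1, dilated
    x = bottleneck(x, 6, 320, 1, r=4)                             # layer 8
    return low_level, x
```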


2.2 ASPP structure improved by combining different-receptive-field fusion with atrous depthwise separable convolution

Depthwise separable convolution is introduced into the ASPP structure of the DeepLabv3+ basic model, so that the dilated convolutions are replaced by atrous depthwise separable convolutions. At the same time, the feature map output by the previous dilated convolution branch in the ASPP is concatenated with the original feature map along the channel dimension and fed into the next dilated convolution branch for feature extraction. This fuses different receptive fields and increases both the correlation between information from different receptive fields and the information utilization of the convolution layers. The atrous depthwise separable convolution with different-receptive-field fusion is shown in Fig. 3, where DS denotes depthwise separable convolution.

Fig. 3   Atrous depthwise separable convolution with different-receptive-field fusion
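Under the description above, the improved ASPP can be sketched as follows: each dilated branch is a depthwise separable convolution whose input is the original feature map concatenated with the output of the previous (smaller-rate) branch, so receptive fields are fused level by level. The tf.keras API, the branch width of 256, and the dilation rates (6, 12, 18) are assumptions for illustration, and the image-level pooling branch of the standard ASPP is omitted here for brevity.

```python
from tensorflow.keras import layers

def improved_aspp(x, rates=(6, 12, 18), filters=256):
    """Improved ASPP of Fig. 3: cascaded, dilated depthwise separable branches.
    Each branch sees the original features concatenated with the previous
    branch's output (different-receptive-field fusion)."""
    prev = None
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        inp = x if prev is None else layers.Concatenate()([x, prev])
        prev = layers.SeparableConv2D(filters, 3, padding="same",
                                      dilation_rate=r, activation="relu")(inp)
        branches.append(prev)
    y = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(y)
```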


Atrous depthwise separable convolution means that the dilated convolution is computed with a depthwise separable convolution, separating the channel and spatial computations. A depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution, as illustrated in Fig. 4.

Fig. 4   Standard convolution and depthwise separable convolution


Assume that the input feature map has a size of $D_i \times D_i$ and $N$ channels, the convolution kernel size is $D_j \times D_j$, and the output feature map has $K$ channels. The computational cost of a standard convolution is

$$Q_1 = D_i \times D_i \times N \times K \times D_j \times D_j$$

and the computational cost of a depthwise separable convolution is

$$Q_2 = D_i \times D_i \times N \times D_j \times D_j + N \times K \times D_i \times D_i$$

The ratio of $Q_2$ to $Q_1$ is therefore

$$\frac{Q_2}{Q_1} = \frac{D_i D_i N D_j D_j + N K D_i D_i}{D_i D_i N K D_j D_j} = \frac{1}{K} + \frac{1}{D_j^2}$$

From the above derivation, the depthwise separable convolution clearly requires less computation. Improving the ASPP structure with the strategy that combines different-receptive-field fusion and atrous depthwise separable convolution therefore not only preserves a sufficiently large receptive field and makes dense use of multi-scale information, but also yields fewer parameters and better feature representation.
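As a quick numerical check of the ratio derived above, the snippet below plugs in illustrative values: a 3×3 kernel ($D_j = 3$) on a 14×14 feature map with $N$ = 320 input channels and $K$ = 256 output channels (numbers chosen only for illustration, not taken from the paper).

```python
Di, Dj, N, K = 14, 3, 320, 256
Q1 = Di * Di * N * K * Dj * Dj                   # standard convolution
Q2 = Di * Di * N * Dj * Dj + N * K * Di * Di     # depthwise + pointwise convolution
print(Q2 / Q1)             # 0.11502..., identical to the closed form below
print(1 / K + 1 / Dj**2)   # 0.11502...
```

In this setting the separable form needs roughly 8.7 times fewer multiply-accumulate operations than the standard convolution.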

2.3 DeepLabv3+ encoder module with attention mechanism

CBAM is a lightweight attention module proposed by Woo et al.[15] It sequentially infers attention weights along two independent dimensions, channel and spatial, and multiplies them with the input features to refine the features adaptively. Its structure is shown in Fig. 5.

Fig. 5   Structure of CBAM


The channel attention module (CAM) applies average pooling (AvgPool) and max pooling (MaxPool) to the input feature $F$, producing the features $F^c_{\mathrm{avg}}$ and $F^c_{\mathrm{max}}$. These are passed through a multilayer perceptron (MLP), and a sigmoid function ($\sigma$) yields the weight coefficients $M_c$. Finally, the weights are multiplied with the feature $F$ to obtain the new feature $F'$. With $W_1$ and $W_0$ denoting the parameters of the two MLP layers, the channel attention is computed as

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))\big)$$

The spatial attention module (SAM) applies average pooling and max pooling to the feature $F'$ along the channel dimension, producing the channel descriptors $F^s_{\mathrm{avg}}$ and $F^s_{\mathrm{max}}$, which are stacked. A 7×7 convolution then adjusts the number of channels, and a sigmoid function yields the weight coefficients $M_s$. Finally, $F'$ is multiplied with the weights to obtain the new feature. The spatial attention is computed as

$$M_s(F') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\big) = \sigma\big(f^{7\times 7}([F^s_{\mathrm{avg}}; F^s_{\mathrm{max}}])\big)$$

The channel and spatial attention modules of CBAM filter the extracted features, so that the information retained during encoding is more conducive to accurate segmentation. CAM emphasizes learning the channel information that has a major influence on the network, and SAM better captures positional relationships, thereby improving the learning ability of the network. After CBAM is added to the encoder module of the DeepLabv3+ model, the network learns the features of PV plant road scenes well, suppresses redundant information, and better extracts road edge features, making feature extraction in the DeepLabv3+ encoder more efficient and accurate.
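The following is a minimal CBAM sketch matching the two equations above, written with the tf.keras functional API. The channel-reduction ratio of 8 in the shared MLP is an assumption (the text does not specify it), and batch normalization is omitted; this illustrates the module rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam(x, reduction=8):
    """Channel attention (shared MLP over average- and max-pooled descriptors)
    followed by spatial attention (7x7 conv over channel-wise avg/max maps)."""
    c = int(x.shape[-1])
    # channel attention M_c
    shared_mlp = tf.keras.Sequential([layers.Dense(c // reduction, activation="relu"),
                                      layers.Dense(c)])
    avg = layers.GlobalAveragePooling2D()(x)
    mx = layers.GlobalMaxPooling2D()(x)
    mc = layers.Activation("sigmoid")(layers.Add()([shared_mlp(avg), shared_mlp(mx)]))
    x = layers.Multiply()([x, layers.Reshape((1, 1, c))(mc)])     # F' = M_c * F
    # spatial attention M_s
    avg_map = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_map = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    ms = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg_map, max_map]))
    return layers.Multiply()([x, ms])                             # M_s * F'
```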

3 Experiments and results

3.1 Dataset construction

The images used in this paper were collected at a PV plant in Wuwei, Gansu Province, by photographing the road scenes of the plant with a camera from different positions and angles. The image resolution is 1280 pixels × 720 pixels, and 1600 images were collected in the field. To improve the generalization ability of the network, the PV plant road scene dataset was augmented by changing brightness and contrast and by rotation, yielding an augmented dataset of 2400 images, of which 1680 form the training set and 720 the test set. The road region to be recognized in each image was annotated manually with the open-source labelling tool Labelme.
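A sketch of the augmentation step described above (brightness and contrast changes plus rotation) is given below, using OpenCV. The augmentation ranges are not stated in the paper, so the values here are assumptions; the same geometric transform is applied to the Labelme mask so that image and label stay aligned.

```python
import cv2
import numpy as np

def augment(image, mask):
    """Randomly adjust brightness/contrast and rotate an image-mask pair."""
    # brightness/contrast: out = alpha * image + beta
    alpha = np.random.uniform(0.8, 1.2)      # contrast factor (assumed range)
    beta = np.random.uniform(-20, 20)        # brightness shift (assumed range)
    image = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    # rotation applied identically to image and label mask
    h, w = image.shape[:2]
    angle = np.random.uniform(-15, 15)       # degrees (assumed range)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
    mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return image, mask
```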

3.2 Experimental environment and parameter settings

The hardware platform is an AMD R7-5800H 3.2 GHz processor, an NVIDIA GeForce RTX 3060 GPU, and 16 GB of memory; the software environment is Windows 10, CUDA 10.0, TensorFlow 1.13.2, Keras 2.1.5, Anaconda3, and Python 3.7. During training, the initial learning rate is set to 0.0001, the batch size to 4, and the cross-entropy loss is used as the loss function.
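The training configuration stated above can be written as follows. The learning rate, batch size, and loss follow the text; the choice of the Adam optimizer is an assumption (the paper does not name the optimizer), and the tf.keras API is used only for illustration.

```python
import tensorflow as tf

INIT_LR = 1e-4      # initial learning rate from the text
BATCH_SIZE = 4      # batch size from the text

optimizer = tf.keras.optimizers.Adam(learning_rate=INIT_LR)              # optimizer assumed
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)      # cross-entropy loss

# model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
# model.fit(train_images, train_masks, batch_size=BATCH_SIZE, validation_data=...)
```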

3.3 Evaluation metrics

To objectively evaluate the performance of the network models in PV plant road recognition, the mean pixel accuracy (MPA) and the mean intersection over union (MIoU) are used as evaluation metrics. Assume there are $n+1$ classes ($n$ target classes and 1 background class); $m_{jj}$ is the number of correctly classified pixels of class $j$, $m_{jk}$ is the number of pixels that belong to class $j$ but are classified as class $k$, and $m_{kj}$ is the number of pixels that belong to class $k$ but are classified as class $j$.

(1) MPA. For each class, the ratio of correctly classified pixels to all pixels belonging to that class is computed, and these ratios are averaged:

$$\mathrm{MPA} = \frac{1}{n+1}\sum_{j=0}^{n}\frac{m_{jj}}{\sum_{k=0}^{n} m_{jk}}$$

(2) MIoU. For each class, the ratio of the intersection to the union of the predicted result and the ground truth is computed, and the ratios are averaged:

$$\mathrm{MIoU} = \frac{1}{n+1}\sum_{j=0}^{n}\frac{m_{jj}}{\sum_{k=0}^{n} m_{jk}+\sum_{k=0}^{n} m_{kj}-m_{jj}}$$
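Both metrics can be computed directly from a confusion matrix. The short sketch below follows the two formulas above, with rows indexed by the ground-truth class $j$ and columns by the predicted class $k$; the example numbers are purely illustrative.

```python
import numpy as np

def mpa_miou(conf):
    """conf[j, k] = number of pixels of true class j predicted as class k."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                                    # m_jj
    pa = tp / conf.sum(axis=1)                            # per-class pixel accuracy
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp) # per-class IoU
    return pa.mean(), iou.mean()

# Illustrative 2-class example (road / background):
conf = np.array([[950, 50],
                 [30, 970]])
print(mpa_miou(conf))    # -> (0.96, approx. 0.923)
```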

3.4 Results and analysis

To verify the advantages of the improved DeepLabv3+ model over the DeepLabv3+ basic model, the models before and after improvement were trained with the same training parameters and dataset; the loss curves ($f_{\mathrm{loss}}$) are shown in Fig. 6. Overall, the training losses of both models gradually stabilize as the number of iterations increases, but the improved DeepLabv3+ model converges better and fluctuates less in the later stage.

Fig. 6   Curves of loss change


After training, images from the test set of the PV plant road scene dataset were used to test the segmentation performance of the models before and after improvement; the comparison is shown in Fig. 7. The DeepLabv3+ basic model produces missed and incorrect segmentation in PV plant road scene images, and its segmentation of image edges is unsatisfactory. The improved DeepLabv3+ model segments the road region better, retains more road detail, and recognizes edges more clearly and accurately.

Fig. 7   Comparison of segmentation results between the improved DeepLabv3+ model and the basic model


To further evaluate the performance of the improved DeepLabv3+ for PV plant road segmentation, the SegNet, UNet, and original DeepLabv3+ models were trained on the PV plant road scene dataset and then tested on the test set to obtain the MPA, MIoU, and inference time, which were compared with those of the improved DeepLabv3+ model proposed in this paper. This allows the accuracy, number of parameters, and time complexity of the different semantic segmentation models for PV plant road recognition to be compared; the results are listed in Table 2.

Tab. 2   Comparison of precision, number of parameters, and inference time of different models

Model | MPA/% | MIoU/% | Inference time per image/ms | Total parameters × 10⁻⁶
SegNet | 93.84 | 91.42 | 121 | 14.86
UNet | 94.73 | 92.05 | 125 | 17.30
Original DeepLabv3+ | 96.27 | 93.48 | 156 | 41.25
Improved DeepLabv3+ | 98.06 | 95.92 | 112 | 2.28



As shown in Table 2, when segmenting PV plant roads, the improved DeepLabv3+ model achieves a mean pixel accuracy 4.22, 3.33, and 1.79 percentage points higher than SegNet, UNet, and the original DeepLabv3+, respectively, and a mean intersection over union 4.50, 3.87, and 2.44 percentage points higher. At the same time, the number of parameters of the improved DeepLabv3+ model is 94% smaller than that of the original DeepLabv3+, 84% smaller than that of SegNet, and 86% smaller than that of UNet, and its average inference time per image is the shortest.

4 Conclusion

To enable a mobile cleaning robot to accurately recognize the roads of a PV plant, an improved DeepLabv3+ recognition model is proposed. An optimized MobileNetv2 network replaces Xception as the backbone to reduce the number of model parameters; the ASPP structure is improved by combining different-receptive-field fusion with atrous depthwise separable convolution, raising the information utilization of the dilated convolution layers and the training speed; and CBAM is introduced to retain more road edge information and improve recognition accuracy. The improved model is compared with other models in terms of training loss, road scene prediction, MIoU, MPA, number of parameters, and inference time per image. Experiments on the PV plant road scene dataset show that the MIoU and MPA of the improved model are 95.92% and 98.06%, respectively, both better than those of the comparison models, while its number of parameters and average inference time per image are the smallest, demonstrating excellent recognition performance. Future work will address multi-class segmentation when PV plant roads contain obstacles such as large stones, deep pits, or maintenance staff, annotate more complex PV plant road scenes to enlarge the training data and further improve the robustness of the improved DeepLabv3+ model, and apply the results to road recognition for mobile cleaning robots.

The appendix is available in the online version of this journal (xuebao.sjtu.edu.cn/article/2024/1006-2467/1006-2467-58-05-0776.shtml).

References

[1] KONG H, AUDIBERT J Y, PONCE J. General road detection from a single image[J]. IEEE Transactions on Image Processing, 2010, 19(8): 2211-2220.

[2] FANG Hao, JIA Rui, LU Jiapeng. Segmentation of full vision images based on colour and texture features[J]. Transactions of Beijing Institute of Technology, 2010, 30(8): 934-939 (in Chinese).

[3] WU Huayue, DUAN Liren. Unstructured road detection method based on RGB entropy and improved region growing[J]. Journal of Jilin University (Engineering and Technology Edition), 2019, 49(3): 727-735 (in Chinese).

[4] SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651.

[5] BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.

[6] RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham, Switzerland: Springer, 2015: 234-241.

[7] ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 6230-6239.

[8] CHEN L C, ZHU Y K, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the European Conference on Computer Vision. Cham, Switzerland: Springer, 2018: 833-851.

[9] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-01-01)[2021-04-08]. https://arxiv.org/abs/1706.05587.

[10] CHOLLET F. Xception: Deep learning with depthwise separable convolutions[C]//IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 1251-1258.

[11] BAHETI B, INNANI S, GAJRE S, et al. Semantic scene segmentation in unstructured environment with modified DeepLabV3+[J]. Pattern Recognition Letters, 2020, 138: 223-229.

[12] LIU R R, HE D Z. Semantic segmentation based on Deeplabv3+ and attention mechanism[C]//2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference. Chongqing, China: IEEE, 2021: 255-259.

[13] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]//IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 4510-4520.

[14] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-04-17)[2021-04-08]. https://arxiv.org/abs/1704.04861.

[15] WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision. Cham, Switzerland: Springer, 2018: 3-19.