上海交通大学学报 ›› 2026, Vol. 60 ›› Issue (1): 112-122. doi: 10.16183/j.cnki.jsjtu.2024.198

• 电子信息与电气工程 •

基于增量加权的概念漂移数据流分类算法

吴勇华1, 梅颖2,3, 卢诚波2,3

  1. 浙江理工大学 计算机科学与技术学院, 杭州 310018
  2. 丽水学院 数学与计算机学院, 浙江 丽水 323000
  3. 浙江得图网络有限公司, 浙江 丽水 323000
  • 收稿日期:2024-05-29 修回日期:2024-08-26 接受日期:2024-09-04 出版日期:2026-01-28 发布日期:2026-01-27
  • 通讯作者: 卢诚波 E-mail:lu.chengbo@aliyun.com.
  • 作者简介:吴勇华(1998—),硕士生,从事数据挖掘研究.
  • 基金资助:
    国家自然科学基金(12171217)

Concept Drift Data Stream Classification Algorithm Based on Incremental Weighting

WU Yonghua1, MEI Ying2,3, LU Chengbo2,3

  1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
  2. School of Mathematics and Computer, Lishui University, Lishui 323000, Zhejiang, China
  3. Zhejiang Detu Network Co., Ltd., Lishui 323000, Zhejiang, China
  • Received: 2024-05-29 Revised: 2024-08-26 Accepted: 2024-09-04 Online: 2026-01-28 Published: 2026-01-27
  • Contact: LU Chengbo E-mail: lu.chengbo@aliyun.com.

摘要:

概念漂移是数据流挖掘中最常见的现象之一,数据流中隐含的知识模式随时间动态变化,导致先前建立的分类器的准确性下降.针对这一问题,提出基于增量加权的概念漂移数据流分类(SCIW)算法.该算法采用启发式的权重更新策略,结合基于准确性差异的自适应方法,同时改进了基于泊松分布的重采样策略.SCIW算法能够适应不同类型的概念漂移,有效缓解了分类器准确率下降的问题.在14个合成数据集和6个真实数据集上的实验结果表明,SCIW算法和自适应随机森林(ARF)算法在准确率方面表现出色,明显优于其他对比算法;SCIW算法在时间和内存消耗方面明显优于ARF算法,总体平均时间消耗约为ARF的83%,总体平均内存消耗约为ARF算法的13%.

关键词: 数据流, 概念漂移, 分类算法, 集成学习, 自适应

Abstract:

Concept drift is one of the most common phenomena in data stream mining: the knowledge patterns underlying a data stream change dynamically over time, which degrades the accuracy of previously built classifiers. To address this issue, we propose a concept drift data stream classification algorithm based on incremental weighting, abbreviated as the SCIW algorithm. The algorithm employs a heuristic weight-updating strategy combined with an adaptive method based on accuracy differences, and it improves the Poisson distribution-based resampling strategy. SCIW can adapt to different types of concept drift, effectively mitigating the decline in classifier accuracy. Experimental results on 14 synthetic datasets and 6 real-world datasets show that SCIW and adaptive random forests (ARF) perform well in terms of accuracy and clearly outperform the other compared algorithms. In addition, SCIW significantly outperforms ARF in time and memory consumption: its overall average time consumption is approximately 83% of that of ARF, and its overall average memory consumption is approximately 13% of that of ARF.
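
The abstract does not spell out the improved resampling strategy. As background only, the following is a minimal sketch of the standard Poisson-based resampling used in online bagging, on which such strategies are typically built: each incoming instance is shown to each ensemble member k times, with k drawn from Poisson(λ). The function names, the default λ = 1 (the classic online bagging choice), and the seed are illustrative assumptions, not the SCIW method itself.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Keep multiplying uniforms until the product falls below e^(-lam).
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def online_bagging_weights(n_learners, lam=1.0, seed=0):
    """Return, for one incoming instance, how many times each ensemble
    member would train on it (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    return [poisson_sample(lam, rng) for _ in range(n_learners)]


if __name__ == "__main__":
    # One instance arriving at an ensemble of 5 base classifiers.
    print(online_bagging_weights(5, lam=1.0, seed=42))
```

A weight of 0 means that member skips the instance, which is what gives the ensemble members different views of the same stream. A larger λ makes each member train more often on each instance.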

Key words: data stream, concept drift, classification algorithm, ensemble learning, adaptive

中图分类号: