基于_i_K__i_近邻和优化分配策略的密度峰值聚类算法-赵辉，王红军，彭博，龙治国，李天瑞.pdf

上善若水

170

22页

0次

2022-05-19

免费下载

软件学报 ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn

Journal of Software, 2022,33(4):1390−1411 [doi: 10.13328/j.cnki.jos.006462] http://www.jos.org.cn

基于 K 近邻和优化分配策略的密度峰值聚类算法

∗

孙

林

1,2

秦小营

徐久成

薛占熬

(河南师范大学计算机科学与信息工程学院, 河南新乡 453007)

(教育人工智能与个性化学习河南省重点实验室, 河南新乡 453007)

通信作者: 徐久成, E-mail: xjc@htu.edu.cn

摘要: 密度峰值聚类(density peak clustering, DPC)是一种简单有效的聚类分析方法. 但在实际应用中, 对于簇间

密度差别大或者簇中存在多密度峰的数据集, DPC 很难选择正确的簇中心; 同时, DPC 中点的分配方法存在多米

诺骨牌效应. 针对这些问题, 提出一种基于 K 近邻(K-nearest neighbors, KNN)和优化分配策略的密度峰值聚类算

法. 首先, 基于 KNN、点的局部密度和边界点确定候选簇中心; 定义路径距离以反映候选簇中心之间的相似度,

基于路径距离提出密度因子和距离因子来量化候选簇中心作为簇中心的可能性, 确定簇中心. 然后, 为了提升点

的分配的准确性, 依据共享近邻、高密度最近邻、密度差值和 KNN 之间距离构建相似度, 并给出邻域、相似集

和相似域等概念, 以协助点的分配; 根据相似域和边界点确定初始聚类结果, 并基于簇中心获得中间聚类结果.

最后,依据中间聚类结果和相似集, 从簇中心到簇边界将簇划分为多层, 分别设计点的分配策略; 对于具体层次中

的点,基于相似域和积极域提出积极值以确定点的分配顺序, 将点分配给其积极域中占主导地位的簇, 获得最终聚

类结果. 在 11 个合成数据集和 27 个真实数据集上进行仿真实验, 与最新的基于密度峰值的聚类算法作对比, 结果

表明:

所提算法在纯度、F 度量、准确度、兰德系数、调整兰德系数和标准互信息上均表现出良好的聚类性能.

关键词: 密度峰值聚类; K 近邻; 簇中心; 积极值; 分配策略

中图法分类号: TP311

中文引用格式: 孙林, 秦小营, 徐久成, 薛占熬. 基于 K 近邻和优化分配策略的密度峰值聚类算法. 软件学报, 2022, 33(4):

1390–1411. http://www.jos.org.cn/1000-9825/6462.htm

英文引用格式: Sun L, Qin XY, Xu JC, Xue ZA. Density Peak Clustering Algorithm Based on K-nearest Neighbors and Optimized

Allocation Strategy. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1390−1411 (in Chinese). http://www.jos.org.cn/1000-9825/

6462.htm

Density Peak Clustering Algorithm Based on K-nearest Neighbors and Optimized Allocation

Strategy

SUN Lin

1,2

, QIN Xiao-Ying

, XU Jiu-Cheng

, XUE Zhan-Ao

(College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China)

(Key Laboratory of Artificial Intelligence and Personalized Learning in Education of Henan Province, Xinxiang 453007, China)

Abstra ct : The density peak clustering (DPC) algorithm is a simple and effective clustering analysis algorithm. However, in real-world

practical applications, it is difficult for DPC to select the correct cluster centers for datasets with large differences of density among

clusters or multi-density peaks in clusters. Furthermore, the allocation method of points in DPC has a domino effect. To address these

issues, a density peak clustering algorithm based on the K-nearest neighbors (KNN) and the optimized allocation strategy was proposed.

First, the candidate cluster centers using the KNN, densities of points, and boundary points were determined. The path distance was

defined to reflect the similarity between the candidate cluster centers, based on which, the density factor and distance factor were

∗ 基金项目: 国家自然科学基金(62076089, 61976082, 61772176); 河南省科技攻关项目(212102210136)

本文由“面向开放场景的鲁棒机器学习”专刊特约编辑陈恩红教授、李宇峰副教授、邹权教授推荐.

收稿时间: 2021-01-10; 修改时间: 2021-07-16; 采用时间: 2021-08-27; jos 在线出版时间: 2021-10-26

孙林等: 基于 K 近邻和优化分配策略的密度峰值聚类算法

1391

proposed to quantify the possibility of candidate cluster centers as cluster centers, and then the cluster centers were determined. Second,

to improve the allocation precision of points, according to the shared nearest neighbors, high density nearest neighbor, density difference,

and distance between KNN, the similarity measures were constructed, and then some concepts of the neighborhood, similarity set, and

similarity domain were proposed to assist in the allocation of points. The initial clustering results were determined according to the

similarity domains and boundary points, and then the intermediate clustering results were achieved based on the cluster centers. Finally,

according to the intermediate clustering results and similarity set, the clusters were divided into multiple layers from the cluster centers to

the cluster boundaries, for which the allocation strategies of points were designed, respectively. To determine the allocation order of points

in the specific layer, the positive value was presented based on the similarity domain and positive domain. The point was allocated to the

dominant cluster in its positive domain. Thus, the final clustering results were obtained. The experimental results on 11 synthetic datasets

and 27 real datasets demonstrate that the proposed algorithm has sound clustering performance in metrics of the purity, F-measure,

accuracy, Rand index, adjusted Rand index, and normalized mutual information when compared with the state-of-the-art DPC algorithms.

Key words: density peak clustering; K-nearest neighbors (KNN); cluster center; positive value; allocation strategy

随着机器学习模型不断付诸开放应用场景, 其往往面临诸多不适

[1]

. 聚类是机器学习领域中一种重要的

无监督学习方法, 其鲁棒性的提升, 是机器学习方法亟待解决的问题. 聚类算法是处理大数据的关键技术之

一

[2,3]

, 目前可以粗略地分为 5 类: 划分、层次、网格、模型和密度

[4]

. 基于划分的聚类算法需要用户指定簇的

个数, 适用于凸状簇, 对异常点和噪音敏感

[5]

. 层次聚类算法有较高的计算复杂度, 对数据输入顺序敏感

[4]

基于网格的聚类算法遇到数据分布具有局部形状和局部密度的情况时, 性能会受到单元大小、边界的影响

[6]

基于模型的聚类算法的准确性取决于实验概率表达数据的能力

[7]

. 基于密度的聚类算法探测稠密区域, 将高

密度区域中的点识别为簇, 将低密度区域中的点识别为异常点和噪音

[8]

. 因而,基于密度的聚类算法可以识别

任意形状簇, 并且能够有效处理含有噪音的生物数据、多媒体数据、图像文本数据等. 但是, 密度聚类算法也

面临数据集密度可变和参数敏感等问题. 在真实世界中, 大多数数据集由非球形簇或者非椭圆形簇组成, 例

如一些以流形分布的人脸图像、手写数字图像以及超矩形分布的生物数据等. 事实上, 基于点的局部密度的

聚类算法很容易识别任意形状和任意尺寸的簇, 而且可以有效过滤掉噪音数据

[7]

. 近年来, 密度聚类算法吸

引了越来越多学者们的关注. 基于上述发现, 本文聚焦于基于密度策略的聚类算法, 以解决复杂数据尤其是

非球形数据的分类问题.

2014 年, Rodriguez 和 Laio

[7]

提出了一种密度峰值聚类算法(clustering by fast search and find of density

peaks, DPC), 当前已经涌现出了大量的 DPC 变体算法

[9,10]

. 总体来讲, DPC 能够有效探测任意形状的簇, 使用

简单, 具有非常好的鲁棒性, 但是也存在一些缺点: 对于簇间密度差别大或者簇中存在多密度峰的数据集,

DPC 难以为数据集选择到正确的簇中心; DPC 中点的分配方法存在多米诺骨牌效应, 针对该现象, DPC 目前无

有效措施. 近年来, 针对 DPC 存在的问题, 诸多学者已做了大量研究工作. 针对 DPC 不能为密度可变的数据

集选择正确的簇中心以及分配策略鲁棒性差等问题, Liu 等人

[11]

设计了相似度和两步点的分配方法, 提出了基

于共享最近邻的 DPC算法. 但该算法对噪声敏感, 计算复杂度高. 对于带有多密度峰的簇, DPC 难以确定这些

密度峰是否在同一个簇中, 且若一个点被错误分类, 则比这个点密度低的一系列点也容易被错误分类. 为了

解决这些问题, Jiang 等人

[12]

提出了基于 K 最近邻和基尼系数的自适应 DPC 算法. 但是, 该算法在高维数据上

的准确性较低. 对于簇间密度差别大的数据集, DPC 使用绝对密度难以选出正确的簇中心. 受此启发, Hou 等

人

[13]

构建了基于相对密度关系的簇中心识别标准增强 DPC 算法. 为了解决 DPC 不能识别多密度峰的簇以及

密度相对较低的簇的问题, Wang 等人

[14]

基于层次方法设计了一种多中心的 DPC 算法. 事实上, DPC 的分配策

略容易引起连锁反应, 即: 若一个点被分配到错误的簇中, 就很可能导致更多的点被分配到错误的簇中. 于

是, Seyedi 等人

[15]

提出了一种动态的基于图标签传播的 DPC 算法. 但是, 该算法在聚类高维数据时存在诸多

限制. 若 DPC 使用了不合适的截断距离, 则会导致簇中心的错误选择. 在某些情况下, 即使为截断距离设置

正确的值, 也难以在决策图中确定簇中心. 为了解决这个问题, Liu 等人

[16]

提出了一种基于 KNN 和合并策略

的自适应的 DPC 算法. 但是, 该算法在密度可变的数据集上性能较差. 由此, 针对 DPC 算法和现有改进算法

在簇中心的确定和点的分配方面的不足, 为了有效确定簇中心并提升点分配的正确率, 本文提出一种基于 K

of 22

免费下载

软件学报计算机技术

关注

评论