
Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn

Journal of Software, 2022,33(4):1338−1353 [doi: 10.13328/j.cnki.jos.006466] http://www.jos.org.cn

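Totally Adaptive 2D Feature Selection Algorithm Based on Discernibility Matrix ∗

XIE Juan-Ying, WU Zhao-Zhong

(School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710119, China)

Corresponding author: XIE Juan-Ying, E-mail: xiejuany@snnu.edu.cn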

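Abstract: The FSIP (feature selection based on information gain and Pearson correlation coefficient) algorithm requires human involvement to pick the feature subset. To address this problem, this paper proposes DFSIP (discernibility based FSIP), a totally adaptive 2D feature selection algorithm based on the discernibility matrix. DFSIP discovers the feature subset fully adaptively: at each step it selects the most important feature among the current features, uses that feature to reduce the discernibility matrix, and eliminates redundant features, so that the optimal feature subset is finally obtained adaptively. A K-ELM classifier is built on the optimal feature subset to evaluate its class-discriminating ability. Experiments on gene datasets, together with comparisons against the FSIP, mRMR, LLE Score, DRJMIM, AVC, and AMID algorithms and statistical significance tests, show that DFSIP automatically selects feature subsets with stronger discriminating ability, and that classifiers built on these subsets achieve very good classification performance.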

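Keywords: discernibility matrix; feature discernibility; feature independence; feature selection; information gain; Pearson correlation coefficient

CLC number: TP18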

Citation format (Chinese): 谢娟英, 吴肇中. 基于可辨识矩阵的完全自适应 2D 特征选择算法. 软件学报, 2022, 33(4): 1338–1353. http://www.jos.org.cn/1000-9825/6466.htm

Citation format (English): Xie JY, Wu ZZ. Totally Adaptive 2D Feature Selection Algorithm Based on Discernibility Matrix. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1338−1353 (in Chinese). http://www.jos.org.cn/1000-9825/6466.htm

Totally Adaptive 2D Feature Selection Algorithm Based on Discernibility Matrix

XIE Juan-Ying, WU Zhao-Zhong

(School of Computer Science, Shaanxi Normal University, Xi’an 710119, China)

Abstract: The FSIP (feature selection based on information gain and Pearson correlation coefficient) algorithm needs a human to determine the borderline that separates the feature subset from the remaining features. To overcome this limitation, this study proposes a totally adaptive 2D feature selection algorithm based on the discernibility matrix, referred to as DFSIP (discernibility based FSIP). DFSIP introduces the discernibility matrix into the feature selection process of FSIP. It first initializes the candidate feature set with all features and constructs the initial discernibility matrix. It then detects the most significant feature in the current candidate feature set, adds it to the feature subset, and uses it to reduce the discernibility matrix. Next, the candidate feature set is updated to the union of the cells of the reduced discernibility matrix, and the most significant feature is again detected, added to the feature subset, and used to reduce the matrix. This process repeats until no feature is left in the candidate feature set. The power of DFSIP is tested on well-known gene expression datasets. It is compared with popular feature selection algorithms, including FSIP, mRMR, LLE Score, DRJMIM, AVC, and AMID, in terms of the performance of the K-ELM classifiers built on the feature subsets that these algorithms detect. In addition, significance tests are carried out to verify whether DFSIP differs significantly from FSIP and from the other compared algorithms. The experimental results demonstrate that DFSIP is superior to the compared algorithms; in particular, it differs significantly from LLE Score, DRJMIM, and AMID. Although there is no significant difference between DFSIP and FSIP, DFSIP outperforms FSIP. It can be concluded that DFSIP can totally adaptively detect feature subsets with sound classification capability.

Key words: discernibility matrix; feature discernibility; feature independence; feature selection; information gain; Pearson correlation coefficient

∗ Funding: National Natural Science Foundation of China (62076159, 61673251, 12031010); National Key Research and Development Program of China (2016YFC0901900); Fundamental Research Funds for the Central Universities (GK202105003); Innovation Funds for Graduate Training (2016CSY009, 2018TS078)

This paper was recommended by Prof. CHEN En-Hong, Assoc. Prof. LI Yu-Feng, and Prof. ZOU Quan, guest editors of the special issue “Robust Machine Learning for Open Scenarios”.

Received 2021-03-10; Revised 2021-07-16; Accepted 2021-08-27; published online by jos 2021-10-26

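With the development of science and technology, interdisciplinary research is growing steadily, and biomedical big-data analysis based on artificial intelligence techniques has attracted the attention of researchers in machine learning, data mining, and other areas of artificial intelligence [1−3]. However, biomedical data are typically high-dimensional with small sample sizes: the number of features far exceeds the number of samples, which causes the curse of dimensionality and brings a large number of redundant or irrelevant features. Removing redundant and irrelevant features is the foundation and first step of analyzing such biomedical data, which has made feature selection one of the current research hotspots [4−8].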

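By search strategy, feature selection methods can be divided into methods based on globally optimal search, methods using random search, and methods using heuristic search. By their relationship with the classifier, they can be divided into Filter, Wrapper, and Embedded methods [9]. Filter methods do not depend on a specific learner; they select features according to a given evaluation criterion to form the feature subset. They are fast, but need a threshold given in advance as the stopping criterion. Distance measures [10−12], consistency measures [13−17], correlation measures [18−21], and information measures [22−32] are the commonly used evaluation criteria of Filter feature selection algorithms [33], for example the Laplacian score [34], Constraint score [20], Fisher score [35], Pearson correlation coefficient [36], mutual information [23], and MIC [32]. Wrapper methods depend on a specific learner and evaluate the classification capability of a feature subset by the classification performance of that learner. They need to split the training set into a training subset and a validation subset, are very time-consuming, and carry a risk of overfitting. Embedded methods also depend on the learner, but unlike Wrapper methods they do not need to split the training set into training and validation subsets: feature selection is carried out while the learner's objective function is optimized. Their drawback is that designing the objective function to optimize is very difficult. Filter methods are widely used and studied because they are fast and do not overfit.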

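To address the drawback that Filter methods need a given threshold, the FSIP (feature selection based on information gain and Pearson correlation coefficient) algorithm [37] proposed a 2D visual feature selection idea based on feature discernibility and independence. It defines the discernibility of a feature by information gain and the independence of a feature by the Pearson correlation coefficient, and constructs a 2D space with discernibility and independence as the horizontal and vertical axes, respectively. All features are displayed in this 2D space, so that features with both strong discernibility and strong independence lie in the upper-right region, far from the features in the lower-right region. To quantify a feature's contribution to classification, the importance of a feature is defined as the product of its discernibility and independence, that is, the area of the rectangle determined by its coordinates, and the features whose contribution to classification is far greater than that of the remaining features are selected to form the feature subset. However, FSIP requires a human to inspect the distribution of the features in the 2D space in order to carry out the selection, so feature selection is not fully automated. This paper therefore proposes the DFSIP (discernibility based FSIP) algorithm, which aims to discover the feature subset fully adaptively and thus to fully automate feature selection. DFSIP introduces the discernibility matrix. It selects the current best feature and uses it to reduce the discernibility matrix; the union of the non-empty cells of the reduced matrix forms the new candidate feature set. The best feature among the candidates is then selected and used to reduce the discernibility matrix again, the non-empty cells of the newly reduced matrix again form the candidate feature set, and the best feature among the current candidates is selected once more. The iteration continues until every cell of the discernibility matrix is empty, that is, until the candidate feature set is empty. At that point, the selected best features form the feature subset. DFSIP upgrades the human-assisted subset selection of FSIP to a fully automatic process, achieving fully adaptive discovery of the feature subset.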
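The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes discrete feature values and the usual rough-set convention that a cell of the discernibility matrix holds the features on which two samples with different class labels differ. It also assumes that "reducing" the matrix with a feature means emptying every cell that contains it. The names `build_discernibility_matrix`, `adaptive_select`, and the `scores` table are hypothetical; `scores` merely stands in for FSIP's importance (discernibility times independence), which this section does not define in detail.

```python
from itertools import combinations


def build_discernibility_matrix(X, y):
    """Map each pair of differently labeled samples to the set of
    feature indices on which the two samples take different values.
    Only non-empty cells are stored (assumed rough-set convention)."""
    cells = {}
    for i, j in combinations(range(len(X)), 2):
        if y[i] != y[j]:
            diff = {f for f in range(len(X[i])) if X[i][f] != X[j][f]}
            if diff:
                cells[(i, j)] = diff
    return cells


def adaptive_select(X, y, importance):
    """Greedy loop: pick the most important candidate feature, reduce the
    matrix with it, rebuild the candidates, stop when no cell is left."""
    cells = build_discernibility_matrix(X, y)
    selected = []
    while cells:
        # Candidate features = union of all non-empty cells.
        candidates = set().union(*cells.values())
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=importance)
        selected.append(best)
        # Assumed reduction rule: a cell containing `best` is already
        # discerned by it, so that cell becomes empty and is dropped.
        cells = {pair: feats for pair, feats in cells.items()
                 if best not in feats}
    return selected


# Toy data: feature 2 is an inverted copy of feature 0 (redundant).
X = [[0, 0, 1],
     [0, 1, 1],
     [1, 0, 0],
     [1, 1, 0]]
y = [0, 1, 2, 2]

# Hypothetical importance scores, standing in for FSIP's importance.
scores = {0: 0.9, 1: 0.5, 2: 0.8}

selected = adaptive_select(X, y, lambda f: scores[f])
print(selected)
```

On this toy table, the loop stops after two picks without any threshold: once features 0 and 1 are selected, every cell is empty, and the redundant feature 2 is never chosen even though its hypothetical score is high.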

1 Information Entropy

1.1 Information entropy

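Entropy is a state function proposed by the physicist Boltzmann [38] in 1877 to describe the state of a system: the more disordered the system, the larger its entropy. It was later adopted in information theory to represent the information content of a system: the more ordered and certain the system, the smaller its information entropy.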

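Suppose an information system has m variables, denoted as the set $U = \{u_1, \ldots, u_m\}$, and $p(u_i)$ is the probability of variable $u_i$. The m variables of the system can then be regarded as the m values of one random variable, and the information entropy of the system can be expressed as the information entropy $H(U)$ of the set U, defined in Eq. (1):

$$H(U) = -\sum_{i=1}^{m} p(u_i)\log p(u_i) = -\sum_{u \in U} p(u)\log p(u) \tag{1}$$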

The larger the value of H(U), the more unstable the system, the greater its uncertainty, and the more information it contains.
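As a small numerical illustration of Eq. (1), the sketch below estimates the probabilities from observed frequencies and sums $-p \log p$ over the distinct values. The base of the logarithm is not stated in this section, so base 2 (entropy in bits) is an assumption here, and the function name `information_entropy` is hypothetical.

```python
import math
from collections import Counter


def information_entropy(values):
    """Entropy H(U) of the empirical distribution of `values`,
    using log base 2 (an assumption; Eq. (1) leaves the base open)."""
    counts = Counter(values)
    n = len(values)
    h = 0.0
    for c in counts.values():
        p = c / n              # empirical probability of one value
        h -= p * math.log2(p)  # accumulate -p * log2(p)
    return h


uniform = information_entropy(["a", "b", "c", "d"])   # four equally likely values
skewed = information_entropy(["a", "a", "a", "b"])    # one value dominates
constant = information_entropy(["a", "a", "a", "a"])  # no uncertainty at all

print(uniform, skewed, constant)
```

The three cases follow the remark on H(U): the uniform variable has the largest entropy, the skewed one a smaller entropy, and the constant one an entropy of zero.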
