Author: 刘学文
Name (Hanyu Pinyin): Liu Xuewen
Student ID: 2019000010004
Training institution: Lanzhou University of Finance and Economics (兰州财经大学)
Telephone: 15707973465
Email: 1149651417@qq.com
Enrollment year: 2019-09
Degree category: Academic master's
Training level: Master's student
Discipline category: Management
First-level discipline: Management Science and Engineering
Discipline direction:
Discipline code: 1201
Degree conferred: Master of Management
Primary supervisor: 聂飞平
Primary supervisor (Hanyu Pinyin): Nie Feiping
Primary supervisor's institution: Lanzhou University of Finance and Economics
Primary supervisor's title: Professor
Title: 密度峰值优化的分类算法研究
English title: Research on Classification Algorithm with Density Peak Optimization
Keywords: 密度峰值; 分类; 自训练; 不平衡数据分类; 球K均值聚类
English keywords: Density Peaks; Classification; Self-Training; Imbalanced Data Classification; Ball K-Means Clustering
Abstract

Density Peaks Clustering (DPC) is a novel clustering algorithm. DPC uses the density peak information of the data to discover potential cluster centers and, based on the hierarchical relationships between samples, quickly assigns labels to data of arbitrary shape. In recent years it has shown great application value in many fields and has received increasing attention from researchers. This thesis studies the application of density peaks to semi-supervised classification and imbalanced data classification. The main work is as follows:

(1) The key step of a self-training algorithm is selecting high-confidence samples with which to expand the training set. If the pseudo-labels of the selected samples are inaccurate, the performance of the trained classifier degrades. To address this, this thesis uses density peak membership to screen high-confidence samples and proposes a semi-supervised self-training algorithm optimized by density peak membership (STDPM). First, prototypes and direct relative nodes are defined on the basis of density peaks so as to reflect the hierarchical relationships between samples more clearly. Then, a high-confidence sample selection method is proposed: the density peak cluster membership is defined from the prototype relationships between unlabeled samples and the labeled samples in different clusters, and the samples whose membership exceeds a set threshold are selected from the set of unlabeled direct relative nodes. Finally, the selected samples are assigned pseudo-labels by the classifier and used to expand the training set. Comparative experiments against four algorithms on eight public benchmark data sets verify the effectiveness of the high-confidence sample selection method.

(2) Imbalanced data classification algorithms focus on the classification accuracy of the minority class, but when too much information from the majority class samples is lost, the overall classification accuracy drops. To address this, this thesis proposes a ball-cluster-partitioning undersampling algorithm for imbalanced data classification optimized by density peaks (DPBCPUSBoost). First, an undersampling method that retains high-value majority class samples is designed: density peaks are used to find representative samples of the majority class, a ball-cluster partitioning method is used to find the easily misclassified majority class samples in the decision boundary region, and these samples are given higher sampling weights so as to minimize the loss of information. Then, a misclassification cost calculation method that fuses class-dependent and sample-dependent costs is proposed: the misclassification cost of each class is computed from the class distribution, the misclassification cost of every sample is computed from the density peak information, and the two forms of cost are fused into the overall misclassification cost of each sample; the new cost calculation fully accounts for the differences in sample value. Finally, a classifier is trained on the temporary training set, and a cost adjustment function further increases the weights of samples with high misclassification cost. Comparative experiments against four algorithms on ten KEEL benchmark data sets verify the effectiveness of using density peaks to optimize both the undersampling method and the misclassification cost calculation in DPBCPUSBoost.

English abstract

Density Peaks Clustering (DPC) is a novel algorithm that uses the Density Peaks information of data to find potential cluster centers and quickly assigns labels to arbitrarily shaped data based on the hierarchical relationships between samples. In recent years, DPC has shown great application value in many fields and has attracted increasing attention from researchers. This paper focuses on applying Density Peaks to Semi-Supervised Classification and Imbalanced Data Classification. The main work is as follows:
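For reference, a minimal sketch of the standard DPC computation (local density rho, distance delta to the nearest denser point, and label propagation along the density hierarchy) is given below. It illustrates the idea the thesis builds on rather than the thesis' own implementation; the cutoff distance dc and the number of clusters are assumed inputs, and a simple cutoff kernel is used.

```python
# Minimal sketch of the core DPC quantities (Rodriguez & Laio style) with a
# cutoff kernel; an illustration, not the thesis' own implementation.
import numpy as np

def density_peaks(X, dc, n_clusters):
    """X: (n, d) data matrix; dc: cutoff distance; returns rho, delta, labels."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    rho = (dist < dc).sum(axis=1) - 1          # local density (exclude the point itself)
    order = np.argsort(-rho)                   # indices sorted by decreasing density
    delta = np.full(n, dist.max())             # densest point keeps the maximum distance
    nearest_denser = np.full(n, -1)
    for rank, i in enumerate(order[1:], start=1):
        denser = order[:rank]                  # all points ranked denser than i
        j = denser[np.argmin(dist[i, denser])]
        delta[i], nearest_denser[i] = dist[i, j], j
    # cluster centers: largest gamma = rho * delta (the densest point always qualifies)
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # propagate labels down the density hierarchy: each remaining point inherits
    # the label of its nearest denser neighbour
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return rho, delta, labels
```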

(1) Selecting high-confidence samples to expand the training set is the critical step of the Self-Training algorithm. If the pseudo-labels of the selected samples are inaccurate, the performance of the trained classifier degrades. Therefore, we use Density Peaks Membership to select high-confidence samples and propose a Self-Training algorithm optimized by Density Peaks Membership, named STDPM. Firstly, to reflect the hierarchical relationships between samples more clearly, we define Prototypes and Direct Relative Nodes based on Density Peaks. Secondly, we propose a method to select high-confidence samples: it defines the Density Peaks Membership according to the Prototype relationships between unlabeled samples and the labeled samples in different clusters, and then selects, from the set of unlabeled Direct Relative Nodes, the samples whose membership degree is greater than a set threshold. Finally, the classifier assigns pseudo-labels to the selected samples, which are used to expand the training set. Comparative experiments against four algorithms on eight public benchmark data sets verify the effectiveness of the high-confidence sample selection method.
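A hedged sketch of the self-training loop in (1) follows. The membership callable stands in for the thesis' Density Peaks Membership (its exact definition via Prototypes and Direct Relative Nodes is not reproduced here); the k-NN base classifier, the threshold value, and the round limit are illustrative assumptions.

```python
# Generic self-training loop with a density-peak-style confidence filter;
# `membership` is a placeholder for the thesis' Density Peaks Membership.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_train(X_l, y_l, X_u, membership, threshold=0.8, max_rounds=10):
    """X_l, y_l: labeled data; X_u: unlabeled pool; membership(x) -> value in [0, 1]."""
    clf = KNeighborsClassifier(n_neighbors=3)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        # keep only unlabeled samples whose cluster membership exceeds the threshold
        conf = np.array([membership(x) for x in X_u])
        keep = conf > threshold
        if not keep.any():
            break
        # the base classifier assigns pseudo-labels to the selected samples,
        # which are then moved into the training set
        pseudo = clf.predict(X_u[keep])
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~keep]
    return clf.fit(X_l, y_l)
```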

(2) Imbalanced data classification algorithms focus on the classification accuracy of minority class samples. However, when too much information from the majority class samples is lost, the overall classification accuracy decreases. Therefore, this paper proposes a Boosting algorithm for imbalanced data classification, named DPBCPUSBoost, based on Ball Cluster Partitioning and UnderSampling with Density Peaks optimization. Firstly, we design an undersampling method that retains high-value majority class samples. The method finds representative samples in the majority class according to the Density Peaks information, uses Ball Cluster Partitioning to locate the easily misclassified majority class samples in the decision boundary region, and assigns higher sampling weights to these samples to reduce the loss of information. Secondly, we propose a misclassification cost calculation method that fuses class-dependent and sample-dependent costs. The method calculates the misclassification cost of each class according to the class distribution, calculates the misclassification cost of every sample according to the Density Peaks information, and then fuses the two forms of cost into the overall misclassification cost of each sample; the new cost calculation fully considers the differences in sample value. Finally, we train a classifier on the temporary training set and use a cost adjustment function to further increase the weights of samples with high misclassification costs. We compared DPBCPUSBoost with four algorithms on ten KEEL benchmark data sets, and the experimental results verify the effectiveness of using Density Peaks to optimize the undersampling method and the misclassification cost calculation.
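The fusion of class-dependent and sample-dependent misclassification costs in (2) can be illustrated with the minimal sketch below. Using the imbalance ratio for the class-dependent part, the normalised rho * delta product for the sample-dependent part, and a multiplicative fusion are assumptions made for illustration; the thesis' exact formulas may differ.

```python
# Hedged sketch of a fused (class-dependent + sample-dependent) misclassification
# cost; the specific weighting scheme here is an illustrative assumption.
import numpy as np

def fused_cost(y, rho, delta):
    """y: labels (0 = majority, 1 = minority); rho, delta: DPC statistics per sample."""
    n_maj, n_min = (y == 0).sum(), (y == 1).sum()
    # class-dependent cost: misclassifying a minority sample costs more,
    # in proportion to the class-imbalance ratio
    class_cost = np.where(y == 1, n_maj / n_min, 1.0)
    # sample-dependent cost: normalised gamma = rho * delta, so that
    # representative (peak-like) samples carry a higher cost
    gamma = rho * delta
    sample_cost = gamma / gamma.max()
    # fuse the two forms of cost into one overall misclassification cost
    return class_cost * (1.0 + sample_cost)
```

In a boosting setting, costs of this form would typically seed the initial sample weights and be re-applied through the cost adjustment function after each weight update.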

Degree type: Master's
Defense date: 2022-05-29
Degree conferred at: Lanzhou, Gansu Province
Degree major: Management Science and Engineering
Discipline field: Management Science and Engineering (degrees may be conferred in Management or Engineering)
Research direction: Information Management and Information Systems
Language: Chinese
Total pages: 74
Pages with manually pasted figures in the printed copy: 20, 31, 33, 39, 41, 44, 49, 53
Total figures: 10
Total tables: 18
Total references: 85
Accession number: 0004257
Confidentiality level: Open
CLC classification number: C93/63
Document type: Degree thesis
Item identifier: http://ir.lzufe.edu.cn/handle/39EH0E1M/32037
Collection: School of Information Engineering and Artificial Intelligence
Recommended citation:
GB/T 7714:
刘学文. 密度峰值优化的分类算法研究[D]. 兰州: 兰州财经大学, 2022.
Files in this item:
File name / size: 10741_2019000010004_ (2884KB) | Document type: Degree thesis | Open access status: Not yet open | License: CC BY-NC-SA