线性投影的高维数据聚类算法研究

作者	吴义稳
姓名汉语拼音	Wu Yiwen
学号	2021000010014
培养单位	兰州财经大学
电话	17826834140
电子邮件	2516482760@qq.com
入学年份	2021-9
学位类别	学术硕士
培养级别	硕士研究生
学科门类	管理学
一级学科名称	管理科学与工程
学科方向	无
学科代码	1201
授予学位	管理学硕士
第一导师姓名	聂飞平
第一导师姓名汉语拼音	Nie Feiping
第一导师单位	兰州财经大学
第一导师职称	教授
题名	线性投影的高维数据聚类算法研究
英文题名	A Study on Clustering Algorithms for High-dimensional Data with Linear Projection
关键词	降维聚类锚点线性投影实例惩罚
外文关键词	Dimensionality reduction ; Clustering ; Anchor ; Linear projection ; Instance penalty
摘要	随着经济社会的发展，数据呈现爆炸性增长，这种增长主要体现在数据的数量和维度。高维数据的样本稀疏且包含大量的冗余特征和噪声信息，其有意义的类簇结构往往嵌入在低维子空间中。对高维数据进行降维是提高聚类算法性能的一个关键步骤。线性投影技术是常用的降维方法，其通过线性变换将高维数据变换到低维子空间中。传统的聚类算法在对高维数据进行聚类时主要存在以下两个问题：（1）对异常样本敏感，缺乏识别和处理异常样本的过程；（2）需要预先对高维数据进行降维处理，算法的时间复杂度通常较高，不适用于高维的大数据集。针对这些问题，本文提出了两个算法。一、提出了带有实例惩罚的投影模糊C均值聚类算法（Projected Fuzzy c-means Clustering Algorithm with Instance Penalty，PCIP）。PCIP算法将聚类任务和降维任务统一到一个目标函数中，迭代地学习线性投影矩阵和隶属度矩阵，同时完成降维和聚类。此外，PCIP算法基于原始数据的分布构建实例惩罚矩阵，为每一个样本分配实例惩罚系数，降低了噪声样本对模型的影响。PCIP算法还通过融合模糊C均值聚类（Fuzzy c-means Clustering，FCM）和主成分分析（Principal Component Analysis，PCA）来构造一个新颖的高维数据聚类模型。PCIP的时间复杂度与样本数量线性相关，可以有效地处理大型数据集。为了验证PCIP算法的有效性，在10个图像数据集上与7个相关的对比算法进行了大量的聚类实验，实验结果证明了PCIP算法能够快速地学习投影矩阵和隶属度矩阵，并在投影子空间上获得良好的聚类效果。二、提出了快速锚点图保持投影算法（Fast Anchor Graph Preserving Projections，FAGPP）。为了减少基于图的降维算法的时间复杂度，FAGPP算法使用锚点图的学习代替邻接图的学习，同时利用原始数据空间的锚点图来指导投影空间中锚点图的学习。FAGPP算法还融合了PCA模型使其不仅可以处理数据的聚类信息，还可以处理数据的全局信息。FAGPP的时间复杂度为O(nmd) ，其中 n表示样本的数量， m表示锚的数量， d表示数据的特征数。为了验证FAGPP算法的有效性，在6个图像数据集上与5个相关的对比算法进行了大量的聚类实验，实验结果证明FAGPP算法能快速地对数据进行降维，降维的数据在投影子空间中具有较好的类簇结构，获得良好的聚类效果。
英文摘要	With the development of economy and society, data shows explosive growth, and this growth is mainly reflected in the number and dimension of data. The samples of high-dimensional data are sparse and contain a lot of redundant features and noise information, and their meaningful class cluster structures are often embedded in low-dimensional subspaces. Dimensionality reduction of high dimensional data is a key step in improving the performance of clustering algorithms. The linear projection technique is a commonly used dimensionality reduction method that transforms the high-dimensional data into a low-dimensional subspace through a linear transformation. Traditional clustering algorithms have the following two main problems when clustering high-dimensional data: (1) they are sensitive to abnormal samples and lack the process of identifying and dealing with abnormal samples; (2) they need to reduce the high-dimensional data beforehand, and the time complexity of the algorithms is usually high, which is not applicable to high-dimensional large data sets. To address these problems, two algorithms are proposed in this paper. (1) A Projected Fuzzy c-means Clustering Algorithm with Instance Penalty (PCIP) is proposed. The PCIP algorithm unifies the clustering task and the dimensionality reduction task into a single objective function, which iteratively learns the linear projection matrix and the membership matrix, and performs dimensionality reduction and clustering simultaneously. In addition, the PCIP algorithm constructs an instance penalty matrix based on the distribution of the original data and assigns an instance penalty coefficient to each sample to reduce the influence of noisy samples on the model. The PCIP algorithm also reduces the influence of noisy samples on the model by merging Fuzzy c-means Clustering (FCM) and Principal Component Analysis (PCA) into a single objective function for clustering and dimensionality reduction. The PCIP algorithm also constructs a novel clustering model for high-dimensional data by fusing Fuzzy c-means Clustering (FCM) and Principal Component Analysis (PCA). The time complexity of the PCIP algorithm is linearly correlated with the number of samples, and it can efficiently handle large data sets. In order to verify the effectiveness of the PCIP algorithm, a large number of clustering experiments are conducted on 10 image datasets with 7 related comparison algorithms, the experimental results demonstrate that the PCIP algorithm is able to learn the projection matrix and the affiliation matrix quickly and obtain good clustering results on the projection space. (2) Fast Anchor Graph Preserving Projections (FAGPP) algorithm is proposed. In order to reduce the time complexity of the graph-based dimensionality reduction algorithm, the FAGPP algorithm uses the learning of anchor graphs instead of neighbor graphs, and at the same time, the anchor graphs in the original data space are used to guide the learning of the anchor graphs in the projection space. The FAGPP algorithm also incorporates the PCA model, making it capable of handling the clustering information and the global information of the data. The time complexity of FAGPP is O(nmd), where n denotes the number of samples, m denotes the number of anchors, and d denotes the number of features of the data. In order to verify the effectiveness of the FAGPP algorithm, a large number of clustering experiments have been carried out on six image data sets with five related comparison algorithms, and the experimental results prove that the FAGPP algorithm can quickly downsize the data, and the downsized data has a better class-cluster structure in the shadow casting space, and good clustering results are obtained.
学位类型	硕士
答辩日期	2024-05-18
学位授予地点	甘肃省兰州市
语种	中文
论文总页数	80
参考文献总数	80
馆藏号	0006294
保密级别	公开
中图分类号	C93/98
文献类型	学位论文
条目标识符	http://ir.lzufe.edu.cn/handle/39EH0E1M/36502
专题	信息工程与人工智能学院
推荐引用方式 GB/T 7714	吴义稳. 线性投影的高维数据聚类算法研究[D]. 甘肃省兰州市. 兰州财经大学,2024.

条目包含的文件		下载所有文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可
2021000010014-吴义稳.pd（3715KB）	学位论文		开放获取	CC BY-NC-SA	浏览下载