Institutional Repository of School of Information Engineering and Artificial Intelligence
作者 | 吴义稳 |
姓名汉语拼音 | Wu Yiwen |
学号 | 2021000010014 |
培养单位 | 兰州财经大学 |
电话 | 17826834140 |
电子邮件 | 2516482760@qq.com |
入学年份 | 2021-9 |
学位类别 | 学术硕士 |
培养级别 | 硕士研究生 |
学科门类 | 管理学 |
一级学科名称 | 管理科学与工程 |
学科方向 | 无 |
学科代码 | 1201 |
授予学位 | 管理学硕士 |
第一导师姓名 | 聂飞平 |
第一导师姓名汉语拼音 | Nie Feiping |
第一导师单位 | 兰州财经大学 |
第一导师职称 | 教授 |
题名 | 线性投影的高维数据聚类算法研究 |
英文题名 | A Study on Clustering Algorithms for High-dimensional Data with Linear Projection |
关键词 | 降维 聚类 锚点 线性投影 实例惩罚 |
外文关键词 | Dimensionality reduction ; Clustering ; Anchor ; Linear projection ; Instance penalty |
摘要 | 随着经济社会的发展,数据呈现爆炸性增长,这种增长主要体现在数据的数量和维度。高维数据的样本稀疏且包含大量的冗余特征和噪声信息,其有意义的类簇结构往往嵌入在低维子空间中。对高维数据进行降维是提高聚类算法性能的一个关键步骤。线性投影技术是常用的降维方法,其通过线性变换将高维数据变换到低维子空间中。传统的聚类算法在对高维数据进行聚类时主要存在以下两个问题:(1)对异常样本敏感,缺乏识别和处理异常样本的过程;(2)需要预先对高维数据进行降维处理,算法的时间复杂度通常较高,不适用于高维的大数据集。针对这些问题,本文提出了两个算法。 |
英文摘要 | With the development of economy and society, data shows explosive growth, and this growth is mainly reflected in the number and dimension of data. The samples of high-dimensional data are sparse and contain a lot of redundant features and noise information, and their meaningful class cluster structures are often embedded in low-dimensional subspaces. Dimensionality reduction of high dimensional data is a key step in improving the performance of clustering algorithms. The linear projection technique is a commonly used dimensionality reduction method that transforms the high-dimensional data into a low-dimensional subspace through a linear transformation. Traditional clustering algorithms have the following two main problems when clustering high-dimensional data: (1) they are sensitive to abnormal samples and lack the process of identifying and dealing with abnormal samples; (2) they need to reduce the high-dimensional data beforehand, and the time complexity of the algorithms is usually high, which is not applicable to high-dimensional large data sets. To address these problems, two algorithms are proposed in this paper. (1) A Projected Fuzzy c-means Clustering Algorithm with Instance Penalty (PCIP) is proposed. The PCIP algorithm unifies the clustering task and the dimensionality reduction task into a single objective function, which iteratively learns the linear projection matrix and the membership matrix, and performs dimensionality reduction and clustering simultaneously. In addition, the PCIP algorithm constructs an instance penalty matrix based on the distribution of the original data and assigns an instance penalty coefficient to each sample to reduce the influence of noisy samples on the model. The PCIP algorithm also reduces the influence of noisy samples on the model by merging Fuzzy c-means Clustering (FCM) and Principal Component Analysis (PCA) into a single objective function for clustering and dimensionality reduction. The PCIP algorithm also constructs a novel clustering model for high-dimensional data by fusing Fuzzy c-means Clustering (FCM) and Principal Component Analysis (PCA). The time complexity of the PCIP algorithm is linearly correlated with the number of samples, and it can efficiently handle large data sets. In order to verify the effectiveness of the PCIP algorithm, a large number of clustering experiments are conducted on 10 image datasets with 7 related comparison algorithms, the experimental results demonstrate that the PCIP algorithm is able to learn the projection matrix and the affiliation matrix quickly and obtain good clustering results on the projection space. (2) Fast Anchor Graph Preserving Projections (FAGPP) algorithm is proposed. In order to reduce the time complexity of the graph-based dimensionality reduction algorithm, the FAGPP algorithm uses the learning of anchor graphs instead of neighbor graphs, and at the same time, the anchor graphs in the original data space are used to guide the learning of the anchor graphs in the projection space. The FAGPP algorithm also incorporates the PCA model, making it capable of handling the clustering information and the global information of the data. The time complexity of FAGPP is O(nmd), where n denotes the number of samples, m denotes the number of anchors, and d denotes the number of features of the data. In order to verify the effectiveness of the FAGPP algorithm, a large number of clustering experiments have been carried out on six image data sets with five related comparison algorithms, and the experimental results prove that the FAGPP algorithm can quickly downsize the data, and the downsized data has a better class-cluster structure in the shadow casting space, and good clustering results are obtained. |
学位类型 | 硕士 |
答辩日期 | 2024-05-18 |
学位授予地点 | 甘肃省兰州市 |
语种 | 中文 |
论文总页数 | 80 |
参考文献总数 | 80 |
馆藏号 | 0006294 |
保密级别 | 公开 |
中图分类号 | C93/98 |
文献类型 | 学位论文 |
条目标识符 | http://ir.lzufe.edu.cn/handle/39EH0E1M/36502 |
专题 | 信息工程与人工智能学院 |
推荐引用方式 GB/T 7714 | 吴义稳. 线性投影的高维数据聚类算法研究[D]. 甘肃省兰州市. 兰州财经大学,2024. |
条目包含的文件 | 下载所有文件 | |||||
文件名称/大小 | 文献类型 | 版本类型 | 开放类型 | 使用许可 | ||
2021000010014-吴义稳.pd(3715KB) | 学位论文 | 开放获取 | CC BY-NC-SA | 浏览 下载 |
个性服务 |
查看访问统计 |
谷歌学术 |
谷歌学术中相似的文章 |
[吴义稳]的文章 |
百度学术 |
百度学术中相似的文章 |
[吴义稳]的文章 |
必应学术 |
必应学术中相似的文章 |
[吴义稳]的文章 |
相关权益政策 |
暂无数据 |
收藏/分享 |
除非特别说明,本系统中所有内容都受版权保护,并保留所有权利。
修改评论