Low-Rank Approximation of High-Dimensional Matrices Based on Sampling and Its Applications

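Author	任潇潇
Author name (pinyin)	Ren Xiaoxiao
Student ID	2019000003051
Institution	Lanzhou University of Finance and Economics
Phone	17693175027
Email	rx1585298@163.com
Year of enrollment	2019-9
Degree category	Professional master's degree
Level of study	Master's student
First-level discipline	Applied Statistics
Discipline code	0252
First supervisor	牛成英
First supervisor name (pinyin)	Niu Chengying
First supervisor's institution	Lanzhou University of Finance and Economics
First supervisor's title	Professor
Title	基于抽样的高维矩阵低秩逼近及应用研究
English title	Low rank approximation of high dimensional matrix based on sampling and its application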
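Keywords	Nyström method; CUR matrix decomposition method; unequal-probability sampling; unequal-probability adaptive sampling; randomized SVD; relative error; computational complexity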
Foreign-language keywords	Nyström method; Relative error; Computational complexity; CUR matrix decomposition method; Unequal probability sampling; Unequal probability adaptive sampling; Random SVD decomposition
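Abstract	In the era of big data, most massive data sets take the form of high-dimensional matrices, and reducing their dimension has become an active research topic in machine learning. Using sampling to reduce the dimension and computational complexity of high-dimensional data has proved effective, but the errors introduced during dimensionality reduction differ considerably across sampling and matrix reconstruction methods. Starting from the sampling perspective, this thesis studies methods and error measures for low-rank approximation of high-dimensional matrices, aiming to improve approximation accuracy while reducing computational complexity. The main work is as follows. First, for large-scale data sets, the Nyström method is a relatively effective low-rank approximation technique that samples a subset of columns from the original data matrix and uses them to reconstruct a low-rank approximation of it. Because the sampling method strongly affects the accuracy of the reconstructed matrix, the thesis proposes combining the unequal-probability sampling Nyström method with randomized singular value decomposition (SVD) to improve low-rank approximation accuracy during reconstruction while effectively reducing computational complexity. The results show that the proposed Nyström method reconstructs matrices with high accuracy and greatly reduces computational complexity. Second, in the analysis of high-dimensional big-data matrices, approximating the original matrix by a small number of principal components is common practice, but these components are linear combinations of the matrix's rows and columns and are hard to interpret in terms of the original features. The thesis proposes a sampling method for CUR matrix decomposition that combines unequal-probability sampling with adaptive sampling, and combines it with randomized SVD: randomized SVD is applied to the sampled submatrices C and R, which improves the accuracy of the low-rank reconstruction while keeping computational complexity under control. The results show that CUR matrix decomposition based on unequal-probability adaptive sampling and randomized SVD achieves high accuracy and stability in low-rank matrix approximation. Finally, the Nyström method based on unequal-probability sampling and randomized SVD is extended to spectral clustering, with an empirical analysis of financial-ratio data for listed companies. A feature extraction method based on the unequal-probability sampling Nyström method is proposed; by extracting the main indicators that affect listed companies' performance, it reduces the data dimension and computational complexity while retaining as much of the original information as possible, and spectral clustering of the listed companies is then carried out on the selected feature variables. The results show that feature extraction at a sampling ratio of 20% evenly covers the 10 first-level indicator categories of the original data, which suggests that the extracted features are representative. The spectral clustering divides the 73 selected listed companies into 4 classes, and the clustering evaluation criterion gives R²=0.72, indicating a good clustering result. CUR matrix decomposition based on unequal-probability sampling and randomized SVD is also extended to preference feature extraction; because this method samples the original data directly, the extracted features are highly interpretable and have a clear meaning. An empirical test on user-movie rating data shows that the CUR-based preference feature extraction algorithm performs well and that the extracted user or product features reflect the characteristics of the original data well; as the numbers of sampled columns and rows increase, the accuracy of preference feature extraction rises and the compression rate falls; and compared with SVD-based preference feature extraction, the CUR-based method is far more accurate.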
English abstract	In the era of big data, most massive data sets exist in the form of high-dimensional matrices, and how to reduce their dimension has become a hot topic in machine learning. Sampling has proved to be an effective way to reduce the dimension and computational complexity of high-dimensional data, but the errors produced by different sampling and matrix reconstruction methods during dimensionality reduction differ considerably. From the sampling point of view, this thesis studies methods and error measures for low-rank approximation of high-dimensional matrices, focusing on improving the accuracy of the approximation while reducing computational complexity. The main work includes the following aspects. Firstly, the Nyström method is a relatively effective low-rank approximation technique for large-scale data sets; it extracts some columns from the original data matrix and uses them to reconstruct a low-rank approximation of that matrix. Considering that different sampling methods strongly influence the accuracy of matrix reconstruction, a combination of the unequal-probability sampling Nyström method and randomized singular value decomposition (SVD) is proposed to improve low-rank approximation accuracy and reduce computational complexity in matrix reconstruction. The results show that the proposed Nyström method achieves high accuracy in matrix reconstruction and can greatly reduce computational complexity. Secondly, in the analysis of high-dimensional big-data matrices, approximating the original data matrix with a small number of principal components is a common approach. These principal components are linear combinations of the rows and columns of the matrix, so they are difficult to interpret in terms of the original features of the data.
A sampling method suited to CUR matrix decomposition is proposed that combines unequal-probability sampling with adaptive sampling. This sampling method is then combined with randomized singular value decomposition (SVD): randomized SVD is applied to the sampled submatrices C and R, which improves the accuracy of the low-rank reconstruction while keeping computational complexity under control. The results show that the CUR matrix decomposition method based on the combination of unequal-probability adaptive sampling and randomized SVD has high accuracy and stability in low-rank approximation of matrices. Finally, the Nyström method based on unequal-probability sampling and randomized SVD is extended to spectral clustering, and an empirical analysis is carried out using financial-ratio data of listed companies. A feature extraction method based on the unequal-probability sampling Nyström method is proposed. By extracting the main indicators that affect the performance of listed companies, it retains as much of the original information as possible while reducing the data dimension and computational complexity. Spectral clustering of the listed companies is then carried out on the selected feature variables. The results show that feature extraction at a sampling ratio of 20% evenly covers the 10 first-level indicator categories of the original data, indicating that the extracted features are representative. The spectral clustering divides the 73 listed companies selected in this thesis into 4 classes, and the clustering evaluation criterion gives R²=0.72, indicating a good clustering result.
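As a rough illustration of the Nyström construction described in this abstract, the NumPy sketch below samples columns of a symmetric positive semidefinite matrix with unequal probabilities and inverts the intersection block through a randomized SVD. The record does not state the sampling probabilities used in the thesis, so probabilities proportional to squared column norms are an assumption here, and the function names `randomized_svd` and `nystrom_approx` are hypothetical.

```python
import numpy as np


def randomized_svd(A, k, oversample=5, seed=0):
    """Rank-k SVD via a Gaussian range finder (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    ell = min(k + oversample, n)           # sketch width
    omega = rng.standard_normal((n, ell))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)         # orthonormal basis for the sketched range
    B = Q.T @ A                            # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


def nystrom_approx(K, c, k, seed=0):
    """Nystrom approximation K ~ C W_k^+ C^T from c sampled columns.

    Assumption: column i is drawn with probability proportional to its
    squared Euclidean norm; the thesis may use a different scheme.
    """
    rng = np.random.default_rng(seed)
    weights = np.sum(K ** 2, axis=0)
    p = weights / weights.sum()
    idx = rng.choice(K.shape[0], size=c, replace=False, p=p)

    C = K[:, idx]                # sampled columns
    W = K[np.ix_(idx, idx)]      # intersection block

    # Rank-k pseudo-inverse of W built from its randomized SVD.
    U, s, Vt = randomized_svd(W, k, seed=seed)
    keep = s > 1e-12 * s[0]
    inv_s = np.zeros_like(s)
    inv_s[keep] = 1.0 / s[keep]
    W_pinv = (Vt.T * inv_s) @ U.T

    return C @ W_pinv @ C.T, idx
```

Sampling without replacement and without rescaling keeps the sketch short; only the small c-by-c block W is decomposed, which is where the reduction in computational cost comes from.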
The CUR matrix decomposition based on unequal-probability sampling and randomized SVD is extended to preference feature extraction, and an empirical test is performed using user-movie rating data. Because the preference feature extraction method samples the original data directly, the extracted features are highly interpretable and have a clear meaning. The results show that the preference feature extraction algorithm based on the CUR matrix performs well, and the extracted user or product features reflect the features of the original data well. As the numbers of sampled columns and rows increase, the accuracy of preference feature extraction rises and the compression rate falls. The accuracy of the preference feature extraction method based on CUR matrix decomposition is much higher than that of the method based on SVD.
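The CUR decomposition with two-stage (unequal-probability, then adaptive) sampling mentioned in this abstract might look roughly like the sketch below. It rests on several assumptions not confirmed by this record: first-stage probabilities proportional to squared column norms, second-stage probabilities proportional to squared residual column norms, and `numpy.linalg.pinv` standing in for the randomized SVD of C and R. The names `_sample`, `adaptive_columns` and `cur_adaptive` are hypothetical.

```python
import numpy as np


def _sample(weights, size, rng, exclude=None):
    """Draw `size` distinct indices with probability proportional to weights."""
    w = np.array(weights, dtype=float)
    exclude = np.asarray([] if exclude is None else exclude, dtype=int)
    w[exclude] = 0.0
    if np.count_nonzero(w) < size:   # degenerate weights: fall back to uniform
        w = np.ones_like(w)
        w[exclude] = 0.0
    return rng.choice(len(w), size=size, replace=False, p=w / w.sum())


def adaptive_columns(A, c1, c2, rng):
    """Pick c1 columns by squared norm, then c2 more by squared residual norm."""
    first = _sample(np.sum(A ** 2, axis=0), c1, rng)
    C1 = A[:, first]
    # Residual after projecting A onto the span of the first-stage columns.
    residual = A - C1 @ (np.linalg.pinv(C1) @ A)
    second = _sample(np.sum(residual ** 2, axis=0), c2, rng, exclude=first)
    return np.concatenate([first, second])


def cur_adaptive(A, c1, c2, r1, r2, seed=0):
    """CUR approximation A ~ C U R with U = C^+ A R^+."""
    rng = np.random.default_rng(seed)
    cols = adaptive_columns(A, c1, c2, rng)
    rows = adaptive_columns(A.T, r1, r2, rng)  # rows of A are columns of A.T
    C, R = A[:, cols], A[rows, :]
    # Exact pseudo-inverses are used here in place of a randomized SVD.
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows
```

Because C and R are actual columns and rows of the data matrix, the indices in `cols` and `rows` point back to concrete items and users, which is the interpretability advantage over SVD that the abstract refers to.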
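Degree type	Master's
Defense date	2022-05-15
Place of degree conferral	Lanzhou, Gansu Province
Language	Chinese
Total pages	79
Total references	90
Collection number	0004310
Confidentiality level	Public
Chinese Library Classification number	C8/315
Document type	Thesis
Item identifier	http://ir.lzufe.edu.cn/handle/39EH0E1M/32493
Collection	School of Statistics and Data Science
Recommended citation (GB/T 7714)	任潇潇. 基于抽样的高维矩阵低秩逼近及应用研究[D]. 甘肃省兰州市. 兰州财经大学,2022.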

Files in this item
File name/size	Document type	Version type	Access type	License
2019000003051.pdf (9893KB)	Thesis		Open access	CC BY-NC-SA