Institutional Repository of School of Statistics
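Author | Li Weixin (李唯欣) |
Name in Hanyu Pinyin | LiWeixin |
Student ID | 2021000003015 |
Training institution | Lanzhou University of Finance and Economics (兰州财经大学) |
Telephone | 15714500183 |
Email | 748622180@qq.com |
Year of enrollment | 2021-9 |
Degree category | Professional master's |
Training level | Master's student |
First-level discipline | Statistics |
Discipline code | 0252 |
Degree conferred | Master of Applied Statistics (professional degree) |
First supervisor | Gao Haiyan (高海燕) |
First supervisor's name in Hanyu Pinyin | GaoHaiyan |
First supervisor's institution | Lanzhou University of Finance and Economics (兰州财经大学) |
First supervisor's title | Professor |
Title | 基于多重插补的稀疏函数型数据修复方法研究 |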
English title | Research on Sparse Functional Data Recovery Method Based on Multiple Imputation |
Keywords | Functional data; Missing forest; Multiple imputation; Clustering; Gaussian process |
Keywords (foreign language) | Functional data; Missforest; Multiple imputation; Clustering; Gaussian process |
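Abstract | In the era of big data, advances in science and technology and a growing capacity to collect and store data have made data structures more complex and data forms more diverse. Traditional structured data have expanded from simple point data to interval data, symbolic data, and functional data. Functional data are a class of complex, nonlinearly structured data that are usually presented and stored as functions (curves). Because values are frequently lost during data collection, imputation methods for missing data have become a focus of researchers in China and abroad. Existing traditional imputation methods, however, are not suited to functional data and do not take the latent information in functional data into account during recovery. To address these problems, this thesis first introduces class information to mine the correlation among the data and proposes a functional multiple imputation method that incorporates class information (Missforest Combining Class Information and PACE, CMFP). It also integrates the cross-sectional and longitudinal information in the data to infer missing values and proposes a functional multiple imputation method based on cross-sectional and longitudinal information (Missforest Combining Gaussian Processes, MFGP). The main research content consists of the following two parts:
(1) A functional multiple imputation method incorporating class information (CMFP) is proposed. Within the functional data analysis framework, and building on the missing forest model (MF), the functional imputation method PACE, which is based on principal component analysis through conditional expectation, is used for the initial imputation, and K-means clustering is used to exploit the correlation among samples, yielding a functional multiple imputation method that incorporates class information. Imputation experiments on simulated data show that, at missing ratios from 5% to 55%, the method maintains the accuracy and effectiveness of imputation relative to 7 imputation methods, including Hot.deck, mean imputation, MF, and PACE. An application to stock data further verifies that the imputed data are consistent with actual conditions and patterns.
(2) A functional multiple imputation method based on cross-sectional and longitudinal information (MFGP) is proposed. Imputation based on the missing forest model (MF) is combined with prediction based on Gaussian processes (GP) so that the cross-sectional and longitudinal information of functional data is integrated effectively and imputation accuracy improves. First, MF is applied to the plane data for cross-sectional imputation. Second, GP is used for longitudinal imputation. Then the two sets of imputation results are combined with weights obtained from their computed errors. Finally, imputation experiments on simulated data and a case analysis of stock data show that, at missing ratios from 5% to 55%, MFGP has a significant imputation advantage and high imputation accuracy compared with 7 imputation methods, including Hot.deck, mean imputation, MF, and GP. |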
English abstract | In the era of big data, with the progress of science and technology and the improvement of data collection and storage capacity, data structures have become complex and data forms diverse. Traditional structured data have expanded from simple point data to interval data, symbolic data, and functional data. Functional data are a kind of complex, nonlinearly structured data, often presented and stored in the form of functions (curves). Because data are often missing in the process of data collection, research on imputation methods for missing data has become a focus of domestic and foreign scholars. However, the existing traditional imputation methods are not suitable for functional data, and the potential information of functional data is not considered in the process of data recovery. To solve these problems, this thesis first introduces class information to mine the correlation between sample data and proposes a functional multiple imputation method based on class information (CMFP). It also infers the missing data by integrating the cross-sectional and longitudinal information of the data, and proposes a functional multiple imputation method based on cross-sectional and longitudinal information (MFGP). The main research contents of this thesis include the following two parts:
(1) A functional multiple imputation method based on class information (CMFP) is proposed. Under the framework of functional data analysis, and based on the missforest model MF, the functional imputation method PACE, which is based on conditional expectation principal component analysis, is used for the initial imputation, and a functional multiple imputation method integrating class information is obtained by using K-means clustering to exploit the correlation between samples. The experimental results on simulated data show that, under different missing ratios (5%~55%), this method ensures the accuracy and effectiveness of imputation compared with seven imputation methods, such as Hot.deck, mean imputation, MF, and PACE. In addition, an application to stock data shows that the data imputed by this method accord with the actual situation and patterns.
(2) A functional multiple imputation method based on cross-sectional and longitudinal information (MFGP) is proposed. Combining imputation based on the missforest method MF with prediction based on the Gaussian process GP allows the cross-sectional and longitudinal information of functional data to be integrated effectively, which improves imputation accuracy. First, MF is used for cross-sectional imputation of the plane data. Second, longitudinal imputation is carried out with GP. Then the imputation results are combined with weights obtained from their computed errors. Finally, the imputation experiments on simulated data and the analysis of stock data examples show that, under different missing ratios (5%~55%), the MFGP method has significant imputation advantages and high imputation accuracy compared with seven imputation methods, such as Hot.deck, mean imputation, MF, HFI, and GP. |
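Degree type | Master's |
Defense date | 2024-05-25 |
Place of degree conferral | Lanzhou, Gansu Province |
Language | Chinese |
Total pages | 66 |
Total references | 55 |
Collection number | 0005616 |
Confidentiality level | Public |
Chinese Library Classification number | C8/392 |
Document type | Thesis |
Item identifier | http://ir.lzufe.edu.cn/handle/39EH0E1M/36761 |
Collection | School of Statistics and Data Science |
Recommended citation (GB/T 7714) | 李唯欣. 基于多重插补的稀疏函数型数据修复方法研究[D]. 甘肃省兰州市. 兰州财经大学,2024. |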
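The CMFP idea summarized in the abstract (an initial imputation, K-means clustering of the curves, then imputation that draws on curves of the same class) might be sketched as follows. This is only an illustration under loud assumptions, not the thesis's implementation: a per-curve mean fill stands in for PACE, a within-cluster mean stands in for the missing forest step, only a single imputation pass is shown, and all function names and data are hypothetical.

```python
def initial_fill(curves):
    """Fill each missing value (None) with that curve's own observed mean.
    ASSUMPTION: a simple stand-in for the PACE initial imputation."""
    filled = []
    for curve in curves:
        observed = [v for v in curve if v is not None]
        mean = sum(observed) / len(observed)
        filled.append([mean if v is None else v for v in curve])
    return filled


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means, using the first k points as initial centres."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            for p in points
        ]
        # Update step: each centre becomes the mean of its member curves.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def impute_within_clusters(curves, filled, labels):
    """Replace each missing value with the mean of the values observed at the
    same time point on other curves in the same cluster.
    ASSUMPTION: a stand-in for the missing forest step; falls back to the
    initial fill when no curve in the cluster is observed at that point."""
    result = [list(row) for row in filled]
    for i, curve in enumerate(curves):
        for t, value in enumerate(curve):
            if value is None:
                peers = [other[t] for other, lab in zip(curves, labels)
                         if lab == labels[i] and other[t] is not None]
                if peers:
                    result[i][t] = sum(peers) / len(peers)
    return result


# Hypothetical sparse curves on a common grid; None marks a missing value.
curves = [
    [1.0, 2.0, None, 4.0],
    [10.0, 11.0, 12.0, None],
    [1.2, 2.1, 3.1, 4.1],
    [10.2, None, 12.2, 13.2],
]
filled = initial_fill(curves)
labels = kmeans(filled, k=2)
completed = impute_within_clusters(curves, filled, labels)
print(labels)
print(completed)
```

Clustering on the initially filled curves lets each gap be repaired from curves with a similar shape and level rather than from the whole sample, which is the role the abstract assigns to class information.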
Files in this item |
File name/size | Document type | Version type | Access type | License |
2021000003015.pdf (3013KB) | Thesis | | Open access | CC BY-NC-SA |
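The final MFGP step described in the abstract, combining the cross-sectional (MF) and longitudinal (GP) imputations with weights obtained from their errors, could look roughly like the sketch below. The abstract does not state the weighting formula, so inverse-error weights are an assumption here, and the function names and numbers are hypothetical placeholders rather than output from actual MF or GP models.

```python
import math


def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def inverse_error_weights(err_cross, err_long, eps=1e-12):
    """Weights proportional to 1/error, normalised to sum to 1.
    ASSUMPTION: the source only says the results are weighted by error."""
    inv_cross = 1.0 / (err_cross + eps)
    inv_long = 1.0 / (err_long + eps)
    total = inv_cross + inv_long
    return inv_cross / total, inv_long / total


def combine(cross_vals, long_vals, w_cross, w_long):
    """Pointwise weighted average of the two imputations."""
    return [w_cross * c + w_long * l for c, l in zip(cross_vals, long_vals)]


# Hypothetical validation points: observed values that were masked on purpose
# so that each model's error can be measured.
truth = [1.0, 2.0, 3.0, 4.0]
cross_pred = [1.2, 2.2, 3.2, 4.2]  # placeholder for MF (cross-sectional) output
long_pred = [1.1, 1.9, 3.1, 3.9]   # placeholder for GP (longitudinal) output

w_cross, w_long = inverse_error_weights(rmse(cross_pred, truth),
                                        rmse(long_pred, truth))

# Placeholder estimates from the two models at genuinely missing points.
imputed = combine([5.3, 6.3], [5.0, 6.0], w_cross, w_long)
print(w_cross, w_long, imputed)
```

With this scheme the model that predicts the held-out points more accurately contributes more to the final imputed value, while both sources of information are still used.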