Institutional Repository of School of Information Engineering and Artificial Intelligence
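Author | 段会雨 |
Name (Hanyu Pinyin) | DuanHuiYu |
Student ID | 2021000010001 |
Training institution | Lanzhou University of Finance and Economics (兰州财经大学) |
Phone | 15225152794 |
Email | duanhuiyu1997@163.com |
Year of enrollment | 2021-9 |
Degree category | Academic master's |
Training level | Master's student |
Discipline category | Management |
First-level discipline | Management Science and Engineering |
Research direction | None |
Discipline code | 1201 |
First supervisor name | 李兵 |
First supervisor name (Hanyu Pinyin) | libing |
First supervisor institution | Lanzhou University of Finance and Economics (兰州财经大学) |
First supervisor title | Professor |
Second supervisor name | 王继奎 |
Second supervisor name (Hanyu Pinyin) | wangjikui |
Second supervisor institution | Lanzhou University of Finance and Economics (兰州财经大学) |
Second supervisor title | Professor |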
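Title | 鲁棒的高置信度样本选择方法在Self-training算法中的研究 (Research on Robust High-Confidence Sample Selection Methods in Self-training Algorithms) |
English title | Robust High-Confidence Sample Selection Method to Self-training Algorithms |
Keywords | self-training; high-confidence samples; local neighbors; global information; data editing |
Foreign-language keywords | self-training; High-confidence samples; Local Neighbors; Global Information; Data Editing |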
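Abstract (translated from Chinese) | Self-training is a well-known framework in semi-supervised learning that trains a classifier using a small amount of labeled data and a large amount of unlabeled data. Selecting high-confidence samples from the large pool of unlabeled samples and adding them to the training set is a key step of self-training algorithms; if misclassified high-confidence samples are used during iterative training, classification errors are amplified. We therefore propose two self-training algorithms that improve high-confidence sample selection: (1) In studying self-training algorithms, we found that many existing methods compute distances between samples using Euclidean distance, which is unsuitable for data with complex distributions. In addition, many existing algorithms cannot fully exploit the spatial structure of the data to mine important information. We therefore propose a robust relative node graph editing self-training algorithm (A Robust Self-training Algorithm based on Relative Node Graph, STRNG). First, STRNG uses block estimation to compute the density and peak of each sample and builds a prototype tree that reveals the latent spatial structure of the data. Next, a relative node graph (RNG) is constructed for each sample. Finally, hypothesis testing is used to remove high-confidence samples that may be mislabeled. We compare STRNG with 4 existing self-training algorithms on 14 public real-world datasets, and the experimental results verify the effectiveness of the proposed algorithm. (2) To address the problems that some existing self-training methods select high-confidence samples based on local information and depend on parameters, we propose a self-training algorithm based on global information filtering (A Self-training Algorithm for Adaptive Local Neighbor Filtering, STALN). First, STALN combines the global and local information of the data to adaptively select the local neighbors of each unlabeled sample. Second, it assigns a label to each unlabeled sample using the information of its local neighbors and compares that label with the one predicted by the base classifier; if the two labels agree, the sample is treated as a high-confidence sample, which effectively identifies mislabeled samples. Finally, these high-confidence samples are added to the training set for iterative training. To evaluate STALN, we compare it with 4 existing self-training algorithms on 18 benchmark datasets, and the experimental results verify its effectiveness. |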
English abstract | Self-training is a well-known framework for semi-supervised learning that uses a small amount of labeled data and a large amount of unlabeled data to train classifiers. Selecting high-confidence samples from a large number of unlabeled samples and adding them to the training set is a crucial step for self-training algorithms. If misclassified high-confidence samples are used during the iterative training process, classification errors will be amplified. Therefore, we propose two self-training algorithms to improve the selection of high-confidence samples: (1) Many existing self-training methods use Euclidean distance to measure the relationships between samples, which is not suitable for datasets with complex structure. In addition, many existing algorithms cannot fully utilize the spatial structure of the dataset to mine important information. Therefore, a robust self-training algorithm (A Robust Self-training Algorithm based on Relative Node Graph, STRNG) is proposed. First, STRNG uses block estimation to calculate the density and peak of each sample in order to construct a prototype tree that reveals the potential spatial structure of the data. Then, a relative node graph (RNG) is constructed for each sample. Finally, hypothesis testing is used to remove high-confidence samples that may be labeled incorrectly. The effectiveness of the proposed algorithm was verified by comparing it with four existing algorithms on 14 publicly available datasets. (2) A self-training algorithm based on global information filtering (A Self-training Algorithm for Adaptive Local Neighbor Filtering, STALN) is proposed. First, STALN combines the global and local information of the data to adaptively select local neighbors for unlabeled samples. Second, it uses the information of the local neighbors to assign labels to unlabeled samples and compares them with the labels predicted by the base classifier. If the labels are consistent, the samples are considered high-confidence samples. Finally, these high-confidence samples are added to the training set for iterative training. The effectiveness of STALN was validated by comparing it with four existing algorithms on 18 publicly available datasets. |
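Degree type | Master's |
Defense date | 2024-05-20 |
Place of degree conferral | Lanzhou, Gansu Province |
Language | Chinese |
Total pages | 87 |
Total references | 73 |
Library holding number | 0006282 |
Confidentiality level | Public |
Chinese Library Classification number | C93/86 |
Document type | Degree thesis |
Item identifier | http://ir.lzufe.edu.cn/handle/39EH0E1M/36762 |
Collection | School of Information Engineering and Artificial Intelligence |
Recommended citation (GB/T 7714) | 段会雨. 鲁棒的高置信度样本选择方法在Self-training算法中的研究[D]. 甘肃省兰州市. 兰州财经大学,2024. |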
Files in this item |
File name/size | Document type | Version type | Access type | License |
2021000010001.pdf (4207KB) | Degree thesis | | Open access | CC BY-NC-SA |
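The abstract describes STALN as accepting an unlabeled sample only when a label derived from its local neighbors agrees with the base classifier's prediction, then retraining on the enlarged set. The following is a minimal sketch of that agreement filter inside a generic self-training loop, not the thesis's implementation. Several things here are assumptions made for illustration: the function names (`nearest_centroid_predict`, `knn_vote`, `self_train`), the nearest-centroid base classifier, the fixed neighbor count `k`, and plain Euclidean distance. STALN as described selects neighbors adaptively from global and local information, which this sketch does not attempt.

```python
import math
from collections import Counter


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in base classifier (assumption): label of the closest class centroid."""
    sums, counts = {}, Counter(train_y)
    for point, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    return min(centroids, key=lambda c: math.dist(centroids[c], x))


def knn_vote(train_x, train_y, x, k):
    """Majority label among the k nearest labeled samples (fixed k is an assumption)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def self_train(labeled_x, labeled_y, unlabeled_x, k=3, max_rounds=10):
    """Self-training loop that keeps a pseudo-label only when the base
    classifier and the neighbor vote agree on it."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for x in pool:
            predicted = nearest_centroid_predict(train_x, train_y, x)
            if predicted == knn_vote(train_x, train_y, x, k):
                accepted.append((x, predicted))   # treated as high-confidence
            else:
                remaining.append(x)               # retried in a later round
        if not accepted:
            break                                 # nothing new to learn from
        for x, label in accepted:
            train_x.append(x)
            train_y.append(label)
        pool = remaining
        if not pool:
            break
    return train_x, train_y, pool
```

Calling `self_train` with a few labeled points per class and a list of unlabeled points returns the enlarged training set together with any samples that never passed the agreement check. Swapping in a different base classifier or neighbor rule only requires replacing the two helper functions.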