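Author: Li Bing (李冰)
Name (Hanyu Pinyin): libing
Student ID: 2020000010006
Institution: Lanzhou University of Finance and Economics (兰州财经大学)
Phone: 15735044227
Email: libing9704@163.com
Year of enrollment: 2020-9
Degree category: Academic master's
Level: Master's student
Discipline category: Management
First-level discipline: Management Science and Engineering
Research direction:
Discipline code: 1201
First advisor: Nie Feiping (聂飞平)
First advisor name (Hanyu Pinyin): niefeiping
First advisor's institution: Northwestern Polytechnical University (西北工业大学)
First advisor's title: Professor
Title: 数据编辑的Self-training分类算法研究
English title: Research on Self-training classification algorithm with data editing
Keywords: semi-supervised; self-training; data editing; classification algorithm
Foreign-language keywords: Semi-supervision; Self-training; Data editing; Classification algorithm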
Abstract

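As society and the economy develop, datasets keep growing, yet only a small fraction of the data is labeled, and labeling is time-consuming and expensive. Semi-supervised classification algorithms learn from a small number of labeled samples together with a large number of unlabeled samples. Self-training, a classic semi-supervised learning framework, has become a research focus, but its performance depends mainly on the selection of high-confidence sample points: once noisy samples appear during iteration, they severely degrade classification performance. To handle noisy or mislabeled samples, researchers have proposed many semi-supervised classification algorithms based on data editing. However, most editing algorithms use Euclidean distance to measure the distance between samples, and their time complexity is no lower than O(n²), which makes them unsuitable for large-scale, high-dimensional datasets.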
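The self-training loop described above can be sketched as follows. This is a minimal illustration, not the thesis's algorithm: the nearest-centroid base classifier, the softmax-style confidence score, the 0.8 threshold, and the names `class_centroids`, `predict_with_confidence`, and `self_train` are all assumptions chosen to keep the example short.

```python
import math

def class_centroids(X, y):
    """Mean vector of each class in the labeled set."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [sum(col) / len(pts) for col in zip(*pts)]
            for c, pts in groups.items()}

def predict_with_confidence(x, centroids):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    dists = {c: math.dist(x, m) for c, m in centroids.items()}
    d_min = min(dists.values())
    # Shift by the smallest distance so the exponentials cannot all underflow.
    weights = {c: math.exp(-(d - d_min)) for c, d in dists.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10):
    """Repeatedly move high-confidence unlabeled points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        centroids = class_centroids(X_lab, y_lab)
        scored = [(x, *predict_with_confidence(x, centroids)) for x in pool]
        confident = [(x, lab) for x, lab, conf in scored if conf >= threshold]
        if not confident:
            break  # nothing left that the classifier is sure about
        for x, lab in confident:
            X_lab.append(x)
            y_lab.append(lab)
        pool = [x for x, _, conf in scored if conf < threshold]
    return class_centroids(X_lab, y_lab), pool

X_lab = [(0.0, 0.0), (10.0, 10.0)]
y_lab = [0, 1]
X_unlab = [(0.5, 0.2), (9.5, 9.8), (5.0, 5.0)]
final_centroids, leftover = self_train(X_lab, y_lab, X_unlab)
print(leftover)
```

Any pseudo-label accepted in one round shapes the classifier in the next round, which is why a single noisy high-confidence sample can propagate and why data editing is applied before or during selection.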

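In summary, existing semi-supervised self-training algorithms have two problems. First, when selecting high-confidence sample points they do not handle noisy samples, and their time complexity is high. Second, Euclidean distance is prone to the curse of dimensionality on high-dimensional datasets. To address these two problems, this thesis proposes two algorithms.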
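The second problem can be made concrete with a small experiment, given here as a sketch under assumed settings (uniform random points in the unit cube, 200 points, a fixed seed; `relative_contrast` is a name introduced only for this illustration). It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; this ratio tends to shrink as the dimension grows, so Euclidean distance separates neighbors from non-neighbors less and less.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Print the contrast for increasing dimensionality.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```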

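(1) A semi-supervised self-training algorithm with fast ball-cluster partition editing (EBSA) is proposed. EBSA divides the dataset into stable regions and controversial regions and, on that basis, introduces a ball-cluster partition editing algorithm that identifies mislabeled sample points within the stable regions and edits them, improving the quality of the selected high-confidence sample points. In each iteration EBSA only needs to compute the distance between each sample point and the ball-cluster centers, so the computation is small and fast. Experimental results show that, compared with the baseline algorithms, EBSA runs faster and also achieves better classification performance.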
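As a rough illustration of why editing against ball centers is cheap, the sketch below builds one ball per class and relabels a point only when it lies inside exactly one ball. The abstract does not define the ball clusters or the stable and controversial regions, so every concrete choice here is an assumption made for illustration and is not EBSA itself: one ball per class, the mean member-to-center distance as the radius, "inside exactly one ball" as the stable-region test, and the names `class_balls` and `edit_labels`.

```python
import math

def class_balls(X, y):
    """One ball per class: center = class mean, radius = mean distance to center."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    balls = {}
    for label, pts in groups.items():
        center = [sum(col) / len(pts) for col in zip(*pts)]
        radius = sum(math.dist(p, center) for p in pts) / len(pts)
        balls[label] = (center, radius)
    return balls

def edit_labels(X, y, balls):
    """Relabel points that fall inside exactly one ball (assumed 'stable' region)."""
    edited = []
    for x, label in zip(X, y):
        inside = [c for c, (center, r) in balls.items()
                  if math.dist(x, center) <= r]
        if len(inside) == 1:
            edited.append(inside[0])  # stable: adopt the label of the enclosing ball
        else:
            edited.append(label)      # controversial: leave the label untouched
    return edited

X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]  # the last point is deliberately mislabeled
edited = edit_labels(X, y, class_balls(X, y))
print(edited)
```

Each point is compared only with the ball centers, so one editing pass in this sketch costs a number of distance computations proportional to the number of points times the number of balls, rather than to the square of the number of points.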

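(2) A block estimation nearest neighbor editing self-training algorithm (MDSF) is proposed. MDSF uses a dissimilarity measure to compute the distance between samples and defines a block estimation neighborhood relationship, from which it builds a block estimation neighborhood graph. On that basis, a block estimation neighborhood editing algorithm is proposed to edit the data, improving the quality of the selected high-confidence samples. Because it uses a dissimilarity measure, the algorithm performs well on high-dimensional datasets. Extensive experiments show that MDSF clearly outperforms comparable algorithms on high-dimensional datasets.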
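The abstract does not specify the dissimilarity measure or the block estimation neighborhood relationship, so they are not reproduced here. To show the general idea of neighbor-based editing with a non-Euclidean measure, the sketch below swaps in the classic edited-nearest-neighbor rule (drop a sample whose label disagrees with the majority of its k nearest neighbors) together with cosine dissimilarity. Both substitutions, the value k = 3, and the function names are assumptions for illustration; vectors are assumed to be nonzero.

```python
import math
from collections import Counter

def cosine_dissimilarity(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def edited_nearest_neighbor(X, y, k=3, dissim=cosine_dissimilarity):
    """Return indices of samples whose label matches the majority of their k neighbors."""
    keep = []
    for i, x in enumerate(X):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: dissim(x, X[j]))
        votes = Counter(y[j] for j in others[:k])
        majority = votes.most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep

X = [(1, 0.1), (1, 0.0), (1, 0.2),
     (0.1, 1), (0.0, 1), (0.2, 1),
     (1, 0.05)]
y = [0, 0, 0, 1, 1, 1, 1]  # the last sample is deliberately mislabeled
kept = edited_nearest_neighbor(X, y)
print(kept)
```

The `dissim` parameter is the only place the measure enters, so a different dissimilarity function can be passed in without changing the editing rule.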

English abstract

With the development of society and the economy, datasets are becoming ever larger, while only a small fraction of the data is labeled and annotation is time-consuming and expensive. Semi-supervised classification algorithms can learn from a small number of labeled samples and a large number of unlabeled samples. Self-training, a classic semi-supervised learning framework, has become a research hotspot, but its performance depends mainly on the selection of high-confidence sample points. Once noisy samples appear in the iterative process, the classification performance of the algorithm is greatly degraded. To deal with noisy or mislabeled samples, researchers have proposed many semi-supervised classification algorithms based on data editing. However, most editing algorithms use Euclidean distance to measure the distance between samples, and their time complexity is no less than O(n²), so they are unsuitable for large-scale, high-dimensional datasets.

In summary, existing semi-supervised self-training algorithms have two problems. First, they do not handle noisy samples when selecting high-confidence sample points, and their time complexity is high. Second, Euclidean distance is prone to the curse of dimensionality on high-dimensional datasets. In response to these two problems, this thesis proposes two algorithms.

(1) A semi-supervised self-training algorithm with fast ball-cluster partition editing (EBSA) is proposed. EBSA divides the dataset into stable regions and controversial regions. Based on this, a ball-cluster partition editing algorithm is proposed to identify and edit mislabeled sample points in the stable regions, improving the quality of high-confidence sample selection. In each iteration, EBSA only needs to compute the distance between each sample point and the ball-cluster centers, which requires little computation and is fast. Experimental results show that, compared with the baseline algorithms, EBSA not only runs faster but also achieves better classification performance.

(2) A block estimation nearest neighbor editing self-training algorithm (MDSF) is proposed. MDSF uses a dissimilarity measure to compute the distance between samples and defines a block estimation neighborhood relationship, from which it constructs a block estimation neighborhood graph. Based on this, a block estimation neighborhood editing algorithm is proposed to edit the data, improving the quality of high-confidence sample selection. Because it uses a dissimilarity measure, the algorithm performs well on high-dimensional datasets. Extensive experimental results show that MDSF performs significantly better than comparable algorithms on high-dimensional datasets.

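Degree type: Master's
Defense date: 2023-05-20
Place of degree conferral: Lanzhou, Gansu Province
Language: Chinese
Total pages: 80
Total references: 84
Call number: 0004968
Confidentiality level: Public
Chinese Library Classification number: C93/77
Document type: Thesis
Item identifier: http://ir.lzufe.edu.cn/handle/39EH0E1M/34306
Collection: School of Information Engineering and Artificial Intelligence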
Recommended citation
GB/T 7714
李冰. 数据编辑的Self-training分类算法研究[D]. 甘肃省兰州市. 兰州财经大学,2023.
Files in this item
File name/size: 2020000010006.pdf (3852KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA