Research on credit scoring ensemble model based on improved BalanceCascade method

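Author	刘伟
Name in Hanyu Pinyin	Liu Wei
Student ID	2019000010008
Training institution	兰州财经大学 (Lanzhou University of Finance and Economics)
Phone	17839194542
Email	2637619807@qq.com
Enrollment date	2019-9
Degree category	Academic master's
Training level	Master's student
Discipline category	Management
First-level discipline	Management Science and Engineering
Research direction	None
Discipline code	1201
Degree conferred	Master of Management
First supervisor	韩金仓
First supervisor (Hanyu Pinyin)	Han Jincang
First supervisor's institution	兰州财经大学 (Lanzhou University of Finance and Economics)
First supervisor's title	Professor
Title (original)	基于改进BalanceCascade方法的信用评分集成模型研究
English title	Research on credit scoring ensemble model based on improved BalanceCascade method
Keywords (original)	信用评分; 单一模型; 集成模型; 不平衡处理
Foreign-language keywords	Credit scoring; Single model; Ensemble model; Imbalance processing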
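Abstract	With the rapid growth of consumer finance in recent years, personal credit business has expanded quickly: online lending platforms have multiplied and loan products have diversified to cover nearly every aspect of personal production and daily life. The challenge posed by credit risk has grown correspondingly severe, which makes risk assessment through applicant credit scoring especially important. Many credit scoring models exist, each with its own strengths and weaknesses. Previous studies found that single models train quickly but have poor predictive accuracy and stability; that ensembling suitably chosen base classifiers can reduce prediction error and improve accuracy to some extent; and that, because credit scoring datasets are inherently limited and the positive and negative classes differ greatly in size, the handling of class imbalance also strongly affects model performance. To address these problems, this thesis carries out the following work. Random forest (RF) is used for feature selection. After fitting the data, it measures the importance of every feature; compared with the information value (IV) method commonly used in financial risk control, it avoids binning each feature, yields the importance ranking directly, is simpler to implement, and selects features faster. Based on the feature importance ranking and business logic, features with importance greater than 0.1 are kept, giving 27 features as model inputs. To examine how different types of models perform in practice, four single models that perform well and are widely accepted in credit scoring classification are tested: logistic regression (LR), decision tree (DT), naive Bayes (NB), and support vector machine (SVM). Each of LR, NB, DT, and SVM is then used as the base classifier of a Bagging ensemble to test the classification performance of homogeneous ensembles. Because ensembling different base classifiers lets them complement one another and improves the accuracy of credit scoring classification, a new heterogeneous ensemble model is also built and tested: LR, NB, DT, and SVM serve as candidate base classifiers, data subsets are drawn by bootstrap sampling, and the base classifier with the highest AUC is selected by adaptive voting for the ensemble. To address the imbalance between positive and negative samples in credit scoring datasets, an improved BalanceCascade method is proposed. It draws positive and negative samples to form a balanced dataset and trains an AdaBoost classifier on it, keeping the classification error rate within a set range so that positive samples are removed accurately. An adjustable parameter is then set according to the imbalance ratio, and a fixed proportion of positive samples is removed repeatedly until the ratio of remaining positive to negative samples approaches this parameter. Experiments are run on datasets with different positive-to-negative ratios, in combination with a new layered model, to find the best ratio parameter. Because RF and XGBoost have a clear accuracy advantage in credit scoring, they are chosen as the first-layer base classifiers. The second-layer model should not be too complex, since excessive complexity can cause overfitting on the training set and poor generalization, so the relatively stable single model logistic regression is used in that layer. Experiments on a credit dataset from the Alibaba Tianchi competition show that, with the positive-to-negative sample ratio set to 2, the ensemble model based on the improved BalanceCascade method reaches an accuracy of 0.80, a precision of 0.90, a recall of 0.84, an F1 score of 0.88, and an AUC of 0.74. Compared with the single classification models, the Bagging ensembles, and the heterogeneous ensemble with adaptive AUC-based selection, it performs better and is more stable.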
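The heterogeneous ensemble described in the abstract (bootstrap subsets, with the highest-AUC base classifier kept for each subset) can be sketched roughly as follows. This is an illustration with scikit-learn, not the thesis's implementation: the synthetic data, the number of bootstrap rounds, the use of out-of-bag rows to score each candidate, and the plain soft vote at the end are all assumptions made here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the thesis dataset is not reproduced here).
X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The four candidate base classifiers named in the abstract.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}

rng = np.random.default_rng(0)
n_rounds = 5  # hypothetical number of bootstrap subsets
n = len(X_train)
members = []  # (name, fitted model) kept from each round

for _ in range(n_rounds):
    boot = rng.integers(0, n, size=n)        # bootstrap sample, drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag rows, used only for scoring
    best = None
    for name, proto in candidates.items():
        model = clone(proto).fit(X_train[boot], y_train[boot])
        auc = roc_auc_score(y_train[oob], model.predict_proba(X_train[oob])[:, 1])
        if best is None or auc > best[0]:
            best = (auc, name, model)
    members.append((best[1], best[2]))

# Soft vote: average the predicted probabilities of the selected models.
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for _, m in members], axis=0)
y_pred = (avg_proba >= 0.5).astype(int)

print("selected per round:", [name for name, _ in members])
print("ensemble AUC:", round(roc_auc_score(y_test, avg_proba), 3))
```

Scoring on out-of-bag rows keeps the selection step from rewarding a candidate that merely memorised its own bootstrap sample; a separate validation split would serve the same purpose.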
English abstract	With the rapid development of consumer finance in recent years, personal credit business has also grown quickly: online lending platforms have multiplied and loan products have diversified, covering almost every aspect of personal production and life. At the same time, the challenge from credit risk is becoming more serious, so assessing risk through applicants' credit scores is especially important. Many credit scoring models exist, but each has its own advantages and disadvantages. Previous studies have found that a single model trains quickly but has poor prediction accuracy and stability; that an ensemble of suitably selected base classifiers can reduce prediction error and improve accuracy to a certain extent; and that in practice, because of the limitations of credit scoring datasets, the positive and negative classes differ greatly in size, so the handling of class imbalance also has an important impact on model performance. Based on these issues, this thesis conducts the following research. Random forest (RF) is used for feature selection: after fitting the data it measures the importance of all features and, unlike the information value method commonly used in financial risk control, it avoids binning each feature and yields the feature importance ranking directly, which is simpler to implement and faster. According to the feature importance ranking and business logic, the features with importance greater than 0.1 are selected, giving a total of 27 features as model input variables.
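The random-forest feature selection step can be illustrated with a short sketch. It assumes scikit-learn and a synthetic dataset in place of the thesis data; the cut-off value used here is purely illustrative and is not the thesis's setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a credit dataset (assumption: the real data is not shown here).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=2, weights=[0.8, 0.2], random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: non-negative and normalised to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first

threshold = 0.05  # illustrative cut-off (hypothetical value)
selected = [int(i) for i in ranking if importances[i] > threshold]

for i in ranking[:5]:
    print(f"feature {i}: importance {importances[i]:.3f}")
print("selected feature indices:", selected)
```

No binning is needed at any point: the ranking comes directly from the fitted forest, which is the practical advantage over an information-value workflow that the abstract points to.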
To test how different types of models perform in practice, four single models with good performance and wide acceptance in credit scoring classification were selected for experiments: logistic regression (LR), decision tree (DT), naive Bayes (NB), and support vector machine (SVM). LR, NB, DT, and SVM were then each used as the base classifier of a Bagging ensemble to test the classification performance of homogeneous ensembles. Because different base classifiers can complement one another in an ensemble and improve the accuracy of credit scoring classification, a new heterogeneous ensemble model was also constructed: with LR, NB, DT, and SVM as base classifiers, data subsets are built by bootstrap sampling and the base classifier with the highest AUC is selected by adaptive voting for the ensemble. For the imbalance between positive and negative samples in credit scoring datasets, an improved BalanceCascade method is proposed. It trains an AdaBoost classifier on a balanced dataset formed by drawing positive and negative samples, keeping the classification error rate within a certain range to ensure that positive samples are removed accurately. An adjustable parameter is then set according to the imbalance ratio of positive and negative samples; by repeatedly removing a certain proportion of positive samples, the ratio of remaining positive to negative samples is brought close to this parameter. Experiments are conducted on datasets with different positive-to-negative ratios, combined with a new layered model, to find the optimal ratio parameter.
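One way to read the cascade-style under-sampling described above is sketched below. It is an interpretation under stated assumptions, not the thesis's algorithm: the function name `cascade_undersample`, its parameters, the choice to measure the error rate on the balanced training subset, and the rule of removing the most confidently classified majority rows first are all hypothetical. Following the abstract, the positive class (label 1) is treated as the majority, and `target_ratio` is assumed to be at least 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier


def cascade_undersample(X, y, majority=1, target_ratio=2.0, removal_frac=0.2,
                        max_error=0.3, max_rounds=20, seed=0):
    """Drop well-classified majority rows until majority/minority nears target_ratio."""
    rng = np.random.default_rng(seed)
    major_idx = np.flatnonzero(y == majority)
    minor_idx = np.flatnonzero(y != majority)
    floor = int(np.ceil(target_ratio * len(minor_idx)))  # never go below this count

    for _ in range(max_rounds):
        if len(major_idx) <= floor:
            break
        # Balanced training set: all minority rows plus an equal-sized majority draw.
        draw = rng.choice(major_idx, size=len(minor_idx), replace=False)
        train = np.concatenate([draw, minor_idx])
        clf = AdaBoostClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[train], y[train])

        # Use the classifier for removal only if its error stays within the bound.
        if 1.0 - clf.score(X[train], y[train]) > max_error:
            continue

        col = int(np.flatnonzero(clf.classes_ == majority)[0])
        proba = clf.predict_proba(X[major_idx])[:, col]
        correct = clf.predict(X[major_idx]) == majority
        order = np.argsort(-proba)      # most confidently "majority" first
        order = order[correct[order]]   # keep only correctly classified rows
        n_remove = min(int(removal_frac * len(major_idx)),
                       len(major_idx) - floor, len(order))
        major_idx = np.delete(major_idx, order[:n_remove])

    keep = np.sort(np.concatenate([major_idx, minor_idx]))
    return X[keep], y[keep]


# Synthetic data with roughly nine positives per negative (assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.1, 0.9], random_state=0)
X_bal, y_bal = cascade_undersample(X, y, majority=1, target_ratio=2.0)
ratio = (y_bal == 1).sum() / (y_bal == 0).sum()
print("before:", int((y == 1).sum()), "positive /", int((y == 0).sum()), "negative")
print("ratio after under-sampling:", round(float(ratio), 2))
```

Sweeping `target_ratio` over a few values and retraining the downstream model each time corresponds to the search for the best ratio parameter that the abstract describes.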
Because RF and XGBoost have a clear accuracy advantage in credit scoring, they are chosen as the base classifiers of the first layer. The second-layer model should not be too complex, since excessive complexity may lead to overfitting on the training set and poor generalization, so the more stable single model logistic regression is used for that layer. Experimental results on a credit dataset from the Alibaba Tianchi competition show that, when the ratio of positive to negative samples is set to 2, the credit scoring ensemble model based on the improved BalanceCascade method reaches an accuracy of 0.80, a precision of 0.90, a recall of 0.84, an F1 score of 0.88, and an AUC of 0.74. Compared with the single classification models, the Bagging ensemble models, and the heterogeneous ensemble model with adaptive AUC-based selection, it performs better and is more stable.
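The two-layer structure (tree ensembles in the first layer, logistic regression in the second) resembles what scikit-learn calls stacking, so a minimal sketch is possible under a few assumptions. scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example dependent on one library, the data are synthetic with roughly two positives per negative, and the hyperparameters are placeholders rather than the thesis's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic data with about two positives per negative (assumption).
X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           weights=[1 / 3, 2 / 3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),  # stand-in for XGBoost
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # simple second layer
    stack_method="predict_proba",
    cv=5,  # second layer is trained on out-of-fold first-layer predictions
)
stack.fit(X_train, y_train)

pred = stack.predict(X_test)
score = stack.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "auc": roc_auc_score(y_test, score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Training the second layer on out-of-fold predictions (`cv=5`) is one common guard against the overfitting risk mentioned above; the five metrics printed are the same ones reported in the abstract, computed here for the positive class.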
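Degree type	Master's
Defense date	2022-05-29
Place of degree conferral	Lanzhou, Gansu Province
Language	Chinese
Total pages	73
Total references	63
Holding number	0004261
Confidentiality level	Public
Chinese Library Classification number	C93/67
Document type	Thesis
Item identifier	http://ir.lzufe.edu.cn/handle/39EH0E1M/32398
Collection	School of Information Engineering and Artificial Intelligence (信息工程与人工智能学院)
Recommended citation (GB/T 7714)	刘伟. 基于改进BalanceCascade方法的信用评分集成模型研究[D]. 甘肃省兰州市. 兰州财经大学,2022.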

Files in this item
File name/size	Document type	Version type	Access type	License
10741_2019000010008_ (1855KB)	Thesis		Open access	CC BY-NC-SA