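Author: 宋頔
Name (Hanyu Pinyin): Song Di
Student ID: 2019000003037
Training institution: Lanzhou University of Finance and Economics (兰州财经大学)
Phone: 18193192875
Email: 3228872499@qq.com
Year of enrollment: 2019-9
Degree category: Professional master's degree
Training level: Master's student
First-level discipline: Applied Statistics
Discipline code: 0252
Degree awarded: Master of Applied Statistics (professional degree)
First supervisor: 刘明
First supervisor (Hanyu Pinyin): Liu Ming
First supervisor's institution: Lanzhou University of Finance and Economics (兰州财经大学)
First supervisor's title: Professor
Second supervisor: 荣良骥
Second supervisor (Hanyu Pinyin): Rong Liangji
Second supervisor's institution: Gansu Institute of Scientific and Technical Information (甘肃省科学技术情报研究所)
Second supervisor's title: Associate Research Fellow
Title: 基于百度指数的公众关注度对铁路客运量的混频预测研究
English title: Research on Mixing Frequency Prediction of Railway Passenger Traffic by Public Attention Based on Baidu Index
Keywords: TF-IDF; recursive feature elimination; random forest feature importance selection; Lasso regression; MIDAS mixed-frequency data model
Foreign-language keywords: TF-IDF; recursive feature elimination; random forest feature importance selection; Lasso regression; MIDAS mixing data model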
Abstract

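Big-data web search engines play an increasingly prominent role in advancing China's integrated transportation system. This thesis first processes Weibo text to build a keyword lexicon, divided into three classes: basic passenger-transport terms, epidemic-related terms, and month-specific terms. It then constructs public attention indicators from the Baidu Index, taking "百度" (Baidu) as the neutral word and computing a Baidu search-popularity indicator as the measure of public attention. In the empirical part, the public attention indicators are screened with recursive feature elimination and four feature-importance regression methods. Each indicator's share of the total absolute Lasso regression coefficients is taken as its feature importance, and the indicators are reduced in dimension and merged on that basis. The monthly public attention indicator is then restored to daily and weekly data according to the same absolute-coefficient shares, and a mixed-frequency data (MIDAS) model is used to examine the statistical relationship between public attention and railway passenger volume. Mixed-frequency regressions are run under three estimation windows (fixed, rolling, and recursive) and compared with three same-frequency, low-frequency methods: ordinary least squares (OLS), the autoregressive distributed lag model (ARDL), and the generalized autoregressive conditional heteroskedasticity model (GARCH). Forecast performance is measured by root mean squared error (RMSE); to ease comparison, the relative RMSE (rRMSE) is also computed and compared with 1, which leads to the conclusion that in each period the mixed-frequency regressions forecast better than the same-frequency regressions.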
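The weighting step described above (each indicator's share of the total absolute Lasso coefficients) can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficient values, the indicator names, and the `weighted_composite` helper are hypothetical, and the abstract does not spell out exactly how the shares are applied to the daily and weekly series.

```python
def abs_coefficient_shares(coefs):
    """Map each indicator to |coef| / sum of |coef| over all indicators."""
    total = sum(abs(c) for c in coefs.values())
    if total == 0:
        raise ValueError("all coefficients are zero; shares are undefined")
    return {name: abs(c) / total for name, c in coefs.items()}


def weighted_composite(series_by_indicator, shares):
    """Combine equally long indicator series into one share-weighted series.

    Assumption: the composite is a plain weighted sum; the thesis may differ.
    """
    lengths = {len(s) for s in series_by_indicator.values()}
    if len(lengths) != 1:
        raise ValueError("all indicator series must have the same length")
    n = lengths.pop()
    return [
        sum(shares[name] * series_by_indicator[name][t] for name in shares)
        for t in range(n)
    ]


# Hypothetical Lasso coefficients for three illustrative indicators.
coefs = {"ticket_booking": 2.0, "epidemic": -1.0, "spring_festival": 1.0}
shares = abs_coefficient_shares(coefs)

# Hypothetical two-day search-index series for the same indicators.
daily = {
    "ticket_booking": [100.0, 120.0],
    "epidemic": [40.0, 80.0],
    "spring_festival": [20.0, 20.0],
}
composite = weighted_composite(daily, shares)
```

Taking absolute values means a negatively signed indicator still contributes in proportion to its magnitude, which matches the abstract's use of the share as an importance measure rather than as a signed effect.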

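The conclusions concern two aspects: forecast performance by data frequency, and forecast performance by mixed-frequency estimation window. On data frequency, in the first phase (January 2011 to June 2017) the mixed-frequency estimates outperformed the same-frequency models in all 72 comparisons against the three same-frequency methods, an optimization ratio of 100%. Day-month and week-month mixing were each selected 6 times, so either can be chosen and their forecast performance can be regarded as equivalent. In the second phase (July 2017 to January 2020), the mixed-frequency estimates outperformed the same-frequency models in 69 of 72 comparisons, an optimization ratio of 95.83%. Day-month mixing was selected 6 times and week-month mixing 5 times, so day-month mixing, the higher-frequency option, was chosen as the estimation method. In the third phase (February 2020 to August 2021), day-month mixing outperformed the same-frequency models in 29 of 36 comparisons (80.56%) and week-month mixing in 33 of 36 (91.67%), for an arithmetic mean of 86.12%. Each was selected 6 times, so either can be chosen and their forecast performance can again be regarded as equivalent.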

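On estimation windows, either mixing frequency is acceptable in the first phase, but the two mixed-frequency models do not rank the windows identically: for day-month mixing the ranking of forecast performance is rolling window > recursive window > fixed window, while for week-month mixing it is rolling window = recursive window > fixed window. In the second phase the selected frequency is day-month mixing, and both mixed-frequency models rank the windows the same way: rolling window > fixed window > recursive window. In the third phase either mixing frequency is acceptable, and both models again agree: recursive window > rolling window > fixed window. Finally, the arithmetic mean of the optimization ratios across the three periods is 93.98%, above 50%, indicating that mixed-frequency forecasting outperforms the three same-frequency, low-frequency methods in all three phases.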
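For readers unfamiliar with the three estimation windows, the sketch below generates the training index ranges each scheme would use. It follows the conventional definitions (fixed: estimate once on the initial sample; rolling: constant-length window that moves forward; recursive: expanding window), which is an assumption here because the abstract does not define them. The function name and parameters are hypothetical.

```python
def estimation_windows(n_obs, first_window, scheme):
    """Yield (train_start, train_end, forecast_index) triples.

    train_end is exclusive, so the training sample is range(train_start, train_end).
    One triple is produced for every one-step-ahead forecast target.
    """
    if scheme not in {"fixed", "rolling", "recursive"}:
        raise ValueError(f"unknown scheme: {scheme}")
    if not 0 < first_window < n_obs:
        raise ValueError("first_window must be between 1 and n_obs - 1")
    for t in range(first_window, n_obs):
        if scheme == "fixed":
            # Parameters are estimated once on the initial sample and reused.
            yield (0, first_window, t)
        elif scheme == "rolling":
            # The window keeps its length and slides forward with t.
            yield (t - first_window, t, t)
        else:
            # Recursive: the window always starts at 0 and grows with t.
            yield (0, t, t)


fixed = list(estimation_windows(6, 3, "fixed"))
rolling = list(estimation_windows(6, 3, "rolling"))
recursive = list(estimation_windows(6, 3, "recursive"))
```

With six observations and an initial window of three, the rolling scheme drops the oldest month each step while the recursive scheme keeps every past month, which is one reason their rankings can differ across phases.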

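Finally, the thesis forecasts out-of-sample railway passenger volume for September 2021 to February 2022 using the beta and expAlmon estimation methods, again with day-month and week-month mixing. For day-month mixing, the mean squared error under expAlmon is smaller than under beta; for week-month mixing, the mean squared error and the forecasts of the two methods are identical. The thesis therefore recommends the fixed-window mixed-frequency method with expAlmon for forecasting out-of-sample data.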

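Overall, the mixed-frequency models forecast railway passenger volume more accurately than the same-frequency, low-frequency models. Estimation results and forecast performance vary with the estimation window, with the data frequency, and with how the sample is divided into periods; the choice of mixing frequency and estimation window is ultimately made on the basis of forecast error measures, and the best combination is selected accordingly. The conclusions also offer a reference for railway authorities in policy planning and for travelers in planning their routes.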

English abstract

Big-data web search engines are playing an increasingly important role in advancing China's integrated transportation system. This thesis first processes microblog (Weibo) text to select a keyword lexicon, which is divided into three categories: basic passenger-transport words, epidemic-related words, and month-characteristic words. It then builds a public attention index from the Baidu Index, choosing "Baidu" as the neutral word and calculating the Baidu search heat index as the public attention index. In the empirical part, the public attention indices are screened by recursive feature elimination and four feature-importance regression methods, and the share of each index in the total absolute value of the Lasso regression coefficients is used as its feature importance, so that the indices can be combined in reduced dimension. Using the same absolute-coefficient shares, the monthly public attention index is then restored to daily and weekly data, and the statistical relationship between public attention and railway passenger traffic is investigated with a mixed-frequency data (MIDAS) model. Mixed-frequency regressions are estimated with three different estimation windows, namely the fixed window, the rolling window, and the recursive window, and their prediction effect is compared with three same-frequency low-frequency regression methods: ordinary least squares (OLS), the autoregressive distributed lag model (ARDL), and the generalized autoregressive conditional heteroskedasticity model (GARCH). Root mean square error (RMSE) is used to measure the prediction effect, and the relative root mean square error (rRMSE) is calculated to make the comparison easier. The rRMSE values are then compared with 1, and it is concluded that in each period the prediction effect of mixed-frequency regression is stronger than that of same-frequency regression.
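The RMSE and rRMSE comparison described above can be sketched as follows. The direction of the ratio (mixed-frequency RMSE divided by benchmark RMSE, so that a value below 1 favors the mixed-frequency model) is an assumption consistent with "compared with 1" but not stated explicitly in the abstract; the numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_rmse(actual, mixed_pred, benchmark_pred):
    """Assumed definition: RMSE(mixed-frequency) / RMSE(same-frequency benchmark).

    Values below 1 mean the mixed-frequency forecast has the smaller error.
    """
    return rmse(actual, mixed_pred) / rmse(actual, benchmark_pred)


# Hypothetical monthly passenger volumes and two competing forecasts.
actual = [10.0, 12.0, 14.0]
mixed_forecast = [11.0, 12.0, 13.0]      # errors: -1, 0, +1
benchmark_forecast = [12.0, 12.0, 12.0]  # errors: -2, 0, +2

ratio = relative_rmse(actual, mixed_forecast, benchmark_forecast)
mixed_wins = ratio < 1
```

Counting how often `mixed_wins` is true across window and benchmark combinations is one way to arrive at the "optimization ratio" percentages reported in the abstract.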

The results can be described from two aspects: the prediction effect of data frequency and the prediction effect of the mixed-frequency estimation window. First, regarding data frequency, in the first phase from January 2011 to June 2017 the mixed-frequency estimates were superior to the same-frequency models in all 72 estimates made against the three same-frequency estimation methods, an optimization ratio of 100%. The day-month and week-month mixes were each selected 6 times, so either could be selected and the prediction effect of the two could be considered consistent. In the second phase from July 2017 to January 2020, 69 of the 72 mixed-frequency estimates were better than the predictions of the same-frequency models, an optimization ratio of 95.83%. The day-month mix was selected 6 times and the week-month mix 5 times, so the higher-frequency day-month mix was selected as the estimation method. In the third phase from February 2020 to August 2021, 29 of the 36 day-month estimates were better than the predictions of the same-frequency models, an optimization ratio of 80.56%, and 33 of the 36 week-month estimates were better, an optimization ratio of 91.67%; the arithmetic mean of the two is 86.12%. The day-month mix and the week-month mix were each selected 6 times, so either can be selected and the prediction effect of the two can be considered consistent.

Secondly, regarding the prediction effect of the mixed-frequency estimation window, both mixing methods were acceptable in the first phase, but the two models did not rank the estimation windows identically: for the day-month mix the prediction effect was ranked rolling window > recursive window > fixed window, and for the week-month mix it was ranked rolling window = recursive window > fixed window. In the second phase, the selected frequency is the day-month mix, and the two mixed-frequency models rank the estimation windows in the same order, rolling window > fixed window > recursive window. In the third phase, either mixing method can be selected, and the two models again rank the estimation windows in the same order, recursive window > rolling window > fixed window. Finally, the arithmetic mean of the optimization ratios of mixed-frequency prediction over the three periods is 93.98%, which is higher than 50% and shows that mixed-frequency prediction is stronger than the three same-frequency low-frequency estimation methods in all three phases.
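The percentages quoted in the abstract can be checked with a short calculation. The sketch below uses only the counts given in the text and assumes half-up rounding to two decimals at each step, which is what reproduces the reported 86.12% and 93.98%; the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def optimization_ratio(wins, total):
    """Share of comparisons won by the mixed-frequency model, in percent."""
    return (Decimal(wins) * 100 / Decimal(total)).quantize(
        TWO_PLACES, rounding=ROUND_HALF_UP
    )


def mean_pct(values):
    """Arithmetic mean of already rounded percentages, rounded half-up again."""
    return (sum(values, Decimal(0)) / len(values)).quantize(
        TWO_PLACES, rounding=ROUND_HALF_UP
    )


phase1 = optimization_ratio(72, 72)            # all 72 comparisons won
phase2 = optimization_ratio(69, 72)            # 69 of 72
phase3_day_month = optimization_ratio(29, 36)  # 29 of 36
phase3_week_month = optimization_ratio(33, 36) # 33 of 36
phase3 = mean_pct([phase3_day_month, phase3_week_month])
overall = mean_pct([phase1, phase2, phase3])
```

Pooling the third-phase counts instead (62 of 72) would give 86.11% rather than 86.12%, so the reported figure appears to be the mean of the two rounded ratios.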

Finally, this thesis predicts the out-of-sample data, that is, railway passenger traffic from September 2021 to February 2022, using the beta and expAlmon estimation methods, with the mixing frequency still day-month and week-month. For the day-month mixed-frequency prediction, the mean squared error of the expAlmon estimation method is smaller than that of the beta estimation method, while for the week-month mixed-frequency prediction the mean squared error and the prediction results of the two estimation methods are identical. Therefore, it is recommended to use the fixed-window mixed-frequency method with expAlmon to predict the out-of-sample data.
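The names beta and expAlmon usually refer to the lag-weighting functions of a MIDAS regression. The sketch below shows the commonly used normalized forms of the two; this is an assumption about what the thesis means, and the parameter values, function names, and the small `eps` offset are illustrative rather than taken from the source.

```python
import math


def exp_almon_weights(n_lags, theta1, theta2):
    """Normalized exponential Almon weights: w_j proportional to exp(theta1*j + theta2*j^2)."""
    raw = [math.exp(theta1 * j + theta2 * j * j) for j in range(1, n_lags + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def beta_weights(n_lags, a, b, eps=1e-6):
    """Normalized beta weights: w_j proportional to x_j^(a-1) * (1 - x_j)^(b-1).

    x_j is an evenly spaced grid on (0, 1); eps keeps it off the end points.
    """
    if n_lags < 2:
        raise ValueError("beta weights need at least two lags")
    grid = [eps + (1 - 2 * eps) * j / (n_lags - 1) for j in range(n_lags)]
    raw = [x ** (a - 1) * (1 - x) ** (b - 1) for x in grid]
    total = sum(raw)
    return [r / total for r in raw]


# Illustrative parameters: both schemes put more weight on recent days.
almon = exp_almon_weights(30, 0.0, -0.01)
beta = beta_weights(30, 1.0, 5.0)

# With "flat" parameters both schemes collapse to equal weights.
flat_almon = exp_almon_weights(4, 0.0, 0.0)
flat_beta = beta_weights(4, 1.0, 1.0)
```

Both functions map many high-frequency lags onto two parameters, which is what lets a monthly regression use 30 daily observations without estimating 30 separate coefficients.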

In general, the mixed-frequency model predicts railway passenger traffic more accurately than the same-frequency low-frequency models. The choice among the three estimation windows, the data frequency, and the division into periods all change the estimation results and the prediction effect; which mixing frequency and estimation window to adopt is ultimately decided by the prediction error index, and the optimal combination is selected accordingly. The conclusions also provide a reference for railway departments in policy planning and for tourists in planning their travel routes.

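Degree type: Master's
Defense date: 2022-05-15
Place of degree conferral: Lanzhou, Gansu Province
Research direction: Big data analysis
Language: Chinese
Total pages: 108
Number of references: 84
Call number: 0004296
Confidentiality level: Public
Chinese Library Classification number: C8/301
Document type: Thesis
Item identifier: http://ir.lzufe.edu.cn/handle/39EH0E1M/32488
Collection: School of Statistics and Data Science
Recommended citation
GB/T 7714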
宋頔. 基于百度指数的公众关注度对铁路客运量的混频预测研究[D]. 甘肃省兰州市. 兰州财经大学,2022.
Files in this item
File name/size: 2019000003037.pdf (3072KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA