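Author: 秦伟德
Name in Hanyu Pinyin: Qin WeiDe
Student ID: 2020000010010
Institution: Lanzhou University of Finance and Economics
Phone: 15951672201
Email: wade10286@163.com
Year of enrollment: 2020-9
Degree category: Academic master's
Level of study: Master's student
Discipline category: Management
First-level discipline: Management Science and Engineering
Discipline direction:
Discipline code: 1201
Degree conferred: Master's degree
First supervisor: 杨海军
First supervisor's name in Hanyu Pinyin: Yang HaiJun
First supervisor's institution: Lanzhou University of Finance and Economics
First supervisor's title: Professor
Title: 基于深度学习的政务网站人事信息知识图谱构建研究
English title: Research on Knowledge Graph Construction of Personnel Information of Government Affairs Website Based on Deep Learning
Keywords: personnel information; joint extraction; knowledge graph; mini program
English keywords: Personnel Information; Joint Extraction; Knowledge Graph; Mini Programs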
Abstract

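As government informatization advances, government websites generate ever more government data, and personnel information is among the most important of these data and has high research value. On the one hand, people are the core of every unit and department, and information about them underlies every decision. On the other hand, personnel information currently resides mainly in separate texts such as appointment and dismissal notices and government news; these items are scattered and unlinked, so much of their value is wasted. Natural language processing is developing rapidly and provides technical support for processing government personnel data, while knowledge graphs offer an intuitive and effective tool for related academic research and for the field. Deep learning can extract structured triples from unstructured text, and a knowledge graph of government personnel information can be built on them, but problems remain, such as scarce datasets and the insufficient precision of existing models. To address these problems, the main work of this thesis is as follows: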

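1) Building a dataset of personnel information from government websites. A crawler written in Python collects raw text from government websites, and the text is then cleaned. Once the entity types and relation types are defined, the text is annotated and finally converted into a specific format to complete the dataset. The dataset contains 10 relation categories and the corresponding entity categories. It supports the subsequent experiments and can also advance entity and relation extraction for government personnel information.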

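2) Proposing GCN-CasRel, a joint entity and relation extraction model that incorporates dependency parsing. The model builds on the end-to-end cascade binary tagging framework of CasRel. While handling the overlapping triples that occur in government personnel texts, it uses a graph convolutional network to model dependency relations so that it captures syntactic structure better, and it introduces an attention mechanism to filter noise from the dependency tree, improving extraction performance. In experimental comparisons with other joint extraction models, its precision and F1 score both improve considerably, which demonstrates the effectiveness of the model.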

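3) Constructing a knowledge graph of personnel information from government websites. Using the proposed joint extraction model, a WeChat mini program was developed for extracting entities and relations from government personnel information. The mini program consists of three parts: front-end pages, a back-end server, and a data tier. Users can upload text or a URL to obtain the extracted triples directly. On this basis, the Neo4j graph database is used to visualize the personnel triples extracted by the mini program, completing the knowledge graph of personnel information from government websites.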

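This thesis crawls text from government websites and preprocesses it to build a government website personnel information dataset. On top of the end-to-end cascade binary tagging framework, it introduces a graph convolutional network to model dependency relations and uses an attention mechanism to filter noise from the dependency tree; experiments show that the model performs well. A WeChat mini program for extracting entities and relations from government personnel information was developed and applied to a selection of government texts from Gansu Province, and the resulting personnel triples were visualized as a knowledge graph with the Neo4j graph database.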

English Abstract

With the steady progress of government informatization, government websites produce more and more government data, among which personnel information is particularly important and of high research value. On the one hand, people are the core of every unit and department, and information about people is the basis of every decision. On the other hand, personnel information currently exists mainly in separate texts such as appointment and dismissal notices and government news, which are scattered and unlinked, so the value of the information is wasted. Natural language processing is developing rapidly and provides technical support for processing government personnel information, and knowledge graphs provide an intuitive and effective tool for related academic research and for the development of the field. Deep learning can extract structured triples from unstructured text, on which a knowledge graph of government personnel information can be built, but problems remain, such as the lack of datasets and the insufficient precision of existing models. To address these issues, the main work of this thesis is as follows:

(1) Create a dataset of personnel information from government websites. A crawler written in Python fetches the raw text from government websites, and the text is cleaned. After the entity types and relation types are determined, the text is annotated and finally converted into a specific format to complete the dataset. The dataset contains 10 relation categories and the corresponding entity categories. It provides data support for the subsequent experiments and can promote the development of entity and relation extraction for government personnel information.

(2) Propose GCN-CasRel, a joint entity and relation extraction model that incorporates dependency parsing. The model is based on the end-to-end cascade binary tagging framework of CasRel. While solving the overlapping-triple problem in government personnel texts, it uses a graph convolutional network to model dependency relations so that it can better capture syntactic structure, and it introduces an attention mechanism to filter noise from the dependency tree, improving extraction performance. In experimental comparisons with other joint extraction models, its precision and F1 score both improve considerably, which demonstrates the effectiveness of the model.

(3) Construct a knowledge graph of personnel information from government websites. A WeChat mini program for entity and relation extraction is developed using the proposed joint extraction model. The mini program is divided into three parts: front-end pages, a back-end server, and a data tier. Users can obtain the extracted triples directly by uploading text or a URL. On this basis, the Neo4j graph database is used to visualize the personnel triples extracted by the mini program, completing the knowledge graph of personnel information from government websites.
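The step from extracted triples to a graph in Neo4j can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Entity` label, the `name` property, the relation names, and the sample triples are all hypothetical placeholders, and the code only builds Cypher text without connecting to a database.

```python
import re

# Hypothetical triples in (subject, relation, object) form; the names are
# placeholders, not data from the thesis.
TRIPLES = [
    ("Person A", "APPOINTED_AS", "Director"),
    ("Person A", "WORKS_AT", "Bureau B"),
]

# Relationship types are restricted to plain identifiers (an assumption of
# this sketch) because they are written directly into the query text.
_REL_TYPE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def triple_to_cypher(subject, relation, obj):
    """Return a (query, parameters) pair that upserts one triple.

    Cypher does not accept a relationship type as a query parameter, so the
    type is validated and embedded in the text; entity names stay parameters.
    """
    if not _REL_TYPE.fullmatch(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{relation}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


if __name__ == "__main__":
    for triple in TRIPLES:
        query, params = triple_to_cypher(*triple)
        print(query, params)
```

Because `MERGE` matches an existing node before creating a new one, both sample triples attach to the same `Person A` node, which is how scattered statements about one person become linked in the graph. Each query and parameter pair would be passed to whatever Neo4j client the application uses.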

In this thesis, we crawl text from government websites and build a government website personnel information dataset after preprocessing. Based on the end-to-end cascade binary tagging framework, we introduce a graph convolutional network to model dependency relations and use an attention mechanism to filter noise from the dependency tree; experiments show that the model performs well. On the WeChat platform, we developed a mini program for extracting entities and relations from government personnel information, used it to extract entities and relations from a selection of government texts from Gansu Province, and visualized the resulting personnel triples as a knowledge graph with the Neo4j graph database.
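The idea of running a graph convolution over dependency relations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the GCN-CasRel model: the symmetric adjacency with self-loops, the row normalization, and the toy inputs are choices made for this sketch, and the text encoder, the attention mechanism, and the cascade binary taggers are omitted.

```python
def dependency_adjacency(heads):
    """Build a symmetric adjacency matrix with self-loops.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = [[0.0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        adj[i][i] = 1.0  # self-loop keeps each token's own features
        if h >= 0:
            adj[i][h] = 1.0  # edge from token to its head
            adj[h][i] = 1.0  # and back, so information flows both ways
    return adj


def row_normalize(adj):
    """Divide each row by its sum so that aggregation is an average."""
    return [[v / sum(row) for v in row] for row in adj]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj_norm, features, weight):
    """One graph convolution step: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


if __name__ == "__main__":
    heads = [1, -1, 1]  # toy parse: token 1 is the root, 0 and 2 attach to it
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
    weight = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for learned weights
    adj_norm = row_normalize(dependency_adjacency(heads))
    for row in gcn_layer(adj_norm, features, weight):
        print(row)
```

After one step, each token's vector is the average of its own vector and those of its syntactic neighbours. An attention mechanism such as the one the abstract mentions would replace these uniform averaging weights with learned ones, so that unreliable dependency edges contribute less.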

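Degree type: Master's
Defense date: 2023-05-20
Place of degree conferral: Lanzhou, Gansu Province
Research direction: Information Management and Information Systems
Language: Chinese
Total pages: 76
Total references: 75
Holdings number: 0004972
Confidentiality level: Public
Chinese Library Classification number: C93/81
Document type: Thesis
Item identifier: http://ir.lzufe.edu.cn/handle/39EH0E1M/33867
Collection: School of Information Engineering and Artificial Intelligence
Recommended citation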
GB/T 7714
秦伟德. 基于深度学习的政务网站人事信息知识图谱构建研究[D]. 甘肃省兰州市. 兰州财经大学,2023.
Files in this item
File name/size: 1_10741_202000001001 (2449KB); document type: Thesis; access type: Open access; license: CC BY-NC-SA
File name: 1_10741_2020000010010_LW.pdf
Format: Adobe PDF
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.