Entity and attribute extraction of terrorism events based on a text corpus

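  • Abstract: Based on semantic role analysis, this paper proposes a triple-based method for extracting the entity attributes of terrorism-related events, providing technical support for monitoring and early warning of terrorism-related activity in cyberspace. First, text corpus data from the "Anti-Terrorism Information Site" of the Northwest University of Political Science and Law are collected and cleaned as preprocessing. A naive Bayes text classification algorithm identifies terrorism-related event texts, and the keyword extraction algorithm TF-IDF (term frequency-inverse document frequency) is used to build a terrorism-specific lexicon, which is combined with natural language processing techniques to produce a part-of-speech-tagged terrorism-specific lexicon. Then, through semantic role analysis and syntactic dependency analysis, four types of terrorism-related triple structures are extracted: subject-predicate-object relations, postposed-attributive verb-object relations, person name/place name/organization, and preposition-object relations with subject-predicate verb-complement structure. Finally, using regular expressions and the part-of-speech-tagged terrorism-specific terms, six types of entity attributes are extracted from the four types of triple short texts: time of occurrence, location, casualties, attack method, weapon type and terrorist organization. In experiments on the 4221 collected articles, the F1 score for each of the six attribute types exceeded 80%. The method is of practical significance for monitoring and early warning of terrorism-related events in cyberspace and for maintaining public safety.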
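The TF-IDF step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tfidf_keywords`, the unsmoothed weighting tf × ln(N/df), and the English toy sentences are all assumptions made for illustration (the paper works on a Chinese corpus, which would first need word segmentation).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring terms for each tokenized document.

    docs is a list of non-empty token lists. The weighting used here,
    tf * ln(N / df), is one common TF-IDF variant chosen for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Invented toy documents, already split into tokens.
docs = [
    "the bomb attack killed five people".split(),
    "the market opened after the holiday".split(),
    "the gunmen attack injured the guards".split(),
]
keywords = tfidf_keywords(docs, top_k=2)
```

A term that occurs in every document, such as "the" here, receives a score of zero and so never enters the lexicon, which is the property that makes TF-IDF useful for picking out domain-specific vocabulary.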

     

    Abstract: Driven by complex international factors, terrorism events have become increasingly frequent in many countries in recent years, posing a great threat to the global community. With the widespread use of emerging technologies in military and commercial fields, terrorist organizations have also begun to use these technologies for destructive activities. As the Internet and information technology have developed, terrorism has spread rapidly in cyberspace. Terrorist organizations have created terrorism websites, established multinational networks, released recruitment information and even conducted training through mainstream websites with a worldwide reach. Compared with traditional terrorist activities, cyber terrorist activities can be more destructive, and cybercrime and cyber terrorism have become some of the most serious challenges facing societies. Terrorist organizations exploit the Internet to disseminate extremist ideas rapidly and to recruit large numbers of terrorists and supporters around the world, especially in developed Western countries. They even use the Internet and "dark net" networks to conduct terrorist training, so their activities are concealed. As a result, "lone wolf" terrorist attacks have emerged in one country after another and are difficult to prevent. This study proposes a method for extracting the entities and attributes of terrorist events based on semantic role analysis, providing technical support for monitoring and early warning of terrorism activities in cyberspace. First, text corpus data are collected from the Anti-Terrorism Information Site of the Northwest University of Political Science and Law and cleaned, and a naive Bayes text classification algorithm is used to identify terrorism event texts. The keyword extraction algorithm TF-IDF (term frequency-inverse document frequency) is then used to build a terrorism-specific lexicon from the classified corpus, which is combined with natural language processing techniques to produce a part-of-speech-tagged lexicon. Next, semantic role analysis and syntactic dependency analysis are applied to extract four types of terrorism-related triple structures: subject-predicate-object relations, postposed-attributive verb-object relations, person name/place name/organization, and preposition-object relations with subject-predicate verb-complement structure. Finally, regular expressions and the part-of-speech-tagged terrorism-specific lexicon are used to extract six types of entity attributes (time of occurrence, location, casualties, attack method, weapon type and terrorist organization) from the four types of triple short texts. In experiments on the 4221 collected articles, the F1 score for each of the six attribute types exceeded 80%. The proposed method therefore has practical significance for monitoring and early warning of terrorism events in cyberspace and for maintaining public safety.
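The final step, combining regular expressions with a domain lexicon to pull attributes out of a short text, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method as implemented in the paper: the patterns, the two-entry `LEXICON`, the function `extract_attributes` and the sample sentence are all hypothetical, and English toy text stands in for the Chinese triple short texts.

```python
import re

# Hypothetical mini-lexicon standing in for the POS-tagged
# terrorism-specific lexicon described in the abstract.
LEXICON = {
    "weapon_type": ["rifle", "explosive", "grenade"],
    "attack_method": ["suicide bombing", "shooting", "kidnapping"],
}

# Assumed patterns: ISO-style dates and "<number> killed/injured" phrases.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
CASUALTY_RE = re.compile(
    r"\b(\d+)\s+(?:people\s+)?(killed|injured|dead|wounded)\b"
)


def extract_attributes(text):
    """Extract time, casualties and lexicon-based attributes from text."""
    lowered = text.lower()
    attrs = {}

    date_match = DATE_RE.search(text)
    if date_match:
        attrs["time"] = date_match.group(0)

    # Map each casualty kind to its count, e.g. {"killed": 4}.
    casualties = {kind: int(n) for n, kind in CASUALTY_RE.findall(lowered)}
    if casualties:
        attrs["casualties"] = casualties

    # Lexicon lookup by simple substring match.
    for attribute, terms in LEXICON.items():
        hits = [term for term in terms if term in lowered]
        if hits:
            attrs[attribute] = hits
    return attrs


# Invented sample sentence, not taken from the corpus.
sample = "On 2015-01-10 a suicide bombing with an explosive left 4 killed and 12 injured."
result = extract_attributes(sample)
```

Regular expressions suit attributes with a regular surface form (dates, counts), while open-ended attributes such as weapon type or terrorist organization depend on the coverage of the lexicon, which is why the abstract pairs the two.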

     
