Abstract:
Internet public opinion is an important source of people's views on social hot topics and national current affairs. Topic detection in long online texts supports the analysis of online public opinion, and its results help policymakers make timely, reliable, and well-founded decisions. In general, topic detection can be divided into two steps: representation learning and topic discovery. However, common representation learning methods, such as the vector space model (VSM) and term frequency-inverse document frequency (TF-IDF), often suffer from high dimensionality, sparsity, and loss of latent semantics, whereas traditional topic discovery methods depend heavily on the order in which texts are input. To overcome these problems, a novel topic detection method is presented herein. First, a representation learning method based on Word2vec and latent Dirichlet allocation (LDA) is proposed to avoid high-dimensional sparsity and the neglect of latent semantics. A weighted fusion of the implicit topics of feature words extracted by LDA and the feature word vectors mapped by Word2vec not only reduces dimensionality but also completely represents the text information. Furthermore, combining Single-Pass with hierarchical agglomerative clustering for topic discovery makes the method more robust to input order. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments were conducted on a real-world multi-source dataset collected from university social platforms. The experimental results show that the proposed method outperforms other methods, such as VSM and Single-Pass, improving clustering accuracy by 10%-20%.
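As a rough illustration of the two steps named above, the following minimal Python sketch fuses a topic vector with a word vector and then runs a Single-Pass clustering over the fused vectors. The abstract does not give the fusion formula, the similarity measure, or the threshold, so the weighted concatenation in `fuse`, the cosine similarity, and the parameters `alpha` and `threshold` are all assumptions made for illustration; the hierarchical agglomerative clustering stage is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse(topic_vec, word_vec, alpha=0.5):
    """Hypothetical fusion: weighted concatenation of an LDA-style topic
    vector and a Word2vec-style document vector. `alpha` is an assumed weight."""
    return [alpha * t for t in topic_vec] + [(1 - alpha) * w for w in word_vec]


def single_pass(vectors, threshold=0.8):
    """Single-Pass clustering: each vector joins the most similar existing
    cluster if the similarity reaches `threshold`, otherwise it starts a new one.
    The result depends on the order of `vectors`."""
    centroids, counts, labels = [], [], []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No cluster is similar enough: open a new cluster seeded with v.
            centroids.append(list(v))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched cluster's centroid.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], v)]
            counts[best] = n + 1
            labels.append(best)
    return labels


# Toy inputs standing in for per-document LDA and Word2vec vectors.
docs = [
    fuse([1.0, 0.0], [1.0, 0.0]),
    fuse([0.9, 0.1], [1.0, 0.1]),
    fuse([0.0, 1.0], [0.0, 1.0]),
]
labels = single_pass(docs, threshold=0.8)
```

Because Single-Pass assigns each text as it arrives, feeding the same vectors in a different order can yield different clusters, which is the order sensitivity that the proposed combination with hierarchical agglomerative clustering is meant to reduce.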