Deep Learning on Improved Word Embedding Model for Topic Classification
DOI: 10.12677/CSA.2016.611077
Authors: Zhou Yingying and Fan Lei, School of Information Security Engineering, Shanghai Jiao Tong University, Shanghai, China
Keywords: Topic Classification, Deep Learning, Convolutional Neural Network, Word Embedding
Abstract: Topic classification is widely applied in content retrieval and information filtering. Its core problem divides into two parts: text representation and classification modeling. In recent years, methods that represent text with distributed word embeddings and use a convolutional neural network (CNN) as the classifier have achieved strong results. This paper studies the impact of different word embeddings on CNN classification performance and proposes topic2vec, a word embedding model designed for Chinese corpora. Using this model, we conduct experiments on Zhihu, a representative content-generating internet community. The results show that a CNN with topic2vec embeddings reaches accuracies of 98.06% on long content texts and 93.27% on short title texts, a significant improvement over known word embedding models.
Article citation: Zhou, Y. and Fan, L. (2016) Deep Learning on Improved Word Embedding Model for Topic Classification. Computer Science and Application, 6(11), 629-637. http://dx.doi.org/10.12677/CSA.2016.611077
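The pipeline the abstract describes (pretrained word vectors as input, a convolutional layer with max-over-time pooling, then a softmax classifier) is the standard CNN sentence-classification architecture. The following NumPy forward-pass sketch illustrates only that data flow; it is not the paper's topic2vec model, and all sizes, weights, and the vocabulary are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not the paper's settings.
VOCAB, EMB_DIM, FILTERS, WINDOW, CLASSES = 100, 8, 4, 3, 2

embeddings = rng.normal(size=(VOCAB, EMB_DIM))        # word-vector table (e.g. from word2vec)
conv_w = rng.normal(size=(FILTERS, WINDOW, EMB_DIM))  # filters over windows of WINDOW words
fc_w = rng.normal(size=(FILTERS, CLASSES))            # softmax classifier weights

def forward(token_ids):
    """Forward pass: embed -> 1D convolution -> max-over-time pooling -> softmax."""
    x = embeddings[token_ids]                         # (seq_len, EMB_DIM)
    n = len(token_ids) - WINDOW + 1                   # number of word windows
    # Slide each filter over every window of WINDOW consecutive word vectors.
    feats = np.array([[np.sum(x[i:i + WINDOW] * f) for i in range(n)]
                      for f in conv_w])               # (FILTERS, n)
    pooled = np.tanh(feats).max(axis=1)               # max-over-time -> (FILTERS,)
    logits = pooled @ fc_w                            # (CLASSES,)
    exp = np.exp(logits - logits.max())               # numerically stable softmax
    return exp / exp.sum()

probs = forward([5, 17, 42, 7, 99])  # a 5-token "document" of made-up word ids
print(probs)
```

Swapping the random `embeddings` table for vectors learned by a model such as topic2vec, and training the weights by backpropagation, is what differentiates the approaches the paper compares.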