Research on Chinese Text Classification_Computer Science and Technology.rar

Category: Computer Information | Uploaded by: 学大教育 | Updated: 2013-10-08
Contents: complete thesis | Thesis length: 13295 characters | Format: Word (*.doc)

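Abstract: Text classification is an important application area of natural language processing and plays a major role in information retrieval, digital libraries, and text filtering. It can make document management more systematic and standardized, helping it meet the requirements of modern management practice.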

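   This thesis first introduces the research background and significance of text classification and the state of research in China and abroad. It then describes in detail the techniques and algorithms used to build a text classification system. Building on this overview of Chinese information processing and text classification techniques, it implements a Chinese text classification system based on the vector space model: feature selection is used to build a class model from the training sample set, the classifier is designed with that model as the basis for automatic classification, and the ROCCHIO and KNN methods are applied in turn to classify texts. Finally, the experimental results are analyzed and evaluated.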
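   The two classifiers named above can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes documents are already segmented into tokens, uses plain term-frequency weights with cosine similarity, reduces ROCCHIO to its simplest form (one centroid per class built from positive examples only), and lets KNN vote with similarity weights. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def to_vector(tokens):
    """L2-normalised term-frequency vector as a sparse dict {term: weight}."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def train_rocchio(samples):
    """Build one normalised centroid per class (assumed simplification)."""
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, label in samples:
        for t, w in to_vector(tokens).items():
            sums[label][t] += w
        counts[label] += 1
    centroids = {}
    for label, vec in sums.items():
        mean = {t: w / counts[label] for t, w in vec.items()}
        norm = math.sqrt(sum(w * w for w in mean.values()))
        centroids[label] = {t: w / norm for t, w in mean.items()} if norm else {}
    return centroids


def classify_rocchio(centroids, tokens):
    """Assign the class whose centroid is most similar to the document."""
    v = to_vector(tokens)
    return max(centroids, key=lambda c: cosine(v, centroids[c]))


def classify_knn(samples, tokens, k=3):
    """Similarity-weighted vote among the k most similar training documents."""
    v = to_vector(tokens)
    nearest = sorted(
        ((cosine(v, to_vector(toks)), label) for toks, label in samples),
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in nearest:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Toy, pre-segmented training data (hypothetical).
train = [
    (["match", "goal", "team"], "sports"),
    (["match", "champion"], "sports"),
    (["stock", "market"], "economy"),
    (["market", "economy", "growth"], "economy"),
]
centroids = train_rocchio(train)
print(classify_rocchio(centroids, ["match", "goal"]))
print(classify_knn(train, ["market", "stock"], k=3))
```

   The centroid classifier stores only one vector per class, so it is cheap at prediction time, while KNN compares the new document with every training document and depends on the choice of k.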

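   Automatic text classification consists of four stages: text modeling, training, classification, and performance evaluation. Texts are first preprocessed, represented with the model, and subjected to feature extraction; a classifier is then constructed and trained; the classifier is used to classify new texts; and finally the classification performance is evaluated.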
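   The feature extraction stage can be sketched as follows, assuming that the MI method reported in the experiments denotes mutual information between a term and a class, estimated from document counts as log(A·N / ((A+B)(A+C))), and that a term's score is its maximum over classes. Both the estimate and the max-over-classes rule are common textbook choices assumed here, not details confirmed by the thesis; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mutual_information(samples):
    """Score each term by its maximum MI over the classes it occurs in.

    samples: list of (tokens, label) pairs with pre-segmented tokens.
    """
    n_docs = len(samples)
    class_docs = Counter(label for _, label in samples)  # docs per class
    term_docs = Counter()                                # docs containing term
    joint = defaultdict(Counter)                         # joint[term][label]
    for tokens, label in samples:
        for t in set(tokens):  # document frequency, not raw term counts
            term_docs[t] += 1
            joint[t][label] += 1
    scores = {}
    for t, per_class in joint.items():
        scores[t] = max(
            math.log(a * n_docs / (term_docs[t] * class_docs[c]))
            for c, a in per_class.items()
        )
    return scores


def select_features(samples, n):
    """Keep the n highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information(samples)
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Toy, pre-segmented corpus (hypothetical).
corpus = [
    (["goal", "match", "the"], "sports"),
    (["match", "the"], "sports"),
    (["stock", "the"], "economy"),
    (["market", "stock", "the"], "economy"),
]
print(select_features(corpus, 4))
```

   A term such as "the" that appears equally in every class scores zero and is dropped, while terms concentrated in one class are kept. The parameter n plays the role of the feature dimension that the experiments vary.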

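   The Chinese corpus used in the experiments is divided into a training corpus and a test corpus covering 10 categories: computer, environment, military, transportation, education, economy, sports, medicine, art, and politics. The training corpus contains 1430 documents and the test corpus 195, for a reported total of 1525. The experimental data show that with MI feature extraction, classification performance changes markedly as the feature dimension increases, and that the choice of K in KNN also strongly affects classifier performance. With the best feature dimension and K value, the KNN classifier reaches a macro-averaged precision of 91.9% and a macro-averaged recall of 90.8%, which is reasonably good; the ROCCHIO classifier reaches a macro-averaged precision of 54.9% and a macro-averaged recall of 45.1%, which is unsatisfactory compared with KNN.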
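   For readers unfamiliar with the macro-averaged measures quoted above, the following sketch shows how they are usually computed: precision and recall are calculated per class and then averaged with equal weight per class, regardless of class size. The function name and the toy labels are hypothetical, and treating a class with no predictions as precision 0 is an assumed convention.

```python
from collections import Counter


def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    true_n = Counter(y_true)   # documents that belong to each class
    pred_n = Counter(y_pred)   # documents assigned to each class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # A class with an empty denominator contributes 0 (assumed convention).
    precisions = [tp[c] / pred_n[c] if pred_n[c] else 0.0 for c in labels]
    recalls = [tp[c] / true_n[c] if true_n[c] else 0.0 for c in labels]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels (hypothetical), not the thesis's data.
y_true = ["sports", "sports", "economy", "economy", "art"]
y_pred = ["sports", "economy", "economy", "economy", "art"]
precision, recall = macro_precision_recall(y_true, y_pred)
print(f"macro precision={precision:.3f}, macro recall={recall:.3f}")
```

   Because every class counts equally, a classifier that does poorly on a few small categories is penalised more under macro-averaging than under micro-averaging.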

Keywords: text classification, vector space model, feature extraction, training samples

 

Abstract: Text classification is an important application of natural language processing and holds an important position in information retrieval, digital libraries, text filtering, and related areas. It can help make document management scientific and standardized, adapting it to the requirements of a modern management system.

   This thesis first introduces the research background and significance of text classification and its research status at home and abroad. Second, the technologies and algorithms used in building a text classification system are described in detail. Then, building on this introduction to Chinese information processing and text classification techniques and algorithms, a Chinese text categorization system based on the vector space model is implemented: through feature selection, a class model is built from the training sample set, and a classifier is designed using that model as the basis for automatic text classification. The ROCCHIO and KNN text classification methods are used in turn to classify the texts. Finally, the experimental results are analyzed and evaluated.

   Text categorization comprises four processes: text modeling, training, classification, and performance evaluation. First, the text is preprocessed, represented with the model, and its features are extracted; next, a classifier is constructed and trained; then the classifier is used to classify new text; finally, the classification performance is evaluated.

   The Chinese corpus used in this experiment is divided into two parts, a training corpus and a test corpus, covering 10 categories: computer, environment, military, transportation, education, economy, sports, medicine, art, and politics. The training corpus contains 1430 documents and the test corpus 195, for a reported total of 1525. The experimental data show that with the MI feature extraction method, classification performance changes significantly as the feature dimension increases, and that the choice of K in KNN also has a considerable impact on classifier performance. When both the feature dimension and K take their best values, the KNN classifier achieves a macro-averaged precision of 91.9% and a macro-averaged recall of 90.8%, which is a good result; the ROCCHIO classifier achieves a macro-averaged precision of 54.9% and a macro-averaged recall of 45.1%, which is unsatisfactory compared with KNN.

Keywords: text classification, vector space model, feature extraction, training samples

 
