Graduation Project: Chinese Text Classification (Machine Learning and Deep Learning) - 深圳市維司達科技有限公司

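Contents

0 Introduction
1 Preface
2 Chinese text classification
3 Dataset preparation
4 Classical machine learning methods
- 4.1 Tokenization and stop-word removal
- 4.2 Text vectorization with tf-idf
- 4.3 Building the training and test data
- 4.4 Training the classifiers
- - 4.4.1 Logistic regression classifier
- 4.5 Random Forest classifier
- 4.6 Conclusions
5 Deep learning classifier - CNN text classification
- 5.1 Character-level feature extraction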

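0 Introduction

Today I'd like to introduce a graduation project topic: Chinese text classification.

Chinese text classification (machine learning and deep learning): news classification, sentiment classification, spam classification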


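1 Preface

I was recently helping a student build a project that happened to need text classification, so this post walks through the main methods and workflow for Chinese text classification.

The main purpose of this post is to record how I built my own text classification systems: one based on traditional machine learning and one based on deep learning, both evaluated on the same dataset.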

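2 Chinese text classification

Text classification is one of the most classic NLP tasks, and a large number of techniques have accumulated around it. If we use "does it rely on deep learning?" as the criterion, the approaches fall roughly into two categories:

Text classification based on traditional machine learning
Text classification based on deep learning

fastText, open-sourced by Facebook, is a simplified member of the second category: it averages the word vectors and feeds the result directly into a softmax layer. The TextCNN model, which is widely used in both industry and research, also belongs to the second category.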
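To make the "average the word vectors, then softmax" idea concrete, here is a minimal pure-Python sketch. It is not fastText's actual implementation (which also uses n-gram features and a learned, optionally hierarchical, output layer); the function names, the toy embeddings, and the weights are all hypothetical values chosen by hand for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_vectors(tokens, embeddings):
    """Average the embeddings of the tokens that have one."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def predict_proba(tokens, embeddings, weights, bias):
    """Averaged document vector -> linear layer -> softmax."""
    doc = average_vectors(tokens, embeddings)
    scores = [sum(w * x for w, x in zip(row, doc)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

# Toy, hand-picked values (hypothetical; a real model learns these).
embeddings = {"股票": [1.0, 0.0], "上涨": [0.8, 0.2], "球队": [0.0, 1.0]}
weights = [[2.0, -1.0],   # class 0: "finance"
           [-1.0, 2.0]]   # class 1: "sports"
bias = [0.0, 0.0]

probs = predict_proba(["股票", "上涨"], embeddings, weights, bias)
print(probs)  # class 0 gets the larger probability
```

Because the document vector is just an average, word order is ignored, which is part of why this family of models is so fast to train.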

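For the classical machine learning approach, I extract tf-idf text features, feed them into a logistic regression classifier and a random forest classifier, and compare the performance of the two.

For deep learning based text classification, I mainly use a CNN. Since RNN models do not differ much from CNN models in performance here and take considerably longer to train, I did not run further experiments with them.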
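As a rough illustration of what a CNN does with text, the sketch below slides a single filter over a sequence of embedding vectors, applies ReLU, and keeps the maximum activation (max-over-time pooling), which is the core operation in TextCNN-style models. This is a hand-rolled sketch with a hypothetical function name and toy numbers, not the model used in this post; a real model would use many filters of several widths inside a deep learning framework.

```python
def conv1d_max_pool(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors, apply ReLU, keep the max.

    sequence: list of embedding vectors, one per character or word
    kernel:   list of `width` weight vectors with the same dimension
    """
    width = len(kernel)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the kernel width")
    activations = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias
        for vec, w in zip(window, kernel):
            total += sum(x * wx for x, wx in zip(vec, w))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling

# Toy 2-dimensional embeddings for a 4-token text (hypothetical values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one width-2 filter
feature = conv1d_max_pool(sequence, kernel)
print(feature)
```

Each filter contributes one such feature; concatenating the features from all filters gives the fixed-length vector that the final classification layer sees.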

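Along the way I picked up a few useful small tricks, including multi-process tokenization, fitting tf-idf on the full corpus, and handling Chinese encodings in Python 2. All of them are covered in detail below.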

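3 Dataset preparation

This post uses the popular Sogou news dataset. The copy I obtained had already been preprocessed, which saved a lot of data-cleaning effort.

The dataset covers 10 news categories and contains 65,000 texts in total: 50,000 in the training set, 10,000 in the test set, and 5,000 in the validation set.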

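4 Classical machine learning methods

4.1 Tokenization and stop-word removal

Using the tokenizer utility class from my earlier short-text classification post, I tokenize the training, test, and validation sets with multiple processes to save time: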

    # Python 2 code
    import multiprocessing

    # WordCut is the tokenizer utility class from the short-text classification post;
    # its import is not shown here.
    tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
    file_list = [tmp_catalog + 'cnews.train.txt', tmp_catalog + 'cnews.test.txt']
    write_list = [tmp_catalog + 'train_token.txt', tmp_catalog + 'test_token.txt']

    def tokenFile(file_path, write_path):
        word_divider = WordCut()
        with open(write_path, 'w') as w:
            with open(file_path, 'r') as f:
                for line in f.readlines():
                    # Each line is "label<TAB>text"; decode bytes to unicode before tokenizing
                    line = line.decode('utf-8').strip()
                    token_sen = word_divider.seg_sentence(line.split('\t')[1])
                    # Encode back to UTF-8 bytes before writing
                    w.write(line.split('\t')[0].encode('utf-8') + '\t' + token_sen.encode('utf-8') + '\n')
        print file_path + ' has been token and token_file_name is ' + write_path

    pool = multiprocessing.Pool(processes=4)
    for file_path, write_path in zip(file_list, write_list):
        pool.apply_async(tokenFile, (file_path, write_path,))
    pool.close()
    pool.join()  # close() must be called before join()
    print "Sub-process(es) done."
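For readers on Python 3, the explicit `decode`/`encode` calls are no longer needed if the files are opened with an encoding. The sketch below shows one way the per-file worker could look. The function name `tokenize_file`, the `stopwords` parameter, and the whitespace tokenizer used in the demo are assumptions made for illustration; in practice `tokenize` would be a real Chinese segmenter such as the `WordCut` class above or `jieba.lcut`.

```python
import io
import os
import tempfile

def tokenize_file(file_path, write_path, tokenize, stopwords=frozenset()):
    """Tokenize the text column of a `label<TAB>text` file and drop stop words.

    `tokenize` is any callable that maps a string to a list of tokens.
    Returns the number of lines written.
    """
    count = 0
    with io.open(file_path, "r", encoding="utf-8") as src, \
         io.open(write_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if "\t" not in line:
                continue  # skip empty or malformed lines
            label, text = line.split("\t", 1)
            tokens = [t for t in tokenize(text)
                      if t.strip() and t not in stopwords]
            dst.write(label + "\t" + " ".join(tokens) + "\n")
            count += 1
    return count

# Demo on a temporary file; str.split stands in for a real segmenter.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "demo.txt")
out_path = os.path.join(tmp_dir, "demo_token.txt")
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"体育\t球队 赢 了 比赛\n")

n_lines = tokenize_file(src_path, out_path, tokenize=str.split, stopwords={u"了"})
with io.open(out_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

A function like this can be handed to `multiprocessing.Pool.apply_async` in the same way as `tokenFile`, as long as the `tokenize` callable is picklable (a module-level function rather than a lambda).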

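4.2 Text vectorization with tf-idf

A few points need attention here. First, tf-idf is computed over the full corpus, so the train, test, and val corpora must all be concatenated before fitting. Second, to keep the feature space from growing too large, low-frequency words need to be dropped. Because the code was written in Jupyter, I first tested it on the smallest set (val) and, once that worked, repeated the steps for test and train; I hope this doesn't leave an impression of redundant code. The implementation is as follows: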

    # Python 2 code
    def constructDataset(path):
        """
        path: file path
        rtype: label_list and corpus_list
        """
        label_list = []
        corpus_list = []
        with open(path, 'r') as p:
            for line in p.readlines():
                label_list.append(line.split('\t')[0])
                corpus_list.append(line.split('\t')[1])
        return label_list, corpus_list

    tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
    file_path = 'val_token.txt'
    val_label, val_set = constructDataset(tmp_catalog + file_path)
    print len(val_set)

    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
    write_list = [tmp_catalog + 'train_token.txt', tmp_catalog + 'test_token.txt']

    tarin_label, train_set = constructDataset(write_list[0])  # 50000
    test_label, test_set = constructDataset(write_list[1])    # 10000

    # Compute tf-idf over the full corpus
    corpus_set = train_set + val_set + test_set
    print "length of corpus is: " + str(len(corpus_set))

    # Drop low-frequency words (df < 1e-5).
    # Note: a float min_df is a proportion of documents, so with 65000 documents
    # the threshold is 1e-5 * 65000 = 0.65 and every term that occurs at all is kept.
    vectorizer = CountVectorizer(min_df=1e-5)
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus_set))
    words = vectorizer.get_feature_names()  # get_feature_names_out() in recent scikit-learn
    print "how many words: {0}".format(len(words))
    print "tf-idf shape: ({0},{1})".format(tfidf.shape[0], tfidf.shape[1])

    """
    length of corpus is: 65000
    how many words: 379000
    tf-idf shape: (65000,379000)
    """
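To show what the vectorizer and transformer are computing, here is a small pure-Python sketch of tf-idf with low-frequency filtering. It follows the smoothed formula that scikit-learn documents for `TfidfTransformer` defaults, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, but it is only an illustration: the function name is hypothetical, and `min_df` here is an absolute document count rather than a proportion.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1):
    """docs: list of token lists. Returns (vocabulary, list of {term: weight})."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in idf)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}  # L2 normalize
        vectors.append(weights)
    return vocab, vectors

docs = [["股票", "上涨", "股票"], ["球队", "比赛"], ["股票", "比赛", "罕见词"]]
vocab, vectors = tfidf_vectors(docs, min_df=2)
print(vocab)
print(vectors)
```

With `min_df=2`, the terms that appear in only one document are dropped from the vocabulary, which is the kind of pruning that keeps the feature matrix from exploding on a 65,000-document corpus.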

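4.3 Building the training and test data

The texts were already split into the three datasets with some randomness, so I take a shortcut here and skip shuffling.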

    from sklearn import preprocessing

    # Encode labels (same order as the corpus: train, val, test)
    corpus_label = tarin_label + val_label + test_label
    encoder = preprocessing.LabelEncoder()
    corpus_encode_label = encoder.fit_transform(corpus_label)
    train_label = corpus_encode_label[:50000]
    val_label = corpus_encode_label[50000:55000]
    test_label = corpus_encode_label[55000:]

    # Slice the tf-idf matrix into the three datasets
    train_set = tfidf[:50000]
    val_set = tfidf[50000:55000]
    test_set = tfidf[55000:]
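The two steps in this section, mapping string labels to integers and cutting the concatenated data back into train/val/test by position, can be sketched in a few lines of plain Python. The helper names below are hypothetical; the label mapping mirrors the sorted-classes behaviour of `LabelEncoder`, and the sizes are toy values rather than 50000/5000/10000.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return classes, index

def split_by_sizes(items, sizes):
    """Cut a sequence into consecutive parts of the given sizes."""
    parts, start = [], 0
    for size in sizes:
        parts.append(items[start:start + size])
        start += size
    return parts

labels = ["体育", "财经", "体育", "科技"]
classes, index = fit_label_encoder(labels)
encoded = [index[label] for label in labels]
decoded = [classes[i] for i in encoded]

train_y, val_y, test_y = split_by_sizes(encoded, [2, 1, 1])
print(encoded, train_y, val_y, test_y)
```

Positional slicing only works because the labels and the tf-idf rows were concatenated in the same order (train, then val, then test), so it is worth keeping those two concatenations side by side in the code.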

4.4 Training the classifiers

4.4.1 Logistic regression classifier

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    # from sklearn.metrics import confusion_matrix

    # LogisticRegression classification model
    lr_model = LogisticRegression()
    lr_model.fit(train_set, train_label)
    print "val mean accuracy: {0}".format(lr_model.score(val_set, val_label))
    y_pred = lr_model.predict(test_set)
    print classification_report(test_label, y_pred)

The classification results include precision, recall, and F1 score.
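For reference, the three per-class numbers printed by `classification_report` are defined from true positives, false positives, and false negatives. A minimal pure-Python sketch follows; the function name `per_class_metrics` and the toy predictions are hypothetical, and the real report additionally prints support and averaged rows.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for every label seen."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
report = per_class_metrics(y_true, y_pred)
print(report)
```

Reading the per-class rows rather than a single accuracy figure is what makes it possible to say which news categories one classifier handles better than the other.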

4.5 Random Forest classifier

    # Random forest classifier
    from sklearn.ensemble import RandomForestClassifier

    rf_model = RandomForestClassifier(n_estimators=200, random_state=1080)
    rf_model.fit(train_set, train_label)
    print "val mean accuracy: {0}".format(rf_model.score(val_set, val_label))
    y_pred = rf_model.predict(test_set)
    print classification_report(test_label, y_pred)

The classification results again include precision, recall, and F1 score.

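4.6 Conclusions

1 The experiments above compare a logistic regression classifier with a random forest classifier.
2 Apart from a few categories where the random forest does noticeably better, it performs worse than logistic regression on most categories.
3 In addition, a random forest with 200 trees takes far too long to train; compared with logistic regression, it is much less efficient.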

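5 Deep learning classifier - CNN text classification

5.1 Character-level feature extraction

The biggest difference from the previous sections is how text features are extracted. The CNN model here uses character-level features, for example in cnews_loader.py under the data directory: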

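    from collections import Counter

    # open_file is a helper defined elsewhere in cnews_loader.py (not shown here)

    def read_file(filename):
        """Read the file data"""
        contents, labels = [], []
        with open_file(filename) as f:
            for line in f:
                try:
                    label, content = line.strip().split('\t')
                    contents.append(list(content))  # character-level features
                    labels.append(label)
                except:
                    pass
        return contents, labels

    def build_vocab(train_dir, vocab_dir, vocab_size=5000):
        """Build the vocabulary from the training set and store it"""
        data_train, _ = read_file(train_dir)

        all_data = []
        for content in data_train:
            all_data.extend(content)

        counter = Counter(all_data)
        count_pairs = counter.most_common(vocab_size - 1)
        words, _ = list(zip(*count_pairs))
        # Add a <PAD> token so that all texts can be padded to the same length
        words = ['<PAD>'] + list(words)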
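The excerpt stops once the vocabulary list is built. To illustrate why `<PAD>` sits at index 0, here is a hedged sketch of the next step: turning characters into ids and padding every text to one length. The helper names, the choice to drop out-of-vocabulary characters, and padding on the right are assumptions made for this example, not necessarily what cnews_loader.py does.

```python
def build_char_to_id(words):
    """Map each vocabulary entry to its position; '<PAD>' gets id 0."""
    return {ch: i for i, ch in enumerate(words)}

def encode_and_pad(content, char_to_id, max_len):
    """Convert characters to ids, truncate to max_len, and right-pad with <PAD>."""
    ids = [char_to_id[ch] for ch in content if ch in char_to_id]  # drop unknown chars
    ids = ids[:max_len]
    return ids + [char_to_id["<PAD>"]] * (max_len - len(ids))

words = ["<PAD>", "你", "好", "大", "家"]  # toy vocabulary
char_to_id = build_char_to_id(words)

short = encode_and_pad(list("你好"), char_to_id, max_len=4)
long_ = encode_and_pad(list("大家你好呀"), char_to_id, max_len=4)
print(short, long_)
```

Fixed-length integer sequences like these are what an embedding layer followed by convolution layers expects as input.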

I ran a small test here:

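    #! /bin/env python
    # -*- coding: utf-8 -*-
    from collections import Counter

    """
    Character-level processing. For Chinese, the resulting units are basically not
    the original characters, but they can still serve as a statistical feature
    to represent the text.
    """
    content1 = "你好呀大家"
    content2 = "你真的好吗？"
    # content = "abcdefg"

    all_data = []
    all_data.extend(list(content1))
    all_data.extend(list(content2))
    # print list(content)  # character-level processing
    # print "length: " + str(len(list(content)))

    counter = Counter(all_data)
    count_pairs = counter.most_common(5)
    words, _ = list(zip(*count_pairs))
    words = ['<PAD>'] + list(words)
    # ['<PAD>', '\xe5', '\xbd', '\xa0', '\xe4', '\xe7']
    # In Python 2, list() on a byte string yields individual UTF-8 bytes,
    # which is why the entries are bytes rather than whole Chinese characters.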
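For comparison, the same test under Python 3 yields whole characters, because `str` is a sequence of Unicode code points there. The sketch below also shows the UTF-8 bytes explicitly, to make clear where values such as `\xe5` in the Python 2 output come from; the variable names are mine.

```python
from collections import Counter

content1 = "你好呀大家"
content2 = "你真的好吗？"

# Python 3: iterating a str gives real characters
chars = list(content1) + list(content2)

# What Python 2's list() iterated over for a byte string: raw UTF-8 bytes
utf8_bytes = list(content1.encode("utf-8"))

counter = Counter(chars)
count_pairs = counter.most_common(5)
words = ["<PAD>"] + [ch for ch, _ in count_pairs]

print(len(chars), len(utf8_bytes))
print(words)
```

Each Chinese character in this example takes three bytes in UTF-8, so the byte-level vocabulary mixes fragments of different characters, while the Python 3 version counts the characters themselves.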