Python – Text Classification

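In many cases we need to sort the available text into categories according to predefined criteria. nltk provides this as part of several of its corpora. In the example below, we look at the movie review corpus and check which categories are available.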

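# Let's see how the movie reviews are categorized
# Requires the corpus data: run nltk.download('movie_reviews') once
from nltk.corpus import movie_reviews

# categories() returns the category labels defined for the corpus
all_cats = [category.lower() for category in movie_reviews.categories()]
print(all_cats)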

Running the program above produces the following output:

['neg', 'pos']
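
The category labels follow the corpus layout: each file ID starts with its category directory, as in the `pos/cv944_13521.txt` ID used further down this page. The sketch below groups file IDs by that prefix using only the standard library. The helper `category_of` and every file ID except the first are hypothetical, made up for illustration; with the corpus installed, `movie_reviews.fileids(categories='pos')` would be the usual way to get the real list.

```python
from collections import defaultdict


def category_of(file_id):
    """Return the directory prefix of a file ID, e.g. 'pos' for 'pos/x.txt'."""
    return file_id.split("/", 1)[0]


# Only the first ID appears in this tutorial; the others are made-up examples.
file_ids = [
    "pos/cv944_13521.txt",
    "pos/example_0001.txt",  # hypothetical
    "neg/example_0002.txt",  # hypothetical
    "neg/example_0003.txt",  # hypothetical
]

# Map each category label to the file IDs stored under it
files_by_category = defaultdict(list)
for file_id in file_ids:
    files_by_category[category_of(file_id)].append(file_id)

for category in sorted(files_by_category):
    print(category, len(files_by_category[category]))
```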

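Now let's look at the content of one of the files with a positive review. We split the text of this file into sentences and print the first four to see a sample.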

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

# Raw text of one positive review
sample = movie_reviews.raw("pos/cv944_13521.txt")

# Split the text into sentences and print the first four
sentences = sent_tokenize(sample)
for sentence in sentences[:4]:
    print(sentence)

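Running the program above produces the following output (each sentence keeps the line breaks of the raw file, so the four sentences span seven lines):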

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films
like deep impact, = godzilla, the x-files, armageddon, the truman show,
all of which has but = one main aim, to rock the box office.
leading the pack this summer is = deep impact, one of the first few film
releases from the = spielberg-katzenberg-geffen's dreamworks production company.
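
To make the idea of sentence splitting concrete, here is a deliberately naive sketch that splits after `.`, `!` or `?` when whitespace follows. It is not how `sent_tokenize` works, which relies on a trained Punkt model; the function name `naive_sent_split` is hypothetical, and the sample string is just the first two sentences of the output above.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# First two sentences of the review output shown above
sample = (
    "meteor threat set to blow away all volcanoes & twisters ! "
    "summer is here again !"
)

for sentence in naive_sent_split(sample):
    print(sentence)
```

A rule this simple also splits after abbreviations such as "dr.", which is one reason to prefer a trained tokenizer for real text.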

Next, we take the word tokens from every file and use nltk's FreqDist class to find the most common words.

import nltk
from nltk.corpus import movie_reviews

# Lowercase every token in the corpus before counting
all_words = [w.lower() for w in movie_reviews.words()]

# FreqDist maps each distinct token to its number of occurrences
freq_dist = nltk.FreqDist(all_words)
print(freq_dist.most_common(10))

Running the program above produces the following output:

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576),
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
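
The top ten above is dominated by punctuation and function words, which say little about whether a review is positive or negative. A common next step is to filter such tokens out before counting. The sketch below does this with `collections.Counter`, which `nltk.FreqDist` subclasses, so `most_common` behaves the same way. The token list, the small `STOPWORDS` set and the helper `content_words` are hypothetical stand-ins; in practice the tokens would come from `movie_reviews.words()` and the stop words from a fuller list.

```python
from collections import Counter

# Hypothetical stop-word list, far smaller than a real one
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in"}

# Hypothetical tokens standing in for movie_reviews.words()
tokens = ["The", "plot", "is", "thin", ",", "and", "the", "acting",
          "is", "thin", ".", "A", "waste", "of", "a", "good", "plot", "."]


def content_words(words):
    """Yield lowercased tokens, skipping punctuation and stop words."""
    for word in words:
        word = word.lower()
        if word.isalpha() and word not in STOPWORDS:
            yield word


freq = Counter(content_words(tokens))
print(freq.most_common(2))
```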
