当前位置：极客教程 > Python > Python 文本处理 > Python – 过滤重复词汇

Python – 过滤重复词汇

Python – 过滤重复词汇

很多时候，我们要分析文本中仅出现一次的词语。所以，我们需要从文本中消除重复的单词。这可以通过在nltk中使用词汇标记和集合函数来实现。

不保留顺序

在以下示例中，我们首先将句子分词为单词。然后我们应用set()函数，它创建一个无序的独特元素集合。结果是独特的单词，它们没有排序。

import nltk
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# 首先进行单词分词
nltk_tokens = nltk.word_tokenize(word_data)

# 应用set
no_order = list(set(nltk_tokens))

print no_order

当我们运行上面的程序时，我们得到以下输出：

['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']

保留顺序

要在保留句子中的单词顺序的情况下获取删除重复项后的单词，我们读取单词并将其添加到列表中。通过附加它来实现。

import nltk
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# 首先进行单词分词
nltk_tokens = nltk.word_tokenize(word_data)

ordered_tokens = set()
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)

print result

当我们运行上面的程序时，我们得到以下输出：

['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']

Python教程

Python 教程

Python 教程

Tkinter 教程

Tkinter 教程

Pandas 教程

Pandas 教程

NumPy 教程

NumPy 教程

Flask 教程

Flask 教程

Django 教程

Django 教程

PySpark 教程

PySpark 教程

wxPython 教程

wxPython 教程

SymPy 教程

SymPy 教程

Seaborn 教程

Seaborn 教程

SciPy 教程

SciPy 教程

RxPY 教程

RxPY 教程

Pycharm 教程

Pycharm 教程

Pygame 教程

Pygame 教程

PyGTK 教程

PyGTK 教程

PyQt 教程

PyQt 教程

PyQt5 教程

PyQt5 教程

PyTorch 教程

PyTorch 教程

Matplotlib 教程

Matplotlib 教程

Web2py 教程

Web2py 教程

BeautifulSoup 教程

BeautifulSoup 教程

Java教程

Java 教程

Java 教程

Web教程

HTML 教程

HTML 教程

CSS 教程

CSS 教程

CSS3 教程

CSS3 教程

jQuery 教程

jQuery 教程

Ajax 教程

Ajax 教程

AngularJS 教程

AngularJS 教程

TypeScript 教程

TypeScript 教程

WordPress 教程

WordPress 教程

Laravel 教程

Laravel 教程

Next.js 教程

Next.js 教程

PhantomJS 教程

PhantomJS 教程

Three.js 教程

Three.js 教程

Underscore.JS 教程

Underscore.JS 教程

WebGL 教程

WebGL 教程

WebRTC 教程

WebRTC 教程

VueJS 教程

VueJS 教程

数据库教程

SQL 教程

SQL 教程

MySQL 教程

MySQL 教程

MongoDB 教程

MongoDB 教程

PostgreSQL 教程

PostgreSQL 教程

SQLite 教程

SQLite 教程

Redis 教程

Redis 教程

MariaDB 教程

MariaDB 教程

图形图像教程

Vulkan 教程

Vulkan 教程

OpenCV 教程

OpenCV 教程

大数据教程

R语言教程

R语言教程

开发工具教程

Git 教程

Git 教程

VSCode 教程

VSCode 教程

Docker 教程

Docker 教程

Gerrit 教程

Gerrit 教程

Excel 教程

Excel 教程

计算机教程

Go语言教程

Go语言教程

C++ 教程

C++ 教程

回顶
回顶部