Processing Word Documents with Python

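To read a Word document, we use the docx module, which is provided by the python-docx package. First install the package as shown below. Then write a program that uses the docx module to read the whole file paragraph by paragraph.

The following command installs the package into our environment. Note that the package is installed as python-docx but imported as docx.

pip install python-docx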
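Because the install name (python-docx) differs from the import name (docx), it can help to confirm that the module is importable before running the examples. The following is a minimal sketch using only the standard library; the helper name module_available is hypothetical, and the check only confirms that some module named docx can be found, not which package provides it.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(name) is not None


if module_available("docx"):
    print("docx module found")
else:
    print("docx module not found; try: pip install python-docx")
```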

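In the example below, we read the content of the Word document by appending the text of each paragraph to a list, and finally print the text of all paragraphs joined by newlines.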

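import docx

def readtxt(filename):
    # Open the document and collect the text of every paragraph.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraph texts with newlines to rebuild the full text.
    return '\n'.join(fullText)

# Raw string so the backslash in the Windows-style path is kept literally.
print(readtxt(r'path\Tutorialspoint.docx'))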

When we run the above program, we get the following output:

Tutorials Point originated from the idea that there exists a class of readers who respond 
better to online content and prefer to learn new skills at their own pace from the comforts 
of their drawing rooms. 

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, 
we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from programming languages 
to web designing to academics and much more.
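The blank line in the output above suggests that the document contains an empty paragraph between the two blocks of text, since every paragraph object is returned, including empty ones. As a minimal sketch, assuming we only want paragraphs that contain text, the blank ones can be filtered out. The helper names paragraph_texts and read_nonempty are hypothetical; only docx.Document and the paragraphs and text attributes come from the examples above.

```python
def paragraph_texts(paragraphs, skip_empty=True):
    """Return the stripped text of each paragraph, optionally dropping blank ones."""
    texts = []
    for para in paragraphs:
        text = para.text.strip()
        if text or not skip_empty:
            texts.append(text)
    return texts


def read_nonempty(filename):
    """Open a .docx file and return the text of its non-blank paragraphs."""
    import docx  # provided by the python-docx package; imported only when needed
    return paragraph_texts(docx.Document(filename).paragraphs)
```

Calling read_nonempty with the path to a document would return a list of strings, one per non-blank paragraph, which can then be joined or indexed as needed.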

Reading Individual Paragraphs

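We can use the paragraphs attribute to read a specific paragraph from the Word document. In the example below, we read only the paragraph at index 2. Because indexes start at 0, this is the third paragraph object; in the sample document it holds the second block of text, presumably because an empty paragraph sits between the two blocks.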

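import docx

# Raw string so the backslash in the Windows-style path is kept literally.
doc = docx.Document(r'path\Tutorialspoint.docx')
# Total number of paragraph objects in the document.
print(len(doc.paragraphs))

# Indexes start at 0, so index 2 is the third paragraph object.
print(doc.paragraphs[2].text)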

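When we run the above program, we get the following output for the selected paragraph (the paragraph count printed by the first print call is not shown here):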

The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository 
which now proudly flaunts a wealth of tutorials and allied articles on topics 
ranging from programming languages to web designing to academics and much more.
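Since empty paragraphs also occupy an index, it is easy to pick the wrong one, and an index past the end of the list raises an IndexError. The following is a minimal sketch, assuming we want to see which index holds which text and to look up a paragraph without risking an exception. The helper names describe_paragraphs, paragraph_at, and show_paragraphs are hypothetical.

```python
def describe_paragraphs(paragraphs, width=40):
    """Return (index, preview) pairs so each index can be matched to its text."""
    return [(i, para.text[:width]) for i, para in enumerate(paragraphs)]


def paragraph_at(paragraphs, index):
    """Return the text at a zero-based index, or None if the index is out of range."""
    if 0 <= index < len(paragraphs):
        return paragraphs[index].text
    return None


def show_paragraphs(filename):
    """Print every paragraph index together with a short preview of its text."""
    import docx  # provided by the python-docx package; imported only when needed
    doc = docx.Document(filename)
    for i, preview in describe_paragraphs(doc.paragraphs):
        print(i, repr(preview))
```

Running show_paragraphs on a document first makes it clear which index to pass to paragraph_at, and empty paragraphs show up as empty strings in the listing.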