Python – Processing PDF

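Python can read a PDF file, extract its text content, and print it. To do this, we first need to install the required module, PyPDF2. The command below installs it; pip must already be available in your Python environment.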

pip install pypdf2
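
Before moving on, it can be useful to confirm that the installation worked. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found (hypothetical helper)."""
    # find_spec() returns None when no top-level module with this name exists.
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip install pypdf2` has succeeded in this environment.
print(is_installed('PyPDF2'))
```

Note that the import name is `PyPDF2` (mixed case), even though pip accepts the lowercase package name.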

Once the module is installed, we can use the methods it provides to read a PDF file.

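import PyPDF2

# Raw string so the backslash in the path is kept literally.
pdfName = r'path\Tutorialspoint.pdf'
# PdfReader, pages and extract_text() replace the older PdfFileReader,
# getPage() and extractText(), which were removed in PyPDF2 3.0.
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]  # first page (index 0)
page_content = page.extract_text()
print(page_content)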

When we run the above program, we get the following output:

Tutorials Point originated from the idea that there exists a class of readers who respond better 
to online content and prefer to learn new skills at their own pace from the comforts of their 
drawing rooms.

The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository which now 
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
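
Reading a single page can fail in two common ways: the requested index may not exist, or the page may have no extractable text (for example, a scanned image). The sketch below shows one way to guard against both. It assumes a reader object that exposes `pages` and page objects that expose `extract_text()`, as PyPDF2's `PdfReader` does; the names `read_page_text` and `print_first_page` are hypothetical helpers, and returning an empty string for a missing page is a design choice, not something the library prescribes.

```python
def read_page_text(reader, index=0):
    """Return the text of one page, or '' if the page is missing or empty."""
    pages = reader.pages
    # Reject indexes outside 0 .. len(pages) - 1 instead of raising IndexError.
    if not 0 <= index < len(pages):
        return ''
    # Fall back to '' if the page yields no text.
    return pages[index].extract_text() or ''


def print_first_page(path):
    """Open a PDF with PyPDF2 and print the text of its first page."""
    import PyPDF2  # imported here so the helper above works without it

    reader = PyPDF2.PdfReader(path)
    print(read_page_text(reader, 0))
```

Calling `print_first_page(r'path\Tutorialspoint.pdf')` would print the same text as the example above when the file exists.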

Reading Multiple Pages

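To read a PDF with multiple pages and print each page along with its page number, we use a loop and the get_page_number() method (named getPageNumber() in older PyPDF2 releases). The example below uses a PDF file with two pages, and the content is printed under two separate page headings.

import PyPDF2

# Raw string so the backslash in the path is kept literally.
pdfName = r'Path\Tutorialspoint2.pdf'
read_pdf = PyPDF2.PdfReader(pdfName)

# get_page_number() is zero-based, so add 1 for a human-readable page number.
for i in range(len(read_pdf.pages)):
    page = read_pdf.pages[i]
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)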

When we run the above program, we get the following output:

Page No - 1
Tutorials Point originated from the idea that there exists a class of readers who respond better to 
online content and prefer to learn new skills at their own pace from the comforts of their drawing 
rooms. 


Page No - 2

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it 
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from programming languages to web 
designing to academics and much more.
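
The page-by-page loop can also be written with `enumerate`, which supplies the page number directly instead of looking it up for each page. This is a minimal sketch rather than the tutorial's own method; `format_pages` and `print_all_pages` are hypothetical names, and the code assumes each page object offers `extract_text()`, as PyPDF2 page objects do.

```python
def format_pages(pages):
    """Build one 'Page No - N' section per page and join them into a string."""
    sections = []
    # start=1 gives human-readable page numbers without manual arithmetic.
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ''
        sections.append('Page No - ' + str(number) + '\n' + text)
    return '\n\n'.join(sections)


def print_all_pages(path):
    """Open a PDF with PyPDF2 and print every page under its own heading."""
    import PyPDF2  # imported here so format_pages() works without it

    reader = PyPDF2.PdfReader(path)
    print(format_pages(reader.pages))
```

Building the string first and printing it afterwards keeps the formatting logic separate from file access, which makes it easier to reuse, for example to write the text to a file.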
