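Python - Process PDF
Python can read a PDF file, extract its text, and print the content. To do this, we first install the required module, PyPDF2. The command below installs it; pip should already be available in your Python environment.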
pip install pypdf2
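Before moving on, it can help to confirm that the module is importable from the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name is_installed is our own and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name of the package installed above.
    print("PyPDF2 available:", is_installed("PyPDF2"))
```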
After a successful installation, we can use the methods available in the module to read the PDF file.
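import PyPDF2

# Raw string so the backslash in the Windows-style path is taken literally
pdfName = r'path\Tutorialspoint.pdf'
# PdfReader replaces the older PdfFileReader in recent PyPDF2 releases
read_pdf = PyPDF2.PdfReader(pdfName)
# Pages are zero-indexed, so this is the first page
page = read_pdf.pages[0]
page_content = page.extract_text()
print(page_content)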
When we run the above program, we get the following output:
Tutorials Point originated from the idea that there exists a class of readers who respond better
to online content and prefer to learn new skills at their own pace from the comforts of their
drawing rooms.
The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository which now
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
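The extracted text keeps the line breaks of the PDF layout, as the output above shows. If running text is preferred, the whitespace can be collapsed after extraction. The sketch below is one simple approach in plain Python; join_lines is a hypothetical helper name, and it assumes that paragraph boundaries do not need to be preserved.

```python
def join_lines(text):
    """Collapse all runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


# A fragment of the output above, with its original line break.
sample = "from the comforts of their\ndrawing rooms."
print(join_lines(sample))
```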
Reading Multiple Pages
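To read a PDF with multiple pages and print each page together with its page number, we use a loop along with the get_page_number() method (named getPageNumber() in older PyPDF2 releases). In the example below, we have a PDF file with two pages. The contents are printed under two separate page headings.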
import PyPDF2

# Raw string so the backslash in the Windows-style path is taken literally
pdfName = r'Path\Tutorialspoint2.pdf'
read_pdf = PyPDF2.PdfReader(pdfName)
for page in read_pdf.pages:
    # get_page_number() returns a zero-based index, so add 1 for display
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)
When we run the above program, we get the following output:
Page No - 1
Tutorials Point originated from the idea that there exists a class of readers who respond better to
online content and prefer to learn new skills at their own pace from the comforts of their drawing
rooms.
Page No - 2
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from p
rogramming languages to web
designing to academics and much more.
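If the page text is needed for further processing rather than only printed, the same loop can collect it into a list. The sketch below assumes a PyPDF2 release that provides PdfReader, pages, and extract_text(); the function names read_pages and format_pages are hypothetical, and the PDF path in the usage comment is a placeholder.

```python
def read_pages(pdf_path):
    """Return the extracted text of every page in the PDF as a list of strings."""
    # Imported here so that format_pages() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


def format_pages(page_texts):
    """Prefix each page's text with a 1-based 'Page No - N' heading."""
    blocks = []
    for number, text in enumerate(page_texts, start=1):
        blocks.append("Page No - " + str(number) + "\n" + text)
    return "\n".join(blocks)


# Usage (placeholder path, assumed to exist):
# print(format_pages(read_pages(r'Path\Tutorialspoint2.pdf')))
```

Keeping the reading and the formatting in separate functions means the headings can be changed, or the text passed to other code, without touching the PDF handling.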