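Python OCR on All the Images in a Folder Simultaneously
If you have a folder full of images containing text that needs to be extracted, either into separate text files named after the corresponding images or into one single file, this is the code you are looking for.
This article covers the basics of OCR (Optical Character Recognition) and shows how to create an output .txt file for every image in the main folder and save it in a predetermined directory.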
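Libraries required:
pip3 install pillow
pip3 install pytesseract
The os module is part of the Python standard library, so it does not need to be installed with pip. You also need tesseract-ocr, the OCR engine that the pytesseract library wraps; it is a separate program that must be downloaded and installed on the system.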
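Because pytesseract only calls the tesseract-ocr program, it can help to confirm that the program is reachable before running OCR. The following is a minimal sketch using only the standard library; the helper name find_tesseract is hypothetical, and it assumes the executable is named tesseract and is on the system PATH.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract was not found on PATH; install it or add it to PATH")
    else:
        print("tesseract found at", location)
```

If the program is installed somewhere that is not on PATH, pytesseract can be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.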
Below is the Python implementation:
# Python program to extract text from all the images in a folder
# storing the text in corresponding files in a different folder
from PIL import Image
import pytesseract as pt
import os

def main():
    # path of the folder containing the raw images
    path = "E:\\GeeksforGeeks\\images"
    # path of the folder where the output text files are saved
    tempPath = "E:\\GeeksforGeeks\\textFiles"
    # iterating over the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        # applying OCR using pytesseract
        text = pt.image_to_string(img, lang="eng")
        # removing the extension (for example .jpg) from the image name
        baseName = os.path.splitext(imageName)[0]
        fullTempPath = os.path.join(tempPath, 'time_' + baseName + ".txt")
        print(text)
        # saving the text for every image in a separate .txt file
        with open(fullTempPath, "w") as file1:
            file1.write(text)

if __name__ == '__main__':
    main()
Input image:
image_sample1
Output:
geeksforgeeks
geeksforgeeks
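The loop above passes every entry returned by os.listdir to Image.open, so a stray non-image file or a subfolder in the images folder would raise an error. One way to guard against that is sketched below, under stated assumptions: the helper names list_image_files and output_name and the extension set are hypothetical and not part of the original program.

```python
import os

# Assumed set of extensions; adjust it to the formats actually present in the folder.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def list_image_files(folder):
    """Return the sorted names of regular files in `folder` with an image extension."""
    names = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        # Skip subfolders and files whose extension is not in the assumed set.
        if os.path.isfile(full_path) and extension in IMAGE_EXTENSIONS:
            names.append(name)
    return sorted(names)


def output_name(image_name, prefix="time_"):
    """Build the text-file name for an image: prefix + name without extension + .txt."""
    base_name = os.path.splitext(image_name)[0]
    return prefix + base_name + ".txt"
```

Iterating over list_image_files(path) instead of os.listdir(path) keeps the rest of the loop unchanged. Using os.path.splitext rather than slicing off the last four characters also handles extensions of other lengths, such as .jpeg or .tiff.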
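If you want to store the text from all the images in a single output file, the code is slightly different. The main difference is that the output file is opened in "a+" mode, which appends the text to the file and creates the file if it does not already exist.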
# extract text from all the images in a folder
# storing the text in a single file
from PIL import Image
import pytesseract as pt
import os

def main():
    # path of the folder containing the raw images
    path = "E:\\GeeksforGeeks\\images"
    # path of the file in which the output is kept
    fullTempPath = "E:\\GeeksforGeeks\\output\\outputFile.txt"
    # iterating over the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        # applying OCR using pytesseract
        text = pt.image_to_string(img, lang="eng")
        # "a+" creates the file if it is not present
        # and appends the text content if it is present
        with open(fullTempPath, "a+") as file1:
            # writing the name of the image
            file1.write(imageName + "\n")
            # writing the content of the image
            file1.write(text + "\n")
    # printing the contents of the output file
    with open(fullTempPath, 'r') as file2:
        print(file2.read())

if __name__ == '__main__':
    main()
Input images:
image_sample1
image_sample2
Output:
The program produces a single file containing the text extracted from every image in the folder. The file has the following format:
Name of the image
Content of the image
Name of the next image and so on .....
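The append behaviour that produces this format can be tried without running OCR at all. The sketch below is a minimal illustration under stated assumptions: the helper name append_entry is hypothetical, the file is written to a temporary folder, and placeholder strings stand in for real OCR results.

```python
import os
import tempfile


def append_entry(output_path, image_name, text):
    """Append one image's name and its extracted text to the shared output file."""
    # Mode "a" creates the file if it is missing and always writes at the end.
    with open(output_path, "a", encoding="utf-8") as output_file:
        output_file.write(image_name + "\n")
        output_file.write(text + "\n")


if __name__ == "__main__":
    # Placeholder strings stand in for real OCR results in this demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "outputFile.txt")
    append_entry(demo_path, "image_sample1.jpg", "text from the first image")
    append_entry(demo_path, "image_sample2.jpg", "text from the second image")
    with open(demo_path, encoding="utf-8") as output_file:
        print(output_file.read())
```

One consequence of append mode is that running the program again adds the new results after the old ones instead of replacing them, so delete the output file first if a fresh result is wanted.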