How to extract text from a directory of PDF files efficiently with OCR?


In your code, you are extracting the text, but you don't do anything with it.

Try something like this:

    import textract


    def extract_txt(file_path):
        # OCR the PDF; textract returns the extracted text as bytes.
        text = textract.process(file_path, method='tesseract')
        outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
        with open(outfn, 'wb') as output_file:
            output_file.write(text)
        return file_path

This writes the text to a file that has the same name but a .txt extension.

It also returns the path of the original file to let the parent know that this file is done.
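The slice file_path[:-4] only works when the extension is exactly four characters long. A minimal sketch of a more tolerant way to build the output name, using os.path.splitext from the standard library (the helper name txt_name is hypothetical, not part of the original code):

```python
import os


def txt_name(file_path):
    """Return file_path with its extension replaced by '.txt'."""
    # splitext splits off only the last extension, whatever its length or case.
    root, _ext = os.path.splitext(file_path)
    return root + '.txt'
```

Inside extract_txt, outfn = txt_name(file_path) could then replace the slicing line.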

So I would change the mapping code to:

    import multiprocessing

    if __name__ == '__main__':  # needed when worker processes are spawned
        file_path = ['/Users/user/Desktop/sample.pdf']
        with multiprocessing.Pool() as p:
            for fn in p.imap_unordered(extract_txt, file_path):
                print('completed file:', fn)

  • You don't need to give an argument when creating a Pool. By default it will create as many workers as there are CPU cores.
  • Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
  • Because the worker function returned the filename, you can print it to let the user know that this file is done.
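Since the question is about a whole directory, here is a rough sketch of how the list of paths could be built and fed to the pool. The names find_pdfs and process_directory are hypothetical, and the sketch assumes the worker is a top-level function such as extract_txt that returns the path it was given:

```python
import multiprocessing
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


def process_directory(directory, worker, processes=None):
    """Run `worker` on every PDF in `directory` using a process pool.

    `worker` must be defined at module level so it can be pickled.
    processes=None lets the pool use one worker per CPU core.
    """
    done = []
    with multiprocessing.Pool(processes) as pool:
        # Results arrive in completion order, not in input order.
        for fn in pool.imap_unordered(worker, find_pdfs(directory)):
            print('completed file:', fn)
            done.append(fn)
    return done
```

A call such as process_directory('/Users/user/Desktop', extract_txt) should be placed under an if __name__ == '__main__': guard.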

Edit 1:

The additional question is whether it is possible to mark page boundaries. I think it is.

A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. You could then use e.g. pdfseparate from the same package to convert that one PDF file of N pages into N PDF files of one page each, and OCR the single-page PDF files separately. That would give you the text of each page separately.
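A sketch of that approach, assuming pdfinfo and pdfseparate from poppler-utils are installed and on the PATH. The function names and the output pattern are hypothetical; pdfseparate replaces %d in the pattern with the page number, starting at 1:

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line found in pdfinfo output')
    return int(match.group(1))


def split_pdf(pdf_path, pattern='page-%d.pdf'):
    """Split pdf_path into single-page PDFs and return their file names."""
    info = subprocess.run(
        ['pdfinfo', pdf_path],
        check=True, capture_output=True, text=True,
    ).stdout
    n_pages = parse_page_count(info)
    # pdfseparate writes one file per page, substituting the page number for %d.
    subprocess.run(['pdfseparate', pdf_path, pattern], check=True)
    return [pattern % page for page in range(1, n_pages + 1)]
```

Each returned file could then be passed to textract.process on its own.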

Alternatively, you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the method above.
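To illustrate the second idea, a minimal sketch that splits the OCR output on a footer line. The footer format "Page N of M" is purely an assumption for the example; a real document would need its own pattern, and OCR errors in the footer would cause missed breaks:

```python
import re


def split_on_footer(text, footer_pattern=r'^Page \d+ of \d+[ \t]*$'):
    """Split OCR text into pages, assuming every page ends with a footer line."""
    parts = re.split(footer_pattern, text, flags=re.MULTILINE)
    # Drop the empty remainder after the last footer and trim whitespace.
    return [part.strip() for part in parts if part.strip()]
```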


Edit 2:

If you need a file, write a file:

    import os

    import textract
    # These names belong to the PyPDF2 API before version 3.0.
    from PyPDF2 import PdfFileWriter, PdfFileReader


    def extract_text(pdf_file):
        outfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
        with open(pdf_file, 'rb') as infile, \
                open(outfname, 'w', encoding='utf-8') as textfile:
            inputpdf = PdfFileReader(infile)
            for i in range(inputpdf.numPages):
                # Write page i to a temporary single-page PDF.
                w = PdfFileWriter()
                w.addPage(inputpdf.getPage(i))
                pagefname = 'page{:03d}.pdf'.format(i)
                with open(pagefname, 'wb') as outfile:
                    w.write(outfile)
                print('page', i)
                # textract returns bytes; decode before combining with str.
                text = textract.process(pagefname, method='tesseract').decode('utf-8')
                # Add header and footer.
                text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
                # Write the OCR-ed text to the output file.
                textfile.write(text)
                os.remove(pagefname)  # clean up.
                print(text)
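If the marked-up text file later needs to be taken apart again, something like the following might do. It is only a sketch: mark_page reproduces the marker format used above, and read_pages (a hypothetical helper) assumes the markers never occur inside the OCR text itself:

```python
import re

# Matches one marked page; \1 requires the end marker to carry the same index.
PAGE_RE = re.compile(
    r'<begin page pos = (\d+)>\n(.*?)\n<end page pos = \1>',
    re.DOTALL,
)


def mark_page(index, text):
    """Wrap the text of one page in begin/end markers."""
    return ('\n<begin page pos = {}>\n'.format(index) + text
            + '\n<end page pos = {}>\n'.format(index))


def read_pages(marked_text):
    """Return a dict mapping page index to the text between its markers."""
    return {int(m.group(1)): m.group(2) for m in PAGE_RE.finditer(marked_text)}
```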