Extracting text from a PDF file using PDFMiner in Python?

python python-3.x python-2.7 text-extraction pdfminer

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's module structure changed recently. The code above targets the new layout and should work for extracting text from PDF files.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
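A note on the page-selection arguments used in the function above: an empty `pagenos` set means every page is processed, and `maxpages = 0` means no page limit. As far as I can tell, `PDFPage.get_pages` compares `pagenos` against zero-based page indices, so page 1 of the document is index 0. Here is a minimal sketch of building that set from a 1-based spec. `parse_page_range` is a hypothetical helper written for illustration, not part of PDFMiner:

```python
def parse_page_range(spec):
    """Convert a 1-based page spec such as "1-3,5" into zero-based indices."""
    pagenos = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: get_pages matches against zero-based page indices,
        # so human page numbers are shifted down by one.
        pagenos.update(range(first - 1, last))
    return pagenos


pagenos = parse_page_range("1-3,5")
print(sorted(pagenos))  # [0, 1, 2, 4]
```

The resulting set can replace the `pagenos = set()` line in `convert_pdf_to_txt` to restrict extraction to those pages.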


Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and use:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


This works as of May 2020 using pdfminer.six with Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf', 'rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io
import requests

response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
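A download that fails quietly often hands back an HTML error page instead of a PDF, so it can be worth checking the bytes before extracting. Below is a minimal sketch of that idea. `extract_text_from_bytes` and its `extractor` parameter are hypothetical names of my own, and the header check is only a heuristic (a PDF normally begins with `%PDF-`):

```python
import io


def extract_text_from_bytes(data, extractor=None):
    """Extract text from PDF bytes held in memory.

    `extractor` is any callable that accepts a binary stream. When omitted,
    pdfminer.high_level.extract_text is imported and used.
    """
    # Heuristic: reject payloads that do not start with the PDF header.
    if not data.lstrip().startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF")
    if extractor is None:
        # Imported lazily so the check above works without pdfminer.six.
        from pdfminer.high_level import extract_text
        extractor = extract_text
    # Wrap the bytes in a file-like object, as in the snippet above.
    return extractor(io.BytesIO(data))
```

With requests, usage might look like `text = extract_text_from_bytes(requests.get(url).content)`.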

Performance and Reliability compared with PyPDF2

pdfminer.six works more reliably than PyPDF2, which fails on certain types of PDFs, in particular PDF version 1.7 files.

However, text extraction with pdfminer.six is significantly slower than with PyPDF2, by a factor of about 6.

I timed text extraction with timeit on a 15" MacBook Pro (2018), timing only the extraction function (no file opening etc.) with a 10-page PDF, and got the following results:

pdfminer.six: 2.88 sec
PyPDF2:       0.45 sec
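For anyone who wants to repeat the comparison on their own files, here is a rough sketch of a timing harness built on timeit. `best_time` and `slowdown` are hypothetical helpers, not the exact script behind the numbers above. To time only the extraction, pass in a function that works on data you have already loaded.

```python
import timeit


def best_time(func, *args, repeat=3):
    """Return the fastest of `repeat` single runs of func(*args), in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=1))


def slowdown(slow_seconds, fast_seconds):
    """How many times slower the first measurement is than the second."""
    return slow_seconds / fast_seconds


# Figures reported above for the 10-page PDF.
print(round(slowdown(2.88, 0.45), 1))  # 6.4
```

Taking the minimum of several runs reduces noise from other processes, which matters when a single extraction only takes a few seconds.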

pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other build tools installed. This pushed a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.
