How to read line by line in pdf file using PyPdf?


Looks like what you have is a large chunk of text data that you want to interpret line-by-line.

You can use the StringIO class to wrap that content as a seekable file-like object:

>>> import StringIO
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = StringIO.StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
>>> buf.readline()
'big\n'
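On Python 3 the standalone StringIO module no longer exists; the class lives in the io module instead. A minimal sketch of the same idea, reusing the sample string from the session above (the variable names first, rest and all_lines are only for illustration):

```python
import io

# Same sample text as in the interactive session above.
content = 'big\nugly\ncontents\nof\nmultiple\npdf files'

# io.StringIO wraps a str as a seekable, file-like object.
buf = io.StringIO(content)

# readline() returns the next line, including its trailing newline.
first = buf.readline()

# read() returns everything after the current position.
rest = buf.read().splitlines()

# seek(0) rewinds, so the buffer can be iterated again from the start.
buf.seek(0)
all_lines = [line.rstrip('\n') for line in buf]

print(first, rest, all_lines)
```

If seeking is not needed, content.splitlines() alone gives the same list of lines without the wrapper.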

In your case, do:

from StringIO import StringIO

# Read each line of the PDF
pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    doSomething(line.strip())


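import pyPdf

def getPDFContent(path):
    content = ""
    # Hard-coded page count: assumes the PDF has at least 10 pages
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        # Extract the text of each page and append it
        content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces, then collapse every run of
    # whitespace (newlines included) into a single space
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content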
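One caveat about getPDFContent, shown here as a small sketch with made-up sample text (no PDF involved): its final " ".join(... .split()) step replaces every run of whitespace, newlines included, with a single space, so the returned string has no line breaks left to iterate over. Tidying each line on its own keeps the lines apart. The helper name normalize_lines is hypothetical, not part of pyPdf.

```python
# Made-up text standing in for what extractText() might return.
raw = "first  line\u00a0here\nsecond   line\n\nthird line\n"

# What getPDFContent does: all whitespace (including '\n') becomes one space.
collapsed = " ".join(raw.replace("\u00a0", " ").strip().split())


def normalize_lines(text):
    """Hypothetical helper: tidy spacing inside each line, keep line breaks."""
    lines = []
    for line in text.replace("\u00a0", " ").splitlines():
        line = " ".join(line.split())
        if line:  # drop lines that are empty after cleanup
            lines.append(line)
    return lines


# The collapsed text is a single line; the per-line version keeps three.
print(collapsed.splitlines())
print(normalize_lines(raw))
```

With the collapsed string, a loop over the StringIO buffer runs only once, which is probably not what a line-by-line reader wants.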


Using yield and PdfFileReader.pages can simplify things:

from pyPdf import PdfFileReader

def get_pdf_content_lines(pdf_file_path):
    # Open in binary mode: a PDF is not a plain-text file
    with open(pdf_file_path, "rb") as f:
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            # Yield the page's text one line at a time
            for line in page.extractText().splitlines():
                yield line

for line in get_pdf_content_lines('/path/to/file.pdf'):
    print line
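The code above targets Python 2 and the old pyPdf package. As a rough Python 3 sketch of the same generator pattern, assuming the successor package pypdf is installed (it names the reader PdfReader and the method extract_text): the line-yielding part is split into iter_page_lines so it can be tried without a PDF. Both iter_page_lines and FakePage are hypothetical names introduced only for this illustration.

```python
def iter_page_lines(pages):
    """Yield text lines from any iterable of objects with extract_text()."""
    for page in pages:
        text = page.extract_text() or ""  # guard against pages with no text
        for line in text.splitlines():
            yield line


def get_pdf_content_lines(pdf_file_path):
    """Yield every text line of a PDF, page by page (needs pypdf)."""
    # Imported lazily; assumes the third-party pypdf package is installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_file_path)
    yield from iter_page_lines(reader.pages)


class FakePage:
    """Stand-in page so the generator can be exercised without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


# Try the generator on fake pages, including one with no text.
demo = list(iter_page_lines([FakePage("a\nb"), FakePage(""), FakePage("c")]))
print(demo)
```

With a real file, the loop would read: for line in get_pdf_content_lines('/path/to/file.pdf'): print(line). How well the extracted lines match the visual lines on the page depends on the PDF itself.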

In addition, some may google "python get pdf content text" (this is how I got here), so here's how:

from pyPdf import PdfFileReader

def get_pdf_content(pdf_file_path):
    # Open in binary mode: a PDF is not a plain-text file
    with open(pdf_file_path, "rb") as f:
        pdf_reader = PdfFileReader(f)
        # Join the text of all pages, then collapse runs of whitespace
        content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
        content = ' '.join(content.split())
        return content

print get_pdf_content('/path/to/file.pdf')