How to read a PDF file line by line using pyPdf?

Looks like what you have is a large chunk of text data that you want to interpret line-by-line.

You can use the StringIO class to wrap that content as a seekable file-like object:

>>> import StringIO
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = StringIO.StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
>>> buf.readline()
'big\n'
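On Python 3 the `StringIO` module no longer exists; the class lives in the standard `io` module instead. A minimal sketch of the same idea, assuming Python 3 and the same sample string (the helper name `read_lines` is made up for illustration):

```python
import io


def read_lines(content):
    """Wrap a text string in a file-like buffer and return its lines."""
    buf = io.StringIO(content)
    # Iterating over a StringIO yields one line at a time, newline included.
    return [line.rstrip("\n") for line in buf]


content = "big\nugly\ncontents\nof\nmultiple\npdf files"
lines = read_lines(content)
print(lines)
```

Iterating over the buffer is usually more convenient than calling `readline()` repeatedly, and `buf.seek(0)` still rewinds it if a second pass is needed.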

In your case, do:

from StringIO import StringIO

# Read each line of the PDF
pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    doSomething(line.strip())

python pdf pypdf

import pyPdf

def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
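Note that `" ".join(content.split())` collapses every run of whitespace, newlines included, so the string returned by `getPDFContent` contains no line breaks and a later line-by-line loop sees a single long line. A small sketch of a variant that tidies each line but keeps the breaks; it assumes the text has already been extracted, and the name `normalize_keep_lines` is hypothetical:

```python
def normalize_keep_lines(text):
    """Collapse whitespace inside each line while preserving line breaks."""
    cleaned = []
    for line in text.replace(u"\xa0", u" ").splitlines():
        # split() with no arguments drops leading, trailing and repeated whitespace.
        line = u" ".join(line.split())
        if line:  # skip lines that were empty or whitespace-only
            cleaned.append(line)
    return u"\n".join(cleaned)


sample = u"first   line\xa0here\n\n  second\tline  \n"
print(normalize_keep_lines(sample))
```

Whether the extracted text contains useful line breaks in the first place depends on the PDF and on the extraction library, so this only helps when the breaks are there to preserve.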


Using yield and PdfFileReader.pages can simplify things:

from pyPdf import PdfFileReader

def get_pdf_content_lines(pdf_file_path):
    # Open in binary mode: a PDF is not a text file
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            # Yield the extracted text of each page one line at a time
            for line in page.extractText().splitlines():
                yield line

for line in get_pdf_content_lines('/path/to/file.pdf'):
    print(line)
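The generator pattern used here does not depend on pyPdf itself. As a rough sketch, the same idea can be written against any iterable of already-extracted page strings; the function name `iter_lines` and the sample pages are assumptions for illustration, not part of pyPdf:

```python
def iter_lines(page_texts):
    """Lazily yield every line from an iterable of page text strings."""
    for text in page_texts:
        for line in text.splitlines():
            yield line


# Stand-in for the text extracted from two pages.
pages = ["page one, line one\npage one, line two", "page two, line one"]
for line in iter_lines(pages):
    print(line)
```

Because the function yields lines as it goes, the caller can stop early without the remaining pages ever being processed.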

In addition, some people may arrive here after googling "python get pdf content text" (that is how I got here), so here is how to get the whole text as a single string:

from pyPdf import PdfFileReader

def get_pdf_content(pdf_file_path):
    # Open in binary mode: a PDF is not a text file
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
        # Collapse all runs of whitespace (including newlines) into single spaces
        content = ' '.join(content.split())
        return content

print(get_pdf_content('/path/to/file.pdf'))
