How to read a PDF file line by line using pyPdf?
Looks like what you have is a large chunk of text data that you want to interpret line-by-line.
You can use the StringIO class to wrap that content as a seekable file-like object:
    >>> import StringIO
    >>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
    >>> buf = StringIO.StringIO(content)
    >>> buf.readline()
    'big\n'
    >>> buf.readline()
    'ugly\n'
    >>> buf.readline()
    'contents\n'
    >>> buf.readline()
    'of\n'
    >>> buf.readline()
    'multiple\n'
    >>> buf.readline()
    'pdf files'
    >>> buf.seek(0)
    >>> buf.readline()
    'big\n'
In your case, do:
    from StringIO import StringIO

    # Read each line of the PDF
    pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore"))
    for line in pdfContent:
        doSomething(line.strip())
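On Python 3 the separate StringIO module is gone and the class lives in io. Here is a minimal sketch of the same idea under that assumption; `iter_stripped_lines` and the sample string are made up for illustration and stand in for whatever text your own extraction function returns:

```python
import io


def iter_stripped_lines(text):
    """Yield each line of `text` with surrounding whitespace removed."""
    buf = io.StringIO(text)  # wrap the string as a file-like object
    for line in buf:         # iterating a file-like object yields one line at a time
        yield line.strip()


# Sample text standing in for the extracted PDF content (hypothetical data).
sample = "big\nugly\ncontents\nof\nmultiple\npdf files"
lines = list(iter_stripped_lines(sample))
print(lines)
```

If you only ever iterate once and never need seek(), `text.splitlines()` does the same job without the wrapper.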
Using yield and PdfFileReader.pages can simplify things:
    from pyPdf import PdfFileReader

    def get_pdf_content_lines(pdf_file_path):
        # Open in binary mode: a PDF is not a text file
        with open(pdf_file_path, 'rb') as f:
            pdf_reader = PdfFileReader(f)
            for page in pdf_reader.pages:
                for line in page.extractText().splitlines():
                    yield line

    for line in get_pdf_content_lines('/path/to/file.pdf'):
        print line
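The generator pattern itself does not depend on pyPdf. As a rough sketch, assuming you already have the text of each page as a plain string (the `iter_lines` helper and the `pages` list below are hypothetical stand-ins, not part of any library):

```python
from itertools import islice


def iter_lines(page_texts):
    """Lazily yield every line from an iterable of page strings."""
    for text in page_texts:
        for line in text.splitlines():
            yield line


# Hypothetical page texts, standing in for what extractText() would return.
pages = ["first page\nsecond line", "next page"]

all_lines = list(iter_lines(pages))

# Because the function is a generator, you can stop early without
# processing the remaining pages.
first_two = list(islice(iter_lines(pages), 2))
print(all_lines)
print(first_two)
```

The lazy behaviour is the main reason to prefer yield here: with a large PDF you can process or stop after the first few lines instead of building the whole list in memory.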
In addition, some people may land here after searching for "python get pdf content text" (that is how I got here), so here is how to get the whole text:
    from pyPdf import PdfFileReader

    def get_pdf_content(pdf_file_path):
        # Open in binary mode: a PDF is not a text file
        with open(pdf_file_path, 'rb') as f:
            pdf_reader = PdfFileReader(f)
            content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
            # Collapse every run of whitespace (newlines included) into one space
            content = ' '.join(content.split())
            return content

    print get_pdf_content('/path/to/file.pdf')
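Note that the `' '.join(content.split())` step removes line breaks as well as extra spaces, so the result is one long line. A small sketch of that behaviour, using a made-up string in place of real PDF text:

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


# Made-up sample with mixed whitespace, standing in for extracted PDF text.
raw = "  a   b\n\nc\t d \n"
normalized = collapse_whitespace(raw)
print(repr(normalized))
```

If you need to keep the line structure, skip this step and work with `content.splitlines()` instead.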