Retrieve page numbers from document with pyPDF

python pypdf

The following worked for me:

    from PyPDF2 import PdfFileReader

    pdf = PdfFileReader(open('path/to/file.pdf', 'rb'))
    pdf.getNumPages()


The other answers use PyPDF/PyPDF2, which seem to read the entire file. This takes a long time for large files.

In the meantime I wrote something quick and dirty which doesn't take nearly as long to run. It makes a shell call, but I wasn't aware of any other way to do it. It returns the page count very quickly, even for PDFs of around 5000 pages.

It works by just calling the "pdfinfo" shell command, so it probably only works on Linux. I've only tested it on Ubuntu so far.

One strange behavior I've seen is that surrounding this in a try/except block doesn't catch errors; you have to catch subprocess.CalledProcessError explicitly.

    from subprocess import check_output

    def get_num_pages(pdf_path):
        output = check_output(["pdfinfo", pdf_path]).decode()
        pages_line = [line for line in output.splitlines() if "Pages:" in line][0]
        num_pages = int(pages_line.split(":")[1])
        return num_pages
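A slightly more defensive variant, as a rough sketch rather than a tested replacement: it keeps the parsing separate from the shell call, so the parsing can be checked without pdfinfo installed, and it catches both subprocess.CalledProcessError (pdfinfo ran but exited with an error) and FileNotFoundError (pdfinfo is not on the PATH). The names parse_page_count and get_num_pages_safe are made up for illustration, and returning None on failure is just one possible choice.

```python
import subprocess
from typing import Optional


def parse_page_count(pdfinfo_output: str) -> Optional[int]:
    """Return the value of the "Pages" line in pdfinfo's output, or None."""
    for line in pdfinfo_output.splitlines():
        # pdfinfo prints "Key:   value" pairs, one per line.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Pages":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def get_num_pages_safe(pdf_path: str) -> Optional[int]:
    """Run pdfinfo on pdf_path; return the page count, or None on failure."""
    try:
        output = subprocess.check_output(
            ["pdfinfo", pdf_path], stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: pdfinfo exited non-zero (e.g. unreadable file).
        # FileNotFoundError: the pdfinfo executable could not be found.
        return None
    return parse_page_count(output.decode(errors="replace"))
```

Matching on the key instead of searching for "Pages:" anywhere in the line avoids picking up, say, a title that happens to contain that text.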


For full documentation, see Adobe's 978-page PDF Reference. :-)

More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

In PyPDF, to get at this information, try, as a starting point:

pdf.trailer["/Root"]["/PageLabels"]["/Nums"]

By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
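To give an idea of the mapping step, here is a minimal sketch in plain Python. It assumes the /Nums array has already been fetched and every indirect object resolved, so that it is a flat list alternating a zero-based start index and a label dictionary with the keys /S (style), /P (prefix) and /St (first number). It also assumes the entries are sorted by start index and that the labels are not split across /Kids nodes, which larger documents may do. The function names are invented for this example.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n: int) -> str:
    """1 -> A, 26 -> Z, 27 -> AA, 28 -> BB (the PDF letter style)."""
    letter = chr(ord("A") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


def page_label(nums: list, page_index: int) -> str:
    """Return the label for the page at zero-based page_index."""
    # Pick the last range whose start index is <= page_index.
    start, label_dict = None, None
    for i in range(0, len(nums) - 1, 2):
        if nums[i] <= page_index:
            start, label_dict = nums[i], nums[i + 1]
    if label_dict is None:
        # Assumption: no range covers this page, use the physical number.
        return str(page_index + 1)

    prefix = str(label_dict.get("/P", ""))
    style = label_dict.get("/S")
    number = label_dict.get("/St", 1) + (page_index - start)

    if style == "/D":
        numeric = str(number)
    elif style == "/R":
        numeric = to_roman(number)
    elif style == "/r":
        numeric = to_roman(number).lower()
    elif style == "/A":
        numeric = to_letters(number)
    elif style == "/a":
        numeric = to_letters(number).lower()
    else:
        numeric = ""  # no /S entry: the label is the prefix only
    return prefix + numeric


# Four Roman-numbered pages, three decimal ones, then an appendix at A-8.
nums = [0, {"/S": "/r"}, 4, {"/S": "/D"}, 7, {"/S": "/D", "/P": "A-", "/St": 8}]
labels = [page_label(nums, i) for i in range(9)]
```

With the sample data, labels comes out as i, ii, iii, iv, 1, 2, 3, A-8, A-9.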

Your alternative is, as you say, to check the text objects and try to figure out which is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to try to fish out the page number from that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job (PyPDF is written in Python, after all) and use it as the basis of a routine that checks each text object on the page individually to see if it looks like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
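For the "does this look like a page number" check, a crude heuristic might be something like the following. It is only a sketch: the function name is made up, it handles decimal and Roman numbering but not alphabetic labels, and the four-digit limit and the stripping of surrounding dashes are arbitrary choices.

```python
import re

# Up to four digits; longer runs are more likely years, IDs, etc.
_DECIMAL = re.compile(r"\d{1,4}")

# Well-formed Roman numerals, upper or lower case. The lookahead rejects
# the empty string, which the rest of the pattern would otherwise match.
_ROMAN = re.compile(
    r"(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    re.IGNORECASE,
)


def looks_like_page_number(text: str) -> bool:
    """Guess whether a single text fragment is a page number."""
    # Tolerate decorations such as "- 12 -".
    candidate = text.strip(" \t\r\n-")
    return bool(_DECIMAL.fullmatch(candidate) or _ROMAN.fullmatch(candidate))


fragments = ["Chapter 1", "Some body text", "- 12 -", "xiv"]
candidates = [f for f in fragments if looks_like_page_number(f)]
```

Short words that happen to be valid Roman numerals ("mix", "I") will pass this test, so in practice you would want to combine it with something else, for example the fragment's position on the page or whether the value increases by one from the previous page.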
