How to check if PDF is scanned image or contains text


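The code below extracts the text layer from a PDF. For a searchable PDF it returns the text; for a fully scanned (non-searchable) PDF the result stays empty or nearly empty, because no OCR is performed.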

    import fitz  # PyMuPDF

    text = ""
    path = "Your_scanned_or_partial_scanned.pdf"
    doc = fitz.open(path)
    for page in doc:
        # get_text() is the current name; older PyMuPDF releases used getText()
        text += page.get_text()

If you don't have the fitz module, install PyMuPDF, which provides it:

    pip install --upgrade pymupdf
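To turn the extraction above into an actual scanned-or-not check, one option is to look at the text page by page instead of concatenating it. The following is a minimal sketch, not part of the original answer: the helper names `extract_page_texts` and `pages_without_text` and the `min_chars` threshold are assumptions chosen for illustration.

```python
from typing import List


def extract_page_texts(path: str) -> List[str]:
    """Return the text layer of each page (requires PyMuPDF)."""
    import fitz  # imported lazily so the helper below works without PyMuPDF

    doc = fitz.open(path)
    try:
        # get_text() is the current name; older releases used getText()
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def pages_without_text(page_texts: List[str], min_chars: int = 1) -> List[int]:
    """Return 1-based numbers of pages whose text layer is (nearly) empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars  # hypothetical threshold
    ]
```

Calling `pages_without_text(extract_page_texts(path))` then lists the pages that are probably scanned images. Raising `min_chars` makes the check tolerant of stray characters such as page numbers.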


Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

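You need to install PyMuPDF, which provides the fitz module. Do not install the separate fitz package from PyPI; it is an unrelated project and can conflict with PyMuPDF. You can install it with pip:

    pip3 install PyMuPDF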

And here is the Python3 implementation:

    import fitz  # PyMuPDF


    def get_text_percentage(file_name: str) -> float:
        """
        Calculate the fraction (0.0 to 1.0) of the document's page area
        that is covered by (searchable) text.

        If the returned value is very low, the document is
        most likely a scanned PDF.
        """
        total_page_area = 0.0
        total_text_area = 0.0

        doc = fitz.open(file_name)
        for page in doc:
            total_page_area += abs(page.rect)  # abs() of a Rect is its area
            # get_text("blocks") is the current API;
            # older PyMuPDF releases used getTextBlocks()
            for b in page.get_text("blocks"):
                if b[6] != 0:  # block type: 0 = text, 1 = image
                    continue
                r = fitz.Rect(b[:4])  # rectangle where block text appears
                total_text_area += abs(r)
        doc.close()
        return total_text_area / total_page_area


    if __name__ == "__main__":
        text_perc = get_text_percentage("my.pdf")
        print(text_perc)
        if text_perc < 0.01:
            print("fully scanned PDF - no relevant text")
        else:
            print("not fully scanned PDF - text is present")

Although this answers your question (i.e. it distinguishes between fully scanned PDFs and fully or partially textual ones), it cannot distinguish between fully textual PDFs and scanned PDFs that also contain text.
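One possible way to narrow that gap is to measure image coverage as well as text coverage: a page that is almost entirely covered by an image and also has text is likely a scan with a text layer on top. This is only a heuristic sketch. The function names, the labels, and the thresholds `text_min` and `image_min` are assumptions, and a text PDF with a full-page background image would be misclassified.

```python
def classify_page(page_area, text_area, image_area,
                  text_min=0.01, image_min=0.8):
    """Label one page from its area measurements (thresholds are guesses)."""
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    has_text = text_area / page_area >= text_min
    has_big_image = image_area / page_area >= image_min
    if has_big_image and has_text:
        return "scanned with text layer"
    if has_big_image:
        return "scanned"
    if has_text:
        return "text"
    return "empty or unknown"


def page_areas(page):
    """Return (page_area, text_area, image_area) for a PyMuPDF page."""
    text_area = 0.0
    image_area = 0.0
    # In "dict" output every block has a "type" (0 = text, 1 = image)
    # and a "bbox" given as (x0, y0, x1, y1).
    for block in page.get_text("dict")["blocks"]:
        x0, y0, x1, y1 = block["bbox"]
        area = abs(x1 - x0) * abs(y1 - y0)
        if block["type"] == 0:
            text_area += area
        elif block["type"] == 1:
            image_area += area
    return abs(page.rect), text_area, image_area
```

For each page of an opened document, `classify_page(*page_areas(page))` gives a label. The thresholds will need tuning against your own files.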


    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage


    def get_pdf_searchable_pages(fname):
        searchable_pages = []
        non_searchable_pages = []
        page_num = 0

        with open(fname, 'rb') as infile:
            for page in PDFPage.get_pages(infile):
                page_num += 1
                # a page that declares font resources is treated as searchable
                if 'Font' in page.resources.keys():
                    searchable_pages.append(page_num)
                else:
                    non_searchable_pages.append(page_num)

        if page_num > 0:
            if len(searchable_pages) == 0:
                print(f"Document '{fname}' has {page_num} page(s). "
                      f"Complete document is non-searchable")
            elif len(non_searchable_pages) == 0:
                print(f"Document '{fname}' has {page_num} page(s). "
                      f"Complete document is searchable")
            else:
                print(f"searchable_pages : {searchable_pages}")
                print(f"non_searchable_pages : {non_searchable_pages}")
        else:
            print("Not a valid document")


    if __name__ == '__main__':
        get_pdf_searchable_pages("1.pdf")
        get_pdf_searchable_pages("1Scanned.pdf")

Output:

    Document '1.pdf' has 1 page(s). Complete document is searchable
    Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
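If the result has to be consumed by other code rather than read from the console, the same branching can return a value instead of printing. A small sketch, where the function name `summarize_pages` and the returned labels are made up for illustration:

```python
def summarize_pages(searchable_pages, non_searchable_pages):
    """Collapse the two page lists into a single verdict string."""
    if not searchable_pages and not non_searchable_pages:
        return "invalid"          # no pages were found at all
    if not searchable_pages:
        return "non-searchable"   # every page lacks font resources
    if not non_searchable_pages:
        return "searchable"       # every page declares font resources
    return "mixed"                # some pages of each kind
```

For example, `summarize_pages([1, 3], [2])` returns `"mixed"`, so the caller can decide to run OCR only on page 2.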