How to check if PDF is scanned image or contains text

python python-3.x pypdf2 pdfminer pdf-extraction

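The code below extracts whatever text a PDF contains, whether the file is fully searchable or partially scanned. If `text` comes back empty (or only whitespace), the PDF most likely has no text layer, i.e. it is a scanned image.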

import fitz  # PyMuPDF

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    # get_text() was called getText() in older PyMuPDF releases
    text += page.get_text()

If you don't have the fitz module, install PyMuPDF, which provides it:

pip install --upgrade pymupdf
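To turn the extraction above into an actual scanned-or-not check, one option is to look at how much text each page yields. The following is a minimal sketch, not a definitive implementation: the helper names `pages_without_text` and `find_scanned_pages` and the `min_chars` threshold are assumptions made for illustration. It uses `page.get_text()`, the name in recent PyMuPDF releases (older releases call it `getText()`).

```python
from typing import Iterable, List


def pages_without_text(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is shorter than min_chars.

    min_chars is an arbitrary illustrative threshold; tune it for your documents.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def find_scanned_pages(path: str, min_chars: int = 20) -> List[int]:
    """Open a PDF with PyMuPDF and list the pages that look like scanned images."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(path)
    try:
        # Recent PyMuPDF releases use get_text(); older ones call it getText().
        texts = [page.get_text() for page in doc]
    finally:
        doc.close()
    return pages_without_text(texts, min_chars)
```

An empty list from `find_scanned_pages("my.pdf")` would suggest that every page has a text layer, while a list containing every page number would suggest a fully scanned document.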


Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

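You need to install PyMuPDF, which provides the fitz module. Do not install the separate fitz package from PyPI as well: it is an unrelated project and can conflict with PyMuPDF. You can install it with pip:

pip3 install PyMuPDF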

And here is the Python3 implementation:

import fitz  # PyMuPDF


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    The result is expressed as a fraction of the total page area.
    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page in doc:
        total_page_area += abs(page.rect)  # abs() of a Rect is its area
        text_area = 0.0
        # get_text("blocks") was called getTextBlocks() in older PyMuPDF releases
        for b in page.get_text("blocks"):
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area += abs(r)
        total_text_area += text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. it distinguishes fully scanned PDFs from fully or partially textual ones), this solution cannot distinguish fully textual PDFs from scanned PDFs that also contain text.
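One way to narrow down that remaining case is to measure image coverage alongside text coverage on each page: a page that is almost entirely covered by an image yet still yields text is probably a scan with an added text layer. The sketch below is only a heuristic under stated assumptions: the function names, the labels, and both thresholds are hypothetical, and it relies on PyMuPDF's `page.get_text("dict")` reporting text blocks with `type` 0 and image blocks with `type` 1.

```python
from typing import List


def classify_page(page_area: float, text_area: float, image_area: float,
                  min_text: float = 0.01, min_image: float = 0.8) -> str:
    """Label a page from its text and image coverage.

    min_text and min_image are illustrative guesses, not PyMuPDF values.
    """
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    has_text = text_area / page_area >= min_text
    mostly_image = image_area / page_area >= min_image
    if mostly_image and has_text:
        return "scanned with text layer"
    if mostly_image:
        return "scanned"
    if has_text:
        return "text"
    return "empty or vector graphics only"


def rect_area(bbox) -> float:
    """Area of an (x0, y0, x1, y1) bounding box; degenerate boxes count as 0."""
    x0, y0, x1, y1 = bbox
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def classify_pdf(path: str) -> List[str]:
    """Return one label per page of the PDF at path."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(path)
    try:
        labels = []
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            # Overlapping blocks are counted more than once; fine for a rough check.
            text_area = sum(rect_area(b["bbox"]) for b in blocks if b["type"] == 0)
            image_area = sum(rect_area(b["bbox"]) for b in blocks if b["type"] == 1)
            labels.append(classify_page(abs(page.rect), text_area, image_area))
        return labels
    finally:
        doc.close()
```

Because the labels come from coverage ratios alone, a text page with a large background image would be mislabelled, so the thresholds should be checked against a few known documents first.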


def get_pdf_searchable_pages(fname):
    # pip install pdfminer.six  (the maintained fork of pdfminer)
    from pdfminer.pdfpage import PDFPage

    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            # A page that declares fonts in its resources is treated as searchable
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
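The function above only prints its findings. If the result is needed programmatically, the same `Font`-resource check can return data instead. This is a minimal sketch assuming pdfminer (or pdfminer.six) is installed; `summarize` and `searchable_flags` are hypothetical names, and a page counts as searchable when its resources contain a `Font` entry, which is a heuristic rather than a guarantee.

```python
from typing import List, Optional


def summarize(flags: List[bool]) -> Optional[str]:
    """Collapse per-page searchable flags into one verdict (None if no pages)."""
    if not flags:
        return None
    if all(flags):
        return "searchable"
    if not any(flags):
        return "non-searchable"
    return "mixed"


def searchable_flags(fname: str) -> List[bool]:
    """Return one flag per page: True if the page declares a Font resource."""
    from pdfminer.pdfpage import PDFPage  # imported here so summarize() works without it

    with open(fname, "rb") as infile:
        return [
            "Font" in (page.resources or {})
            for page in PDFPage.get_pages(infile)
        ]
```

For example, `summarize(searchable_flags("1.pdf"))` would be expected to give `"searchable"` for a document like the first one in the output above.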
