How to check if a PDF is a scanned image or contains text
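The code below extracts the text layer from a PDF. For a searchable or partially scanned PDF it returns the embedded text; for a fully scanned PDF with no text layer, text stays empty, which is how you can tell the two cases apart.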
import fitz  # PyMuPDF

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    # get_text() is the current name of the older getText() method
    text += page.get_text()
If you don't have the fitz module, install it with:
pip install --upgrade pymupdf
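Once the loop has filled the text string, you still need to decide whether it counts as "real" text. Here is a minimal sketch of such a check, assuming that a handful of stray characters should not count as a text layer. The helper name has_text_layer and the min_chars cutoff are hypothetical choices for illustration, not part of PyMuPDF.

```python
def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Return True if `text` holds at least `min_chars` non-whitespace characters.

    `min_chars` is an assumed cutoff; tune it for your own documents.
    """
    # Count only visible characters so that pages made of blank lines
    # or spaces are not mistaken for searchable text.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


if __name__ == "__main__":
    # In practice, pass the string accumulated from page.get_text().
    sample = "This page has a real text layer."
    print("searchable" if has_text_layer(sample) else "probably scanned")
```

Counting visible characters rather than testing for an empty string makes the check a little more tolerant of PDFs that contain only whitespace or a stray page number.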
Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
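You need to install the PyMuPDF package, which provides the fitz module (the separate fitz package on PyPI is an unrelated project and should not be installed alongside it). You can do so with pip:
pip3 install PyMuPDF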
And here is the Python3 implementation:
import fitz  # PyMuPDF


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page_num, page in enumerate(doc):
        # abs() of a fitz.Rect gives its area
        total_page_area = total_page_area + abs(page.rect)
        text_area = 0.0
        # get_text("blocks") is the current form of the older getTextBlocks()
        for b in page.get_text("blocks"):
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()

    # Avoid dividing by zero for a document without pages
    if total_page_area == 0:
        return 0.0
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")
Although this answers your question (i.e. distinguish between fully scanned and full/partial textual PDFs), this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them.
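One way to narrow that limitation is to apply the same ratio page by page instead of over the whole document, so that mixed documents show up as such. The sketch below works on plain (page_area, text_area) pairs, which you could collect from abs(page.rect) and the summed block areas in the function above. The names classify_pages and summarize and the 0.01 threshold are assumptions for illustration. It still cannot detect a scanned page that carries an OCR text layer.

```python
from typing import List, Tuple


def classify_pages(
    page_areas: List[Tuple[float, float]], threshold: float = 0.01
) -> List[str]:
    """Label each page "text" or "scanned" from (page_area, text_area) pairs."""
    labels = []
    for page_area, text_area in page_areas:
        # Guard against degenerate pages with zero area.
        ratio = text_area / page_area if page_area > 0 else 0.0
        labels.append("text" if ratio >= threshold else "scanned")
    return labels


def summarize(labels: List[str]) -> str:
    """Collapse per-page labels into a document-level verdict."""
    if not labels:
        return "empty"
    if all(label == "scanned" for label in labels):
        return "fully scanned"
    if all(label == "text" for label in labels):
        return "fully textual"
    return "mixed"


if __name__ == "__main__":
    # Hypothetical areas: page 1 is half covered by text, page 2 has none.
    labels = classify_pages([(1000.0, 500.0), (1000.0, 0.0)])
    print(labels, summarize(labels))
```

Keeping the classification separate from the PDF parsing also makes the threshold easy to test and tune without opening any files.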
def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage

    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            # A page that declares fonts is treated as searchable
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")
Output:
Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable