
Extract images from PDF without resampling, in python?


You can use the PyMuPDF module. It writes every image out as a .png file, but it works out of the box and is fast.

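import fitz  # PyMuPDF

# Note: these are the older camelCase method names; newer PyMuPDF releases
# spell them doc.get_page_images(i) and pix.save(filename).
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    # Each entry describes one image used on page i; its first item is the xref.
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None          # release the pixmap's memory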

see here for more resources


With the PyPDF2 and Pillow libraries, this is simple in Python:

import PyPDF2
from PIL import Image

if __name__ == '__main__':
    # Note: PdfFileReader, getPage, getObject and getData are the older
    # PyPDF2 names (pre-3.0).
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                # Raw pixel data: rebuild the image and save it as PNG.
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                # The stream is already a JPEG file: write it unchanged.
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                # The stream is already a JPEG 2000 file: write it unchanged.
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
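The three filter branches above matter for the "without resampling" part of the question: a /DCTDecode or /JPXDecode stream is already a complete JPEG or JPEG 2000 file and can be saved byte-for-byte, while /FlateDecode holds raw samples that have to be re-wrapped (losslessly) in another format. A minimal sketch of that decision follows. The helper name passthrough_extension and the lookup table are hypothetical, not part of PyPDF2, and the sketch assumes /Filter arrives either as a single name or as a list of names.

```python
# Hypothetical helper, not part of PyPDF2: decide whether an image stream
# can be written to disk exactly as stored in the PDF.

# Filters whose stream bytes are already a complete image file.
PASSTHROUGH_EXTENSIONS = {
    "/DCTDecode": ".jpg",  # JPEG
    "/JPXDecode": ".jp2",  # JPEG 2000
}


def passthrough_extension(filter_entry):
    """Return a file extension if the stream can be saved unchanged, else None."""
    # /Filter may be a single name or an array of names applied in sequence.
    if isinstance(filter_entry, str):
        filters = [filter_entry]
    else:
        filters = list(filter_entry or [])
    # Only a lone image filter leaves the stored bytes as a ready-made file.
    if len(filters) == 1:
        return PASSTHROUGH_EXTENSIONS.get(str(filters[0]))
    return None
```

In the loop above, a None result would mean falling back to the Pillow branch (or skipping the image), while any other result means the data can be written straight to a file with that extension.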


Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
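As a rough illustration of the byte-range idea, here is a minimal sketch using only the standard library. It is not the code from the linked post. It assumes each JPEG sits unencoded inside a stream ... endstream pair, starts with the JPEG start-of-image marker (FF D8) within a few bytes of the stream keyword, and ends with the last end-of-image marker (FF D9) before endstream. The function name extract_jpegs and the 20-byte search window are choices made for this example.

```python
from typing import List

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_jpegs(pdf_bytes: bytes) -> List[bytes]:
    """Return every byte range in pdf_bytes that looks like an embedded JPEG."""
    images = []
    pos = 0
    while True:
        # Find the next "stream" keyword (this can also hit the tail of
        # "endstream", which is harmless because no JPEG follows it).
        stream = pdf_bytes.find(b"stream", pos)
        if stream < 0:
            break
        # A JPEG stream begins with SOI right after the keyword and newline.
        start = pdf_bytes.find(SOI, stream, stream + 20)
        if start < 0:
            pos = stream + 20
            continue
        endstream = pdf_bytes.find(b"endstream", start)
        if endstream < 0:
            break
        # Take the LAST end marker before "endstream", so that an embedded
        # thumbnail's own end marker does not truncate the image.
        end = pdf_bytes.rfind(EOI, start, endstream)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(EOI)])
        pos = endstream + len(b"endstream")
    return images
```

To use it, read the whole PDF in binary mode, pass the bytes to extract_jpegs, and write each returned item to its own .jpg file. Images stored with any other filter (or wrapped in an extra compression layer) are not found this way.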