How to extract PDF fields from a filled out form in Python?


You should be able to do it with pdfminer, but it requires some digging into pdfminer's internals and some knowledge of the PDF format (forms in particular, but also PDF's internal structures such as "dictionaries" and "indirect objects").

This example might help you on your way (I think it will only work for simple cases, with no nested fields, etc.):

    import sys
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]
    fp = open(filename, 'rb')

    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # The catalog's AcroForm entry may be an indirect object, so resolve it first
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        # 'T' is the field name, 'V' is its current value
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))

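EDIT: I forgot to mention: if the PDF is password-protected, pass the password to doc.initialize(). (In newer pdfminer releases such as pdfminer.six, pass it as the password argument to PDFDocument instead.)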


Python 3.6+:

pip install PyPDF2

    # -*- coding: utf-8 -*-

    from collections import OrderedDict
    from PyPDF2 import PdfFileWriter, PdfFileReader


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        infile = PdfFileReader(open(infile, 'rb'))
        fields = _getFields(infile)
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'FormExample.pdf'

        pprint(get_form_fields(pdf_file_name))


The Python PyPDF2 package (successor to pyPdf) is very convenient:

    import PyPDF2

    f = PyPDF2.PdfFileReader('form.pdf')
    ff = f.getFields()

Then ff is a dict that contains all the relevant form information.
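If you only want the filled-in values, the dict can be reduced to a plain name-to-value mapping. This is a minimal sketch: field_values is a hypothetical helper name, and it assumes each entry behaves like a dict carrying the value under '/V' (as the Field objects returned by getFields() do) and that getFields() returns None when the PDF contains no form. In recent releases (PyPDF2 3.x and its successor pypdf) the equivalent calls are, as far as I know, spelled PdfReader(...) and get_fields().

```python
def field_values(fields, default=''):
    """Map each field name to its '/V' entry, or `default` when it is unset."""
    if not fields:
        # No form data found (None) or an empty form
        return {}
    return {name: field.get('/V', default) for name, field in fields.items()}


# With PyPDF2, assuming `ff` was obtained as in the snippet above:
#   values = field_values(ff)
#   for name, value in values.items():
#       print('{0}: {1}'.format(name, value))
```

Fields that were left empty usually have no '/V' entry at all, which is why a default is supplied instead of indexing with field['/V'].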