How to extract PDF fields from a filled out form in Python?

python forms pdf

You should be able to do it with pdfminer, but it will require some delving into pdfminer's internals and some knowledge of the PDF format (forms, of course, but also PDF's internal structures such as "dictionaries" and "indirect objects").

This example might help you on your way (I think it will only work in simple cases, with no nested fields, etc.):

    import sys
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # The catalog's AcroForm dictionary lists the top-level form fields
        fields = resolve1(doc.catalog['AcroForm'])['Fields']
        for i in fields:
            field = resolve1(i)  # dereference the indirect object
            # 'T' is the field name, 'V' is its value
            name, value = field.get('T'), field.get('V')
            print('{0}: {1}'.format(name, value))

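EDIT: I forgot to mention that if you need to provide a password, pass it to doc.initialize(). (In pdfminer versions where PDFDocument takes the parser in its constructor, as in the example above, the password is instead the second constructor argument: PDFDocument(parser, password).)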


Python 3.6+:

pip install PyPDF2

    # -*- coding: utf-8 -*-
    from collections import OrderedDict

    from PyPDF2 import PdfFileReader


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        # Keep the file open while the fields are read, then close it
        with open(infile, 'rb') as fh:
            # _getFields returns None when the PDF has no AcroForm
            fields = _getFields(PdfFileReader(fh)) or {}
            return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'FormExample.pdf'
        pprint(get_form_fields(pdf_file_name))


The Python PyPDF2 package (successor to pyPdf) is very convenient:

    import PyPDF2
    f = PyPDF2.PdfFileReader('form.pdf')
    ff = f.getFields()

Then ff is a dict that contains all the relevant form information.
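If only the filled-in values are needed, the dict can be reduced to a name-to-value mapping. This is a minimal sketch under two assumptions: field_values is a hypothetical helper name, and each entry of ff is taken to be a dict-like field object that stores its value under the '/V' key (the same key the fieldAttributes table in the earlier answer labels 'Value').

```python
def field_values(fields):
    """Map each field name to its '/V' entry (None if the field is empty).

    `fields` is a getFields()-style mapping of name -> dict-like field,
    or None when the PDF contains no form.
    """
    if not fields:
        return {}
    return {name: field.get('/V') for name, field in fields.items()}


if __name__ == '__main__':
    # Plain dicts standing in for the field objects, for illustration only
    example = {
        'name': {'/FT': '/Tx', '/T': 'name', '/V': 'Ada'},
        'email': {'/FT': '/Tx', '/T': 'email'},  # left blank in the form
    }
    print(field_values(example))
```

With the snippet above, field_values(ff) would give the values keyed by field name. Fields left blank come back as None.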
