PDF compare on linux command line

linux pdf comparison ghostscript

I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

ImageMagick's compare command
the pdftk utility (if you have multipage PDFs)
Ghostscript (optional)
md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

Each pixel that remains unchanged becomes white.
Each pixel that got changed is painted in red.

That diff image is saved as a new PDF to make it more accessible across different OS platforms.

I use this, for example, to discover minimal page display differences when font substitution comes into play during PDF processing.

It can happen that there is no visible difference between your PDFs even though their MD5 hashes and/or file sizes differ. In that case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white diffs, so that you only have to visually inspect the non-white ones.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple singlepage PDFs:

pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
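If you want to script the splitting step, here is a minimal sketch. The function name burst_pdf and the default output directory are my own illustrative choices, not part of pdftk; the sketch only wraps the burst call shown above and derives the page prefix from the input file name.

```shell
# Hypothetical helper: burst one PDF into single-page files named
# <outdir>/<stem>---page_001.pdf, <outdir>/<stem>---page_002.pdf, ...
burst_pdf() {
    local input="$1"
    local outdir="${2:-somewhere}"   # assumed default, matches the example path
    local stem
    stem=$(basename "$input" .pdf)   # "dir/file_1.pdf" -> "file_1"
    mkdir -p "$outdir" || return 1
    pdftk "$input" burst output "$outdir/${stem}---page_%03d.pdf"
}

# Usage sketch:
#   burst_pdf file_1.pdf somewhere
#   burst_pdf file_2.pdf somewhere
```

Deriving the prefix from the file name keeps the page files of both inputs side by side in one directory, which makes it easy to pair them up later.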

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder \
       -log "%u %m:%l %e" \
        somewhere/file_1---page_001.pdf \
        somewhere/file_2---page_001.pdf \
       -compose src \
        somewhereelse/file_1--file_2---diff_page_001.pdf
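For multipage documents you will want to repeat this for every page pair. The following is a rough sketch, assuming the page files follow the "---page_NNN.pdf" naming used above; the function name diff_pages and its argument order are hypothetical. As far as I know, compare exits with 0 for similar images, 1 for dissimilar ones and 2 on error, so the sketch treats only a status above 1 as a failure.

```shell
# Hypothetical helper: diff every page of <a> against the same page of <b>.
# Arguments: directory with page files, output directory, stem a, stem b.
diff_pages() {
    local dir="$1" outdir="$2" a="$3" b="$4"
    local page1 page2 suffix status failed=0
    mkdir -p "$outdir" || return 1
    for page1 in "$dir/${a}"---page_*.pdf; do
        [ -e "$page1" ] || continue            # glob matched nothing
        suffix="${page1##*---page_}"           # e.g. "001.pdf"
        page2="$dir/${b}---page_${suffix}"
        if [ ! -e "$page2" ]; then
            echo "no counterpart for $page1" >&2
            failed=1
            continue
        fi
        status=0
        compare "$page1" "$page2" -compose src \
            "$outdir/${a}--${b}---diff_page_${suffix}" || status=$?
        # 0 = similar, 1 = dissimilar; anything above 1 is treated as an error
        if [ "$status" -gt 1 ]; then
            echo "compare failed on $page1" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage sketch:
#   diff_pages somewhere somewhereelse file_1 file_2
```

Pages that exist in only one of the two documents are reported on stderr rather than compared, since there is nothing to diff them against.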

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using the ppmraw output device. You can do that like this:

First, find out the page size of your PDF. The identify utility, which comes as part of any ImageMagick installation, reports it:

identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

export my_size=$(identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf)
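Before handing that value to another tool it may be worth checking that it really is a single WIDTHxHEIGHT pair. If identify fails, the variable ends up empty, and for a multipage PDF the sizes of several pages may end up concatenated. Here is a small sketch of such a check; the function name is_geometry is hypothetical.

```shell
# Hypothetical helper: succeed only for a single WIDTHxHEIGHT value
# such as 595x842 (digits, one "x", digits).
is_geometry() {
    case "$1" in
        '' | *[!0-9x]*) return 1 ;;    # empty, or contains other characters
    esac
    case "$1" in
        *x*x*)         return 1 ;;     # more than one "x", e.g. two pages glued together
        [0-9]*x*[0-9]) return 0 ;;     # digits on both sides of the single "x"
        *)             return 1 ;;
    esac
}

# Usage sketch:
#   is_geometry "$my_size" || { echo "unexpected size: $my_size" >&2; exit 1; }
```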

Now Ghostscript comes into play, with a command line that includes the page size discovered above, as stored in the variable:

gs \
   -o somewhereelse/file_1--file_2---diff_page_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
    somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

gs \
   -o somewhereelse/file_1--file_2---whitepage_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
   -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.

MD5 sum

Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:

MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')
if [ "x${MD5_1}" == "x${MD5_2}" ]; then
    mv  \
      somewhereelse/file_1--file_2---diff_page_001.pdf \
      somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
    rm  \
      somewhereelse/file_1--file_2---*_page_001.ppm            # delete both PPMs
fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
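If you run this check for many pages, the hash comparison can be pulled into a small function. This is only a sketch; the name same_md5 is my own, and it assumes md5sum from GNU coreutils is on the PATH.

```shell
# Hypothetical helper: succeed if both files have the same MD5 hash.
same_md5() {
    local sum1 sum2
    # Reading from stdin keeps the file name out of the md5sum output.
    sum1=$(md5sum < "$1" | awk '{print $1}')
    sum2=$(md5sum < "$2" | awk '{print $1}')
    # An empty hash means the file could not be read; treat that as "different".
    [ -n "$sum1" ] && [ "$sum1" = "$sum2" ]
}

# Usage sketch:
#   if same_md5 diff_page_001.ppm whitepage_001.ppm; then
#       echo "page 001: no visible difference"
#   fi
```

Since both files are local, cmp -s would also do the job without hashing; the function above simply mirrors the MD5 approach described here.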


Here is a hack to do it.

pdftotext file1.pdf
pdftotext file2.pdf
diff file1.txt file2.txt
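The same hack can be wrapped so that it does not leave .txt files next to the PDFs. This is a sketch under a few assumptions: the function name pdf_text_diff is hypothetical, and I have added the -layout option of pdftotext, which is not part of the commands above, to keep columns roughly aligned.

```shell
# Hypothetical helper: diff the extracted text of two PDFs.
# Returns 0 if the text is identical, 1 if it differs, 2 if extraction failed.
pdf_text_diff() {
    local txt1 txt2 status=0
    txt1=$(mktemp) || return 2
    txt2=$(mktemp) || { rm -f "$txt1"; return 2; }
    if pdftotext -layout "$1" "$txt1" && pdftotext -layout "$2" "$txt2"; then
        diff -u "$txt1" "$txt2" || status=$?
    else
        status=2
    fi
    rm -f "$txt1" "$txt2"              # clean up the temporary text files
    return "$status"
}

# Usage sketch:
#   pdf_text_diff file1.pdf file2.pdf
```

Keep in mind that this only catches textual changes; differences in fonts, images or layout go unnoticed.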


Done in 2 lines with (the almighty) ImageMagick and pdftk:

compare -verbose -debug coder $PDF_1 $PDF_2 -compose src $OUT_FILE.tmp
pdftk $OUT_FILE.tmp background $PDF_1 output $OUT_FILE

The options -verbose and -debug are optional.

compare creates a PDF with the diff as red pixels.
pdftk merges the diff-pdf with background PDF_1
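For repeated use, the two steps can be wrapped in a function that quotes its arguments and removes the temporary file afterwards. This is a minimal sketch; the name overlay_diff is hypothetical, and it assumes that compare exits with a status above 1 only on a real error (1 just means the inputs differ).

```shell
# Hypothetical helper: write a diff of <pdf1> and <pdf2> to <out>,
# with <pdf1> as the background of the red diff pixels.
overlay_diff() {
    local pdf1="$1" pdf2="$2" out="$3"
    local tmp="${out}.tmp" status=0
    compare "$pdf1" "$pdf2" -compose src "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then       # treated as a compare error
        rm -f "$tmp"
        return "$status"
    fi
    status=0
    pdftk "$tmp" background "$pdf1" output "$out" || status=$?
    rm -f "$tmp"                       # drop the intermediate diff
    return "$status"
}

# Usage sketch:
#   overlay_diff old.pdf new.pdf diff.pdf
```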
