
Percentage value with GNU Diff


Something like this perhaps?

Two files, A1 and A2.

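$ sdiff -B -b -s A1 A2 | wc -l

would give you the number of lines that differ. Running wc -l on a file gives its total line count; divide the first number by the second and multiply by 100 to get a percentage.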

The -b and -B options ignore changes in the amount of whitespace and blank lines, respectively, and -s suppresses the lines common to both files.
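
The count-and-divide idea above could be wrapped in a small function, roughly like this. It is only a sketch: the function name diff_percentage is made up for illustration, and it assumes the "total" is the line count of the first file, which the answer does not spell out. Integer division is used, so the result is rounded down.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for illustration): prints the
# percentage of lines that differ between two files, relative to the
# line count of the first file (an assumption about what "total" means).
diff_percentage() {
    local file_a=$1 file_b=$2
    local differing total

    # -s prints only the differing lines; -b and -B ignore changes in
    # whitespace and blank lines.
    differing=$(sdiff -B -b -s "$file_a" "$file_b" | wc -l)

    # Total number of lines in the first file.
    total=$(wc -l < "$file_a")

    # Avoid dividing by zero for an empty first file.
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi

    # Integer arithmetic: the result is truncated, not rounded.
    echo $(( 100 * differing / total ))
}

# Example call (A1 and A2 as in the answer above):
#   diff_percentage A1 A2
```

Note that lines present only in the second file also count as differing, so the result can exceed 100 when A2 is much longer than A1.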


Here's a script that will compare all .txt files and display the ones that have more than 15% duplication:

#!/bin/bash
# Walk through all files in the current dir (and subdirs)
# and compare them with other files, showing percentage
# of duplication.

# Which type of files to compare?
# (It wouldn't make sense to compare binary formats.)
ext="txt"

# Support filenames with spaces:
IFS=$(echo -en "\n\b")

working_dir="$PWD"
working_dir_name=$(echo "$working_dir" | sed 's|.*/||')
all_files="$working_dir/../$working_dir_name-filelist.txt"
remaining_files="$working_dir/../$working_dir_name-remaining.txt"

# Get information about files ("size name" per line, largest first),
# skipping hidden files and directories:
find . -type f -print0 | xargs -0 stat -c "%s %n" | grep -v "/\." | \
     grep "\.$ext\$" | sort -nr > "$all_files"

cp "$all_files" "$remaining_files"

while read string; do
    # Strip the leading size field, keeping the "./path" part
    fileA=$(echo "$string" | sed 's/.[^.]*\./\./')
    # Drop the current file from the list of files still to compare
    tail -n +2 "$remaining_files" > "$remaining_files.temp"
    mv "$remaining_files.temp" "$remaining_files"
    # Remove empty lines since they produce false positives
    sed '/^$/d' "$fileA" > tempA
    echo "Comparing $fileA with other files..."
    while read string; do
        fileB=$(echo "$string" | sed 's/.[^.]*\./\./')
        sed '/^$/d' "$fileB" > tempB
        A_len=$(cat tempA | wc -l)
        B_len=$(cat tempB | wc -l)
        # Skip files with no non-empty lines to avoid dividing by zero
        if [[ $B_len -eq 0 ]]; then
            continue
        fi
        differences=$(sdiff -B -s tempA tempB | wc -l)
        common=$(expr $A_len - $differences)
        percentage=$(echo "100 * $common / $B_len" | bc)
        if [[ $percentage -gt 15 ]]; then
            echo "  $percentage% duplication in" \
                 "$(echo "$fileB" | sed 's|\./||')"
        fi
    done < "$remaining_files"
    echo " "
done < "$all_files"

rm -f tempA tempB "$all_files" "$remaining_files"