How to remove the lines which appear on file B from another file A?


If the files are sorted (they are in your example):

comm -23 file1 file2

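-23 suppresses column 2 (lines found only in file2) and column 3 (lines found in both files), so only the lines unique to file1 are printed. comm requires sorted input, so if the files are not sorted, run them through sort first.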

See the comm man page for details.
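For unsorted input, the "sort first" step could look like the sketch below. The sample files and the `workdir` variable are made up for illustration, and the sorted copies are plain temporary files so that it works in any POSIX shell.

```shell
# Hypothetical sample data, deliberately unsorted.
workdir="$(mktemp -d)"
printf '%s\n' C A B > "$workdir/file1"
printf '%s\n' D B > "$workdir/file2"

# comm expects sorted input, so sort each file into a temporary copy first.
sort "$workdir/file1" > "$workdir/file1.sorted"
sort "$workdir/file2" > "$workdir/file2.sorted"

# -2 drops lines unique to file2, -3 drops lines common to both,
# leaving only the lines unique to file1.
result="$(comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted")"
printf '%s\n' "$result"

rm -rf "$workdir"
```

Note that the result comes out in sorted order, not in the original order of file1.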


grep -Fvxf <lines-to-remove> <all-lines>

  • works on non-sorted files
  • maintains the order
  • is POSIX

Example:

cat <<EOF > A
b
1
a
0
01
b
1
EOF

cat <<EOF > B
0
1
EOF

grep -Fvxf B A

Output:

b
a
01
b

Explanation:

  • -F: use literal strings instead of the default BRE
  • -x: only consider matches that match the entire line
  • -v: print only the lines that do not match
  • -f file: take patterns from the given file
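To see why -F and -x matter, here is a small comparison against plain grep -vf. The file names and contents are invented for the demonstration.

```shell
# Hypothetical sample data: one line contains a regex metacharacter,
# and one line ("0") is a substring of another ("01").
workdir="$(mktemp -d)"
printf '%s\n' 'a.c' 'abc' '01' '0' > "$workdir/all"
printf '%s\n' 'a.c' '0' > "$workdir/remove"

# Literal, whole-line matching: only the exact lines "a.c" and "0" are removed.
strict="$(grep -Fvxf "$workdir/remove" "$workdir/all")"

# Without -F and -x, "a.c" is a regex that also matches "abc", and "0"
# matches inside "01", so every line is removed. grep exits with status 1
# when it prints nothing, hence the "|| true".
loose="$(grep -vf "$workdir/remove" "$workdir/all" || true)"

printf 'with -F -x:\n%s\n' "$strict"
printf 'without:\n%s\n' "$loose"

rm -rf "$workdir"
```

With -F and -x the lines abc and 01 survive; without them nothing does.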

This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?

Here's a quick bash function that does this in place:

remove-lines() (
  remove_lines="$1"
  all_lines="$2"
  tmp_file="$(mktemp)"
  grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
  mv "$tmp_file" "$all_lines"
)

GitHub upstream.

usage:

remove-lines lines-to-remove remove-from-this-file
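A slightly more defensive variant is sketched below, in case it is useful. The function name `remove_lines_inplace` is my own (underscores so it also works outside bash), and the two changes are assumptions about what you might want: it copies the result back with cat instead of mv, so the target keeps its own permissions, and it treats grep's exit status 1 (nothing left to print) as success.

```shell
# Hypothetical variant of the function above; not the upstream version.
remove_lines_inplace() (
  remove_lines="$1"
  all_lines="$2"
  tmp_file="$(mktemp)" || exit 1

  # grep exits 1 when every line was removed; only statuses above 1 are errors.
  status=0
  grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file" || status=$?
  if [ "$status" -gt 1 ]; then
    rm -f "$tmp_file"
    exit "$status"
  fi

  # Overwrite the contents rather than replacing the file, so the
  # original file keeps its permissions.
  cat "$tmp_file" > "$all_lines"
  rm -f "$tmp_file"
)

# Demonstration with made-up files.
workdir="$(mktemp -d)"
printf '%s\n' keep drop keep2 > "$workdir/all"
printf '%s\n' drop > "$workdir/remove"
remove_lines_inplace "$workdir/remove" "$workdir/all"
cat "$workdir/all"
```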

See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another


awk to the rescue!

This solution doesn't require sorted inputs. You have to provide fileB first.

awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA

returns

A
C

How does it work?

The NR==FNR{a[$0];next} idiom stores the lines of the first file as keys of an associative array, for a later "contains" test.

NR==FNR checks whether we're still scanning the first file: only there does the global line counter (NR) equal the per-file line counter (FNR).

a[$0] adds the current line to the associative array as a key. This behaves like a set, so duplicate lines are stored only once.

!($0 in a) runs once we're in the next file(s). in is a "contains" test: it checks whether the current line is in the set populated from the first file, and ! negates the condition. The action is omitted here; it defaults to {print}, which is usually not written explicitly.
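The same one-liner, spelled out with comments and the default action written explicitly. The sample files are invented, and the array is named `seen` here only for readability.

```shell
# Hypothetical sample data.
workdir="$(mktemp -d)"
printf '%s\n' A B C B > "$workdir/fileA"
printf '%s\n' B D > "$workdir/fileB"

result="$(awk '
  # First file (fileB): remember each line as a key, then skip to the next line.
  NR == FNR { seen[$0]; next }
  # Later files (fileA): print lines that were not seen in the first file.
  !($0 in seen) { print }
' "$workdir/fileB" "$workdir/fileA")"
printf '%s\n' "$result"

rm -rf "$workdir"
```

One caveat with this idiom: if the first file is empty, NR==FNR stays true while reading the second file, so nothing is printed.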

Note that this can now be used to remove blacklisted words.

$ awk '...' badwords allwords > goodwords

With a slight change, it can clean multiple lists at once and write a cleaned version of each:

$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME ".clean")}' bad file1 file2 file3 ...
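A small run of the multi-file version, with made-up file names and contents. The output file name is parenthesized because an unparenthesized concatenation after > is parsed differently by some awk implementations.

```shell
# Hypothetical sample data: one blacklist and two lists to clean.
workdir="$(mktemp -d)"
printf '%s\n' spam > "$workdir/bad"
printf '%s\n' ham spam eggs > "$workdir/file1"
printf '%s\n' spam toast > "$workdir/file2"

# Each surviving line goes to "<input file>.clean", next to its input file.
awk 'NR == FNR { bad[$0]; next }
     !($0 in bad) { print > (FILENAME ".clean") }' \
  "$workdir/bad" "$workdir/file1" "$workdir/file2"

clean1="$(cat "$workdir/file1.clean")"
clean2="$(cat "$workdir/file2.clean")"
printf 'file1.clean:\n%s\n' "$clean1"
printf 'file2.clean:\n%s\n' "$clean2"

rm -rf "$workdir"
```

If every line of an input file is blacklisted, no .clean file is created for it, since awk only opens the output on the first print.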