
How to detect invalid UTF-8 (unicode/binary) in a text file


Assuming you have your locale set to UTF-8 (check the output of locale), this works well to recognize invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from the grep man page):

  • -a, --text: treats the file as text; essentially, it prevents grep from aborting as soon as it finds an invalid byte sequence (i.e., bytes that are not valid UTF-8)
  • -v, --invert-match: inverts the match, printing the lines that do not match
  • -x '.*' (--line-regexp): matches only complete lines that consist entirely of valid UTF-8 characters

Hence, the output is exactly the lines that contain an invalid (non-UTF-8) byte sequence (because the match is inverted with -v).
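
For a quick sanity check, you can build a small test file yourself (the name test.txt is arbitrary); bash's printf understands \xHH byte escapes:

printf 'plain ascii\n' > test.txt
printf 'valid multibyte: \xC3\xA9\n' >> test.txt    # \xC3\xA9 is é in UTF-8
printf 'broken sequence: \xC3\x28\n' >> test.txt    # \xC3 must be followed by \x80-\xBF, so this is invalid
grep -axv '.*' test.txt                             # prints only the third line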


I would grep for non-ASCII characters.

With GNU grep with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:

grep -P "[\x80-\xFF]" file

Reference: How Do I grep For all non-ASCII Characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:

if grep -qP "[\x80-\xFF]" file; then echo "file contains non-ASCII characters"; fi
#       ^
#       silent grep
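
The same test extends to several files at once; a minimal sketch, assuming GNU grep and an arbitrary *.txt glob:

for f in *.txt; do
    # -q: exit status only, no output; -P: PCRE, needed for the \xHH escapes
    if grep -qP '[\x80-\xFF]' "$f"; then
        echo "$f contains non-ASCII characters"
    fi
done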

To remove these characters, you can use:

sed -i.bak 's/[\d128-\d255]//g' file

This will create file.bak as a backup, while the original file will have its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.
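
As an illustration (the file name sample.csv is made up; note that the \dNNN decimal escapes are a GNU sed extension):

printf 'caf\xC3\xA9 menu\n' > sample.csv     # "café", with é stored as the two bytes \xC3\xA9
sed -i.bak 's/[\d128-\d255]//g' sample.csv   # if your locale trips it up, try prefixing with LC_ALL=C for byte-wise matching
cat sample.csv                               # now reads "caf menu"
cat sample.csv.bak                           # the backup keeps the original bytes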


Try this to find non-ASCII characters from the shell.

Command:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt

Output:

2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
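
The one-liner also scales to several files; $. is perl's input line number and $ARGV names the file currently being read (the *.txt glob here is arbitrary):

$ perl -ne 'print "$ARGV:$. $_" if m/[\x80-\xFF]/; close ARGV if eof' *.txt

Closing ARGV at end-of-file resets the line counter, so $. restarts at 1 for each file.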