How to detect invalid UTF-8 Unicode/binary in a text file
Assuming your locale is set to UTF-8 (check the output of locale), this works well for finding invalid UTF-8 sequences:
grep -axv '.*' file.txt
Explanation (from the grep man page):
- -a, --text: treats the file as text. This is essential: without it, GNU grep treats a file containing invalid byte sequences (not UTF-8) as binary and prints only a "Binary file matches" notice instead of the offending lines.
- -v, --invert-match: inverts the output, showing the lines that do not match.
- -x '.*' (--line-regexp): matches only complete lines that consist entirely of valid UTF-8 characters.
Hence the output is exactly the set of lines that contain an invalid (non-UTF-8) byte sequence, because -v inverts the match.
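Since the result depends on the locale, it can help to set it explicitly for the one command. The following is a minimal sketch, assuming GNU grep and at least one installed UTF-8 locale; the function name find_invalid_utf8 and the locale lookup are illustrative assumptions, not part of the answer above. The extra -n prefixes each reported line with its line number.

```shell
# find_invalid_utf8 FILE (hypothetical helper): print "N:line" for every line
# of FILE that is not valid UTF-8. The exit status follows grep: 0 if such
# lines were found, 1 if there were none.
find_invalid_utf8() {
    # Assumption: use the first installed locale whose name mentions UTF-8,
    # falling back to C.UTF-8 if `locale -a` lists none.
    utf8_locale=$(locale -a 2>/dev/null | grep -a -i -E -m 1 'utf-?8')
    LC_ALL=${utf8_locale:-C.UTF-8} grep -naxv '.*' -- "$1"
}

# Usage: find_invalid_utf8 file.txt
```

If no UTF-8 locale is available at all, grep falls back to a byte-oriented locale and reports nothing, so an empty result is only meaningful when the locale was actually applied.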
I would grep for non-ASCII characters.
With GNU grep built with PCRE support (the -P option is not always available; on FreeBSD you can use pcregrep in package pcre2) you can do:
LC_ALL=C grep -P "[\x80-\xFF]" file
Setting LC_ALL=C makes grep match raw bytes. In a UTF-8 locale, GNU grep reads \x80-\xFF as the code points U+0080 to U+00FF instead, which misses most non-ASCII characters.
Reference: How Do I grep For all non-ASCII Characters in UNIX. So, if you only want to check whether the file contains non-ASCII characters, you can add -q (silent grep):
if LC_ALL=C grep -qP "[\x80-\xFF]" file; then echo "file contains non-ASCII characters"; fi
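Where grep has no -P, the same yes/no check can be approximated with tr, which swaps in a different technique from the grep answer: delete every ASCII byte and count what is left. This is only a sketch; the function names count_non_ascii_bytes and has_non_ascii are made up for illustration.

```shell
# count_non_ascii_bytes FILE (hypothetical helper): print how many bytes of
# FILE lie outside the ASCII range 0x00-0x7F (octal 000-177).
count_non_ascii_bytes() {
    # LC_ALL=C makes tr work on single bytes; the final tr strips the padding
    # that some wc implementations add around the number.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# has_non_ascii FILE (hypothetical helper): succeed (exit status 0) if FILE
# contains at least one non-ASCII byte.
has_non_ascii() {
    [ "$(count_non_ascii_bytes "$1")" -gt 0 ]
}

# Usage:
#   if has_non_ascii file; then echo "file contains non-ASCII characters"; fi
```

Note that this counts bytes, not characters: a two-byte UTF-8 character such as ê adds 2 to the count.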
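To remove these characters, you can use:
LC_ALL=C sed -i.bak 's/[\d128-\d255]//g' file
This creates file.bak as a backup, while the original file has its non-ASCII characters removed. The \dNNN decimal escapes are a GNU sed extension, and LC_ALL=C makes sed treat the input as bytes; in a UTF-8 locale this bracket range can be rejected as an invalid collation character. Reference: Remove non-ascii characters from csv.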
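If you would rather leave the original untouched, a variation is to write the cleaned text to a second file. The sketch below uses tr instead of sed, so it is not the command from the answer above; the function name strip_non_ascii and the "removed N bytes" message are illustrative assumptions.

```shell
# strip_non_ascii SRC DST (hypothetical helper): copy SRC to DST, dropping
# every byte in the range 0x80-0xFF (octal 200-377), and report how many
# bytes were dropped. SRC is never modified.
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377' < "$1" > "$2"
    before=$(wc -c < "$1" | tr -d ' ')
    after=$(wc -c < "$2" | tr -d ' ')
    echo "removed $((before - after)) bytes"
}

# Usage: strip_non_ascii file file.ascii
```

As with the sed command, multi-byte characters are deleted outright rather than transliterated, so "être" becomes "tre".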
Try this to find non-ASCII characters from the shell. It prints the line number ($.) followed by each line that contains a byte in the range 0x80-0xFF.
Command:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
Output:
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不