How to detect invalid UTF-8 Unicode/binary in a text file

Assuming your locale is set to UTF-8 (check the output of the locale command), this works well for recognizing invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from grep man page):

-a, --text: treats the file as text; this is essential because it prevents grep from aborting as soon as it finds an invalid (non-UTF-8) byte sequence
-v, --invert-match: inverts the match, showing the lines that do not match
-x '.*' (--line-regexp): matches only complete lines that consist entirely of valid UTF-8 characters

Hence, the output consists of the lines that contain an invalid (non-UTF-8) byte sequence, because -v inverts the match.
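The grep approach depends on the current locale being UTF-8. As a locale-independent cross-check, here is a minimal sketch that swaps in a different technique: it asks iconv to decode the file as UTF-8 and looks only at the exit status. The function name is_valid_utf8 is hypothetical, and the sketch assumes an iconv that exits non-zero on an illegal input sequence (as glibc's does).

```shell
#!/bin/sh
# is_valid_utf8 FILE   (hypothetical helper name, not from the original answer)
# Exit status 0 if iconv can decode FILE as UTF-8, non-zero otherwise.
# Assumption: iconv fails on illegal or incomplete byte sequences.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example usage (commented out so the block has no side effects):
#   is_valid_utf8 file.txt || grep -naxv '.*' file.txt
```

In the usage line, -n is added to the grep command so that the offending lines are printed with their line numbers once iconv has reported a problem.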

linux bash utf-8 character-encoding

I would grep for non-ASCII characters.

With GNU grep built with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package), you can do:

grep -P "[\x80-\xFF]" file

Reference: "How Do I grep For all non-ASCII Characters in UNIX". So, if you only want to check whether the file contains non-ASCII characters, you can just say:

# -q makes grep silent; only its exit status is used
if grep -qP "[\x80-\xFF]" file; then echo "file contains non-ASCII characters"; fi
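Where grep -P is not available, a similar check can be sketched with tr and wc alone. The function name count_non_ascii_bytes is hypothetical; the sketch assumes that in the C locale tr operates on single bytes, so deleting the range 0x00-0x7F leaves only the non-ASCII bytes to count.

```shell
#!/bin/sh
# count_non_ascii_bytes FILE   (hypothetical helper name)
# Prints the number of bytes in FILE with a value of 0x80 or above.
count_non_ascii_bytes() {
    # Delete every ASCII byte (octal 000-177), count what remains,
    # and strip the padding that some wc implementations add.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Example usage (commented out so the block has no side effects):
#   [ "$(count_non_ascii_bytes file)" -gt 0 ] && echo "file contains non-ASCII characters"
```

Note that this counts bytes, not characters: a two-byte UTF-8 character such as é contributes 2 to the result.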

To remove these characters, you can use:

sed -i.bak 's/[\d128-\d255]//g' file

This will create file.bak as a backup, while the original file will have its non-ASCII characters removed. Reference: "Remove non-ascii characters from csv".

Try this, in order to find non-ASCII characters from the shell.

Command:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt

Output:

2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
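For systems without Perl, roughly the same listing can be sketched with plain grep. The function name list_non_ascii_lines is hypothetical; the sketch assumes that grep in the C locale treats each byte as one character, so a bracket range from octal 200 to 377 matches any non-ASCII byte. Output uses grep's "number:line" format rather than Perl's "number line".

```shell
#!/bin/sh
# list_non_ascii_lines FILE   (hypothetical helper name)
# Prints "N:line" for every line of FILE that contains a byte >= 0x80.
list_non_ascii_lines() {
    # Build the bracket expression [\x80-\xFF] from octal escapes,
    # which POSIX printf supports (unlike \x escapes).
    pattern=$(printf '[\200-\377]')
    LC_ALL=C grep -n "$pattern" "$1"
}

# Example usage (commented out so the block has no side effects):
#   list_non_ascii_lines utf8.txt
```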

CodeHunter
