How do I grep for all non-ASCII characters? [unix]

How do I grep for all non-ASCII characters?


You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

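This prints the line number of each match and highlights the non-ASCII characters in red.

On some systems, depending on your locale settings, the above will not work (in a UTF-8 locale the \x80-\xFF escapes may be read as code points rather than raw bytes), so you can grep for the inverse instead: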

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
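A small self-contained check of the two patterns, as a sketch only: it assumes a grep that supports -P, and it forces the C locale with LC_ALL=C so that the escapes are treated as raw bytes. The sample file and its contents are made up for illustration.

```shell
# Build a throwaway sample file (hypothetical content): line 1 is pure ASCII,
# lines 2 and 3 contain UTF-8 encoded non-ASCII characters (é and €).
# Octal escapes keep this script itself pure ASCII.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\nprice \342\202\254 5\n' > "$sample"

# Byte-range form: match any byte in 0x80-0xFF.
LC_ALL=C grep --color='auto' -P -n '[\x80-\xFF]' "$sample"

# Inverse form: match any byte outside the ASCII range 0x00-0x7F.
LC_ALL=C grep --color='auto' -P -n '[^\x00-\x7F]' "$sample"
```

Under those assumptions both commands should report lines 2 and 3 and skip line 1. Without LC_ALL=C the result depends on the locale of your shell.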

Note also that the important bit is the -P flag, which is short for --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The man page also says that

this is highly experimental and grep -P may warn of unimplemented features.


Rather than making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters.

So the first solution for instance would become:

grep --color='auto' -P -n '[^\x00-\x7F]' file.xml

(which basically greps for any character outside the ASCII range, written in hexadecimal as \x00 up to \x7F)

On OS X Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
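One way to cope with the difference between systems is to probe for a working tool first. The sketch below is only an illustration: find_non_ascii is a hypothetical helper name, and it assumes that either a grep accepting -P or a pcregrep is on the PATH.

```shell
# find_non_ascii FILE: print numbered lines containing bytes outside 0x00-0x7F.
# Hypothetical helper; tries grep -P first, then falls back to pcregrep.
find_non_ascii() {
    if printf 'x\n' | grep -q -P 'x' 2>/dev/null; then
        # grep accepts -P (e.g. GNU grep built with PCRE support)
        LC_ALL=C grep -n -P '[^\x00-\x7F]' "$1"
    elif command -v pcregrep >/dev/null 2>&1; then
        # no usable -P (e.g. BSD grep), but pcregrep is installed
        LC_ALL=C pcregrep -n '[^\x00-\x7F]' "$1"
    else
        echo "find_non_ascii: need grep -P or pcregrep" >&2
        return 2
    fi
}
```

Usage would be find_non_ascii file.xml. As with grep itself, the exit status is non-zero when nothing matches.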

Any pros or cons that anyone can think of?


The following works for me:

grep -P "[\x80-\xFF]" file.xml

When looking at bytes, non-ASCII characters start at 0x80 and go up to 0xFF. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xHH hex escapes in character classes to accomplish what you want.
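To see why a byte range is enough, it can help to dump the bytes of a non-ASCII character. A quick sketch using od (the sample word is arbitrary, and UTF-8 encoding is assumed): every byte of a multi-byte UTF-8 character is 0x80 or above, so each of them falls inside the [\x80-\xFF] class.

```shell
# é (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
# Octal escapes are used so the script itself stays pure ASCII.
printf 'caf\303\251' | od -An -tx1
# hex bytes printed: 63 61 66 c3 a9 (exact spacing varies between od versions)
```

The first three bytes are plain ASCII letters; only the last two are in the 0x80-0xFF range that the pattern matches.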