What's the best way to identify unicode encoded text files in Windows?


See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”

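  • UTF-8 can be detected by validation: data in another encoding that contains non-ASCII bytes rarely decodes as valid UTF-8. You can also look for the BOM EF BB BF, but don't rely on it, since many UTF-8 files are written without one.
  • UTF-16 can be detected by looking for the BOM (FF FE for little-endian, FE FF for big-endian).
  • UTF-32 can be detected by validation, or by the BOM (FF FE 00 00 for little-endian, 00 00 FE FF for big-endian). Check for it before UTF-16, because the little-endian UTF-32 BOM begins with the little-endian UTF-16 BOM.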
  • Otherwise assume the ANSI code page.
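
The checks in the list above can be sketched in Python roughly as follows. This is only a minimal sketch: the function name `guess_encoding` is made up for illustration, the `cp1252` fallback is an assumption about which ANSI code page your files use, and BOM-less UTF-16/UTF-32 is not handled.

```python
import codecs

# BOMs are tested in this order on purpose: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so UTF-32 has to be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a codec name for `data` (hypothetical helper, not a library API)."""
    # 1. A BOM, if present, identifies the encoding directly.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: validate as UTF-8 (pure ASCII also passes this check).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise assume the ANSI code page (cp1252 is an assumption).
        return fallback
```

To use it, read the file in binary mode, e.g. `guess_encoding(open(path, "rb").read())`, and pass the returned name to `bytes.decode`.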

Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.
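
If grep is not handy on Windows, a short Python script can do the same BOM search. This is a rough sketch, assuming you want every file under a root directory checked; `files_with_bom` is a name invented here, not an existing tool.

```python
import codecs
import os

# All Unicode BOMs; the longest one is 4 bytes, so reading 4 bytes is enough.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def files_with_bom(root):
    """Yield paths under `root` whose first bytes are a Unicode BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    head = f.read(4)
            except OSError:
                continue  # unreadable file: skip it
            # bytes.startswith accepts a tuple of prefixes.
            if head.startswith(BOMS):
                yield path
```

For example, `for p in files_with_bom("."): print(p)` lists the candidates in the current directory tree.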

Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
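
A quick way to confirm this for a given file, sketched in Python (the helper name `is_pure_ascii` is made up for illustration; `bytes.isascii()` needs Python 3.7 or later):

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the data is plain ASCII."""
    return data.isascii()


# ASCII-only text encodes to the same bytes in ASCII and in UTF-8.
text = "int main(void) { return 0; }"
assert text.encode("utf-8") == text.encode("ascii")
```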


Unicode is a standard, not an encoding. Many encodings implement Unicode, including UTF-8, UTF-16, UCS-2, and others. How any of these translates to ASCII depends entirely on which encoding your "different editors" use.

Some editors insert byte-order marks (BOMs) at the start of Unicode files. If your editors do that, you can use the BOMs to detect the encoding.

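ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding in Windows is not an ANSI standard: it is the system's default Windows code page, which is CP-1252 on Western European and US English systems. MS-DOS used separate OEM code pages such as CP-437.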

Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.


Actually, if you want to find out in Windows whether a file is Unicode (meaning UTF-16, which is what Windows tools usually call "Unicode"), simply run findstr on the file for a string you know is in there.

findstr /I /C:"SomeKnownString" file.txt

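If the file is UTF-16, it will come back empty, because each ASCII character is stored with an extra zero byte, so the string never appears as consecutive bytes. Then to be sure, run findstr on a single letter or digit you know is in the file:

findstr /I /C:"P" file.txt

You will probably get many occurrences, and the key is that the characters in the output will be spaced apart. This is a sign that the file is UTF-16 and not ASCII.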
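
To see why the first search fails and the second succeeds, here is a small Python illustration, plus a rough heuristic built on the same idea. The `looks_like_utf16` name and the 0.3 threshold are assumptions for this sketch; it only works for UTF-16 text that is mostly ASCII.

```python
needle = "SomeKnownString"
utf16 = needle.encode("utf-16-le")  # S, 0x00, o, 0x00, m, 0x00, ...

# The ASCII byte sequence is not found, because a zero byte sits between letters.
print(needle.encode("ascii") in utf16)  # False
# A single letter is still found.
print(b"S" in utf16)  # True


def looks_like_utf16(data: bytes, threshold: float = 0.3) -> bool:
    """Heuristic: many NUL bytes suggest UTF-16 text that is mostly ASCII."""
    if not data:
        return False
    return data.count(0) / len(data) >= threshold
```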

Hope this helps.