What's the best way to identify Unicode-encoded text files in Windows?
See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”
- UTF-8 can be detected with validation. You can also look for the BOM (EF BB BF), but don't rely on it.
- UTF-16 can be detected by looking for the BOM.
- UTF-32 can be detected by validation, or by the BOM.
- Otherwise assume the ANSI code page.
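The checks in the list above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive detector: `guess_encoding` and the `fallback` parameter are hypothetical names, `cp1252` is only an assumed stand-in for "the ANSI code page", and UTF-32 is recognised by BOM only, not by validation.

```python
import codecs

# BOM table. The UTF-32 entries come first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a best-guess codec name for the raw bytes of a text file."""
    # 1. A BOM, if present, names the encoding directly.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: accept UTF-8 only if the bytes pass strict validation.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Otherwise assume the (hypothetical) ANSI code page.
        return fallback
    return "utf-8"
```

For a file on disk, something like `guess_encoding(open(path, "rb").read())` would apply it. Note that pure ASCII input is reported as `utf-8`, since ASCII is a subset of UTF-8.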
Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.
Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
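For a codebase that is expected to be pure ASCII, a scan along these lines might flag the exceptions. It is a sketch under assumptions: `classify`, `scan`, and the `*.txt` pattern are hypothetical names and defaults chosen for illustration, not part of any existing tool.

```python
import codecs
from pathlib import Path

# All BOMs worth flagging; order does not matter for a yes/no check.
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
        codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def classify(data: bytes) -> str:
    """Label raw file bytes as 'bom', 'ascii', or 'non-ascii'."""
    if data.startswith(BOMS):
        return "bom"
    return "ascii" if data.isascii() else "non-ascii"


def scan(root: str, pattern: str = "*.txt"):
    """Yield (path, label) for every matching file that is not plain ASCII."""
    for path in Path(root).rglob(pattern):
        if path.is_file():
            label = classify(path.read_bytes())
            if label != "ascii":
                yield path, label
```

Files labelled `ascii` are byte-for-byte identical whether they are read as ASCII or as UTF-8, so only the `bom` and `non-ascii` results need a closer look.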
Unicode is a standard, not an encoding. Several encodings implement Unicode, including UTF-8, UTF-16, UCS-2, and others. How any of these translates to ASCII depends entirely on which encoding your "different editors" use.
Some editors insert byte-order marks (BOMs) at the start of Unicode files. If your editors do that, you can use the BOMs to detect the encoding.
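ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding in Windows is actually the system's default code page, typically Windows-1252 (CP-1252) on Western European and US systems, and it is not an ANSI standard. MS-DOS used separate OEM code pages such as CP437.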
Does your codebase include non-ASCII characters? If so, a Unicode encoding such as UTF-8 will likely give you better compatibility than CP-1252 or another legacy code page.
Actually, if you want to find out in Windows whether a file is Unicode (UTF-16, which is what Windows tools usually mean by "Unicode"), simply run findstr on the file for a string you know is in there.
findstr /I /C:"SomeKnownString" file.txt
If the file is Unicode, the search will come back empty. Then, to be sure, run findstr on a single letter or digit you know is in the file:
findstr /I /C:"P" file.txt
You will probably get many matches, and the key is that the characters in the output will be spaced apart. This is a sign that the file is Unicode and not ASCII.
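The reason this trick works can be illustrated in Python, assuming findstr compares raw bytes (which the behaviour described above suggests) and that the file is UTF-16 LE. The string is the same placeholder used in the command above.

```python
text = "SomeKnownString"

# UTF-16 LE stores each ASCII character as the character byte plus a NUL byte.
utf16 = text.encode("utf-16-le")
print(utf16[:8])  # b'S\x00o\x00m\x00e\x00'

# The contiguous ASCII byte sequence never appears in the UTF-16 data,
# so a multi-character search finds nothing.
multi_char_found = text.encode("ascii") in utf16
print(multi_char_found)  # False

# A single letter is still present as a lone byte, so it does match.
single_char_found = b"P" in "P".encode("utf-16-le")
print(single_char_found)  # True
```

The interleaved NUL bytes are also what makes the matched lines look "spaced apart" when they are printed to the console.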
Hope this helps.