What encoding/code page is cmd.exe using?



Yes, it’s frustrating: sometimes type and other programs print gibberish, and sometimes they do not.

First of all, Unicode characters will only display if the current console font contains the characters. So use a TrueType font like Lucida Console instead of the default Raster Font.

But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.

When programs use standard C-library I/O functions like printf, the program’s output encoding must match the console’s output encoding, or you will get gibberish. chcp shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by chcp.

Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:

  • A program can get the console’s current codepage using chcp or GetConsoleOutputCP, and configure itself to output in that encoding, or

  • You or a program can set the console’s current codepage using chcp or SetConsoleOutputCP to match the default output encoding of the program.
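As a rough illustration of the first option, here is a minimal Java sketch. It is not part of the original test code: the class ConsoleCharset and its methods are hypothetical names, the code page number is assumed to come from somewhere else (for example, parsed from the output of chcp), and the Cp/IBM/windows- name lookups are an assumption about which charset names a given JVM recognises.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a Windows code page number to a Java Charset
// and wraps stdout so that text is encoded in that charset.
final class ConsoleCharset {

    static Charset forCodePage(int codePage) {
        if (codePage == 65001) {
            return StandardCharsets.UTF_8; // 65001 is the UTF-8 code page
        }
        // Assumption: the JVM knows the code page under one of these names.
        String[] candidates = {
            "Cp" + codePage, "IBM" + codePage, "windows-" + codePage };
        for (String name : candidates) {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        // No match: fall back to whatever the JVM uses by default.
        return Charset.defaultCharset();
    }

    static PrintStream streamFor(int codePage)
            throws UnsupportedEncodingException {
        return new PrintStream(new FileOutputStream(FileDescriptor.out),
                true, forCodePage(codePage).name());
    }

    public static void main(String[] args) throws Exception {
        // The code page is taken from the command line; default to UTF-8.
        int codePage = args.length > 0 ? Integer.parseInt(args[0]) : 65001;
        PrintStream out = streamFor(codePage);
        out.println("German    \u00e4\u00f6\u00fc \u00c4\u00d6\u00dc \u00df");
    }
}
```

Pass the console's code page number as the first argument. If the JVM has no matching charset, this sketch silently falls back to the default charset, which a real program would probably want to report instead.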

However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with WriteConsoleW. This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to MultiByteToWideChar. Also, WriteConsoleW will not work if the program’s output is redirected; more fiddling is needed in that case.
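The conversion step is easy to model outside Win32. The sketch below is in Java rather than C, and toUtf16Le is a hypothetical helper, not a wrapper around MultiByteToWideChar; it only illustrates why the source encoding has to be stated correctly before text is turned into UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ToWide {

    // Decode `input` using the encoding it is claimed to be in,
    // then re-encode the resulting text as UTF-16LE.
    static byte[] toUtf16Le(byte[] input, Charset sourceEncoding) {
        String text = new String(input, sourceEncoding);
        return text.getBytes(StandardCharsets.UTF_16LE);
    }

    // Formats bytes as space-separated hex pairs, e.g. "E4 00".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+00E4 (a with diaeresis) is the two bytes C3 A4 in UTF-8.
        byte[] utf8 = "\u00e4".getBytes(StandardCharsets.UTF_8);

        // Correct source encoding: one UTF-16 code unit.
        System.out.println(hex(toUtf16Le(utf8, StandardCharsets.UTF_8)));

        // Wrong source encoding: each byte becomes its own character.
        System.out.println(hex(toUtf16Le(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```

With the correct source encoding the character becomes the two bytes E4 00; when the same input is declared as ISO-8859-1 it becomes two characters, C3 00 A4 00, which is the kind of damage a wrong codepage argument causes.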

type works some of the time because it checks the start of each file for a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE. If it finds such a mark, it displays the Unicode characters in the file using WriteConsoleW regardless of the current codepage. But when using type on any file without a UTF-16LE BOM, or when using non-ASCII characters with any command that doesn’t call WriteConsoleW, you will need to set the console codepage and program output encoding to match each other.


How can we find this out?

Here’s a test file containing Unicode characters:

    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好

Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to stdout.

    import java.io.*;

    public class Foo {

        private static final String BOM = "\ufeff";

        private static final String TEST_STRING
            = "ASCII     abcde xyz\n"
            + "German    äöü ÄÖÜ ß\n"
            + "Polish    ąęźżńł\n"
            + "Russian   абвгдеж эюя\n"
            + "CJK       你好\n";

        public static void main(String[] args)
            throws Exception
        {
            String[] encodings = new String[] {
                "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };

            for (String encoding: encodings) {
                System.out.println("== " + encoding);

                for (boolean writeBom: new Boolean[] {false, true}) {
                    System.out.println(writeBom ? "= bom" : "= no bom");

                    String output = (writeBom ? BOM : "") + TEST_STRING;
                    byte[] bytes = output.getBytes(encoding);
                    System.out.write(bytes);

                    FileOutputStream out = new FileOutputStream("uc-test-"
                        + encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
                    out.write(bytes);
                    out.close();
                }
            }
        }
    }

The output in the default codepage? Total garbage!

    Z:\andrew\projects\sx\1259084>chcp
    Active code page: 850

    Z:\andrew\projects\sx\1259084>java Foo
    == UTF-8
    = no bom
    ASCII     abcde xyz
    German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
    Polish    ─à─Ö┼║┼╝┼ä┼é
    Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
    CJK       õ¢áÕÑ¢
    = bom
    ´╗┐ASCII     abcde xyz
    German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
    Polish    ─à─Ö┼║┼╝┼ä┼é
    Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
    CJK       õ¢áÕÑ¢
    == UTF-16LE
    = no bom
    A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h         ♣☺↓☺z☺|☺D☺B☺
     R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
     C J K               `O}Y
     = bom
     ■A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h         ♣☺↓☺z☺|☺D☺B☺
     R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
     C J K               `O}Y
     == UTF-16BE
    = no bom
     A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h        ☺♣☺↓☺z☺|☺D☺B
     R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
     C J K              O`Y}
    = bom
    ■  A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h        ☺♣☺↓☺z☺|☺D☺B
     R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
     C J K              O`Y}
    == UTF-32LE
    = no bom
    A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
       R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N♦  O♦
       C   J   K                               `O  }Y
       = bom
     ■  A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
       R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N♦  O♦
       C   J   K                               `O  }Y
       == UTF-32BE
    = no bom
       A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
       R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N  ♦O
       C   J   K                              O`  Y}
    = bom
      ■    A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
       R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N  ♦O
       C   J   K                              O`  Y}
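The garbage in the UTF-8 section is what you get when UTF-8 bytes are decoded as code page 850. The following Java sketch reproduces the effect without a console; the class and method names are hypothetical, and it assumes the JVM provides the IBM850 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Mojibake {

    // Encode text as UTF-8, then decode those bytes as if they were
    // written in `consoleCharset`, which is what a mismatched console does.
    static String asSeenOnConsole(String text, Charset consoleCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, consoleCharset);
    }

    public static void main(String[] args) {
        // Assumption: this JVM ships the IBM850 (code page 850) charset.
        Charset cp850 = Charset.forName("IBM850");

        // The three lowercase umlauts from the German test line.
        String seen = asSeenOnConsole("\u00e4\u00f6\u00fc", cp850);

        // Print code points as escapes so the result does not depend
        // on the codepage of the console running this sketch.
        for (char c : seen.toCharArray()) {
            System.out.printf("\\u%04x ", (int) c);
        }
        System.out.println();
    }
}
```

Each two-byte UTF-8 sequence turns into two unrelated code page 850 characters, for example a box-drawing character followed by ñ for the first umlaut, which matches the start of the German line in the transcript.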

However, what if we type the files that got saved? They contain the exact same bytes that were printed to the console.

    Z:\andrew\projects\sx\1259084>type *.txt

    uc-test-UTF-16BE-bom.txt
    ■  A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h        ☺♣☺↓☺z☺|☺D☺B
     R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
     C J K              O`Y}

    uc-test-UTF-16BE-nobom.txt
     A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h        ☺♣☺↓☺z☺|☺D☺B
     R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
     C J K              O`Y}

    uc-test-UTF-16LE-bom.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好

    uc-test-UTF-16LE-nobom.txt
    A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h         ♣☺↓☺z☺|☺D☺B☺
     R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
     C J K               `O}Y

    uc-test-UTF-32BE-bom.txt
      ■    A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
       R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N  ♦O
       C   J   K                              O`  Y}

    uc-test-UTF-32BE-nobom.txt
       A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
       R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N  ♦O
       C   J   K                              O`  Y}

    uc-test-UTF-32LE-bom.txt
     A S C I I           a b c d e   x y z
     G e r m a n         ä ö ü   Ä Ö Ü   ß
     P o l i s h         ą ę ź ż ń ł
     R u s s i a n       а б в г д е ж   э ю я
     C J K               你 好

    uc-test-UTF-32LE-nobom.txt
    A   S   C   I   I                       a   b   c   d   e       x   y   z
       G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
       P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
       R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N♦  O♦
       C   J   K                               `O  }Y

    uc-test-UTF-8-bom.txt
    ´╗┐ASCII     abcde xyz
    German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
    Polish    ─à─Ö┼║┼╝┼ä┼é
    Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
    CJK       õ¢áÕÑ¢

    uc-test-UTF-8-nobom.txt
    ASCII     abcde xyz
    German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
    Polish    ─à─Ö┼║┼╝┼ä┼é
    Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
    CJK       õ¢áÕÑ¢

The only thing that works is a UTF-16LE file, with a BOM, printed to the console via type.
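To see what distinguishes that one file, it can help to look at the bytes each encoding produces for the BOM character U+FEFF. This is a small Java sketch using the same encoding names as the test program above; the class name BomBytes is hypothetical.

```java
import java.nio.charset.Charset;

final class BomBytes {

    // Returns the bytes that U+FEFF becomes in the given encoding,
    // formatted as space-separated hex pairs.
    static String bomHex(String encoding) {
        byte[] bytes = "\ufeff".getBytes(Charset.forName(encoding));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] encodings = {
            "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };
        for (String encoding : encodings) {
            System.out.println(encoding + ": " + bomHex(encoding));
        }
    }
}
```

Only UTF-16LE and UTF-32LE begin with FF FE. That is consistent with the transcript, where the UTF-32LE file with a BOM is the only other file whose letters come out right, just spaced apart as if it had been read as UTF-16LE.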

If we use anything other than type to print the file, we get garbage:

    Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
     ■A S C I I           a b c d e   x y z
     G e r m a n         õ ÷ ³   ─ Í ▄   ▀
     P o l i s h         ♣☺↓☺z☺|☺D☺B☺
     R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
     C J K               `O}Y
             1 file(s) copied.

From the fact that copy CON does not display Unicode correctly, we can conclude that the type command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.

We can see this by opening cmd.exe in a debugger when it goes to type out a file:

[Screenshot: cmd.exe in a debugger while type is reading a file]

After type opens a file, it checks for a BOM of 0xFEFF (i.e., the bytes 0xFF 0xFE in little-endian), and if there is such a BOM, type sets an internal fOutputUnicode flag. This flag is checked later to decide whether to call WriteConsoleW.
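The check itself can be modelled in a few lines. This is a Java sketch of the idea only, not cmd.exe's actual code; BomCheck and looksLikeUtf16Le are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BomCheck {

    // Reads the first two bytes as one little-endian 16-bit value
    // and compares it with 0xFEFF, the BOM code point.
    static boolean looksLikeUtf16Le(byte[] fileStart) {
        if (fileStart.length < 2) {
            return false; // too short to hold a BOM
        }
        int first = ByteBuffer.wrap(fileStart, 0, 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort() & 0xFFFF;
        return first == 0xFEFF;
    }

    public static void main(String[] args) {
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, 'A', 0 };
        byte[] utf16be = { (byte) 0xFE, (byte) 0xFF, 0, 'A' };
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A' };

        System.out.println(looksLikeUtf16Le(utf16le)); // true
        System.out.println(looksLikeUtf16Le(utf16be)); // false
        System.out.println(looksLikeUtf16Le(utf8));    // false
    }
}
```

Because only the first two bytes are inspected, a big-endian or UTF-8 BOM never matches, which fits the transcripts above where those files are shown according to the current codepage.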

But that’s the only way to get type to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.

You can emulate how type outputs Unicode to the console in your own programs like so:

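    #include <stdio.h>
    #include <string.h>
    #define UNICODE
    #include <windows.h>

    /* Assumes the compiler stores this literal as UTF-8 bytes. */
    static LPCSTR lpcsTest =
        "ASCII     abcde xyz\n"
        "German    äöü ÄÖÜ ß\n"
        "Polish    ąęźżńł\n"
        "Russian   абвгдеж эюя\n"
        "CJK       你好\n";

    int main() {
        int n;
        DWORD written;
        wchar_t buf[1024];

        HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

        /* Convert UTF-8 to UTF-16LE. The last argument is the buffer
           size in wchar_t units, not in bytes. */
        n = MultiByteToWideChar(CP_UTF8, 0,
                lpcsTest, (int)strlen(lpcsTest),
                buf, sizeof(buf) / sizeof(buf[0]));

        /* With UNICODE defined, WriteConsole resolves to WriteConsoleW. */
        WriteConsole(hConsole, buf, (DWORD)n, &written, NULL);

        return 0;
    }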

This program works for printing Unicode on the Windows console using the default codepage.


For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:

    Z:\andrew\projects\sx\1259084>chcp 65001
    Active code page: 65001

    Z:\andrew\projects\sx\1259084>java Foo
    == UTF-8
    = no bom
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好
    ж эюя
    CJK       你好
     你好
    好
    �
    = bom
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好
    еж эюя
    CJK       你好
      你好
    好
    �
    == UTF-16LE
    = no bom
    A S C I I           a b c d e   x y z
    …

However, a C program that sets a Unicode UTF-8 codepage:

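    #include <stdio.h>
    #include <windows.h>

    int main() {
        size_t n;
        UINT oldCodePage;
        char buf[1024];

        /* Remember the console's codepage so it can be restored later. */
        oldCodePage = GetConsoleOutputCP();
        if (!SetConsoleOutputCP(65001)) {
            printf("error\n");
        }

        /* Copy the raw UTF-8 bytes of the test file to stdout. */
        if (freopen("uc-test-UTF-8-nobom.txt", "rb", stdin) == NULL) {
            printf("error\n");
            SetConsoleOutputCP(oldCodePage);
            return 1;
        }
        n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
        fwrite(buf, sizeof(buf[0]), n, stdout);

        /* Flush before restoring, so nothing is written under the old codepage. */
        fflush(stdout);
        SetConsoleOutputCP(oldCodePage);

        return 0;
    }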

does have correct output:

    Z:\andrew\projects\sx\1259084>.\test
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好

The moral of the story?

  • type can print UTF-16LE files with a BOM regardless of your current codepage.
  • Win32 programs can be programmed to output Unicode to the console, using WriteConsoleW.
  • Other programs which set the codepage and adjust their output encoding accordingly can print Unicode on the console regardless of what the codepage was when the program started.
  • For everything else you will have to mess around with chcp, and will probably still get weird output.


Type

chcp

to see your current code page (as Dewfy already said).

Use

nlsinfo

to see all installed code pages and find out what your code page number means.

You need to have the Windows Server 2003 Resource Kit installed (it also works on Windows XP) to use nlsinfo.


To answer your second query re. how encoding works, Joel Spolsky wrote a great introductory article on this. Strongly recommended.