
Reading UTF-8 from stdin using fgets() on Windows


Windows uses UTF-16 internally. Most likely the Windows console does not fully support UTF-8 input.

Use UTF-16 along with the wide string functions (wcsxxx instead of strxxx). You can then use WideCharToMultiByte to convert UTF-16 to UTF-8. Example:

#include <stdio.h>
#include <string.h>
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);
    wchar_t s[64];
    //only print if a line was actually read
    if (fgetws(s, 64, stdin))
        _putws(s);
    return 0;
}

Note that you can't use ANSI print functions after calling _setmode(_fileno(stdout), _O_U16TEXT); the mode has to be reset first. You may try something like the functions below, which save and restore the text mode.

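#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
#include <windows.h> //for WideCharToMultiByte, MultiByteToWideChar

//reads one line as UTF-16 and returns it as a malloc'd UTF-8 string
//returns NULL on failure; the caller frees the result
char* mygets(int wlen)
{
    //may require fflush here, see _setmode documentation
    int save = _setmode(_fileno(stdin), _O_U16TEXT);
    char* str = NULL;
    wchar_t* wstr = malloc(wlen * sizeof(wchar_t));
    if (wstr && fgetws(wstr, wlen, stdin))
    {
        //make UTF-8: the first call returns the required size, terminator included
        int len = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 0, 0, 0, 0);
        if (len && (str = malloc(len)) != NULL)
            WideCharToMultiByte(CP_UTF8, 0, wstr, -1, str, len, 0, 0);
    }
    free(wstr);
    //restore the previous mode on every path
    _setmode(_fileno(stdin), save);
    return str;
}

//prints a UTF-8 string by converting it to UTF-16 first
void myputs(const char* str)
{
    //may require fflush here, see _setmode documentation
    int save = _setmode(_fileno(stdout), _O_U16TEXT);
    //make UTF-16: the first call returns the required size, terminator included
    int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, 0, 0);
    wchar_t* wstr = wlen ? malloc(wlen * sizeof(wchar_t)) : NULL;
    if (wstr)
    {
        MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);
        _putws(wstr);
        free(wstr);
    }
    _setmode(_fileno(stdout), save);
}

int main()
{
    char* utf8 = mygets(100);
    if (utf8)
    {
        myputs(utf8);
        free(utf8);
    }
    return 0;
}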


Almost all native Windows string handling (with very rare exceptions) is done in Unicode (UTF-16), so we should use the Unicode functions everywhere. Using the ANSI variants is very bad practice. If you use the Unicode functions in your example, everything will work correctly. With the ANSI functions it does not work because of a Windows bug! I can cover this in full detail (researched on Windows 8.1):

1) The console server process has two global variables:

UINT gInputCodePage, gOutputCodePage;

They can be read and written with GetConsoleCP/SetConsoleCP and GetConsoleOutputCP/SetConsoleOutputCP. They are used as the first argument to WideCharToMultiByte/MultiByteToWideChar when a conversion is needed. If you use only the Unicode functions, they are never used.
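As a rough illustration of how a client process can inspect these two values, here is a minimal sketch. The helper name console_code_pages is hypothetical; GetConsoleCP and GetConsoleOutputCP are the real Win32 calls. The non-Windows branch is only there so the snippet compiles elsewhere and is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: fetches the console input and output code pages.
   Returns 1 on Windows, 0 on other platforms (both values set to 0). */
int console_code_pages(unsigned *in_cp, unsigned *out_cp)
{
    assert(in_cp != NULL && out_cp != NULL);
#ifdef _WIN32
    *in_cp = GetConsoleCP();         /* code page used for console input */
    *out_cp = GetConsoleOutputCP();  /* code page used for console output */
    return 1;
#else
    *in_cp = 0;
    *out_cp = 0;
    return 0;
#endif
}
```

A value of 65001 (CP_UTF8) for the input code page is the case where the ANSI read path described in this answer runs into trouble.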

2.a) When you write Unicode text to the console, it is written as is, without any conversion. On the server side this is done in the SB_DoSrvWriteConsole function. [screenshot omitted]

2.b) When you write ANSI text to the console, SB_DoSrvWriteConsole is also called, but with one additional step: MultiByteToWideChar(gOutputCodePage, ...), so your text is converted to Unicode first. [screenshot omitted]

But there is one detail here. [screenshot omitted] In the MultiByteToWideChar call, cchWideChar == cbMultiByte. If we use only the 'English' charset (chars < 0x80), the Unicode and multibyte strings always have the same length in chars. With other languages the multibyte version usually uses more chars than the Unicode one, but here that is not a problem: the output buffer is simply larger than needed, which is fine. So printf will in general work correctly. One note only: if you hardcode a multibyte string in source code, it will most likely be in CP_ACP form, and converting it to Unicode with CP_UTF8 gives an incorrect result. So this depends on the encoding in which your source file is saved on disk :)
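The buffer-size argument for the write direction can be checked with a small portable sketch, assuming the output code page is UTF-8 and the input is well formed. The helper utf16_units_for_utf8 is hypothetical (it is not a Windows API and does not validate its input); it only counts how many UTF-16 code units a UTF-8 string needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of UTF-16 code units needed to hold a
   well-formed UTF-8 string of nbytes bytes. The input is not validated. */
size_t utf16_units_for_utf8(const char *s, size_t nbytes)
{
    size_t units = 0;
    assert(s != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++) {
        unsigned char b = (unsigned char)s[i];
        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte, already counted */
        /* a 4-byte sequence needs a surrogate pair, everything else one unit */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

Every lead byte contributes at most two code units and a 4-byte sequence is the only case that contributes two, so the result never exceeds nbytes. That is why a wide buffer of cbMultiByte elements is always large enough when converting from UTF-8 to UTF-16.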

3.a) When you read from the console with the Unicode functions, you get the Unicode text exactly as is. There is no problem here. If needed, you can then convert it to multibyte yourself.

3.b) When you read from the console with the ANSI functions, the server first converts the Unicode string to ANSI and then returns the ANSI form to you. This is done by the function

int ConvertToOem(UINT CodePage /*=gInputCodePage*/, PCWSTR lpWideCharStr, int cchWideChar, PSTR lpMultiByteStr, int cbMultiByte)
{
    if (CodePage == g_OEMCP)
    {
        ULONG BytesInOemString;
        return 0 > RtlUnicodeToOemN(lpMultiByteStr, cbMultiByte, &BytesInOemString, lpWideCharStr, cchWideChar * sizeof(WCHAR)) ? 0 : BytesInOemString;
    }
    return WideCharToMultiByte(CodePage, 0, lpWideCharStr, cchWideChar, lpMultiByteStr, cbMultiByte, 0, 0);
}

But let's look more closely at how ConvertToOem is called: [screenshot omitted] Here again cbMultiByte == cchWideChar, but this is 100% a bug! A multibyte string can be longer than the Unicode one (in chars, of course). For example, "Ä" is 1 Unicode char and 2 UTF-8 chars. As a result, WideCharToMultiByte returns 0 (ERROR_INSUFFICIENT_BUFFER).
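To make the failure concrete, here is a minimal portable sketch of the same sizing mistake. utf16_to_utf8 is a hypothetical helper, not the console server's code and not a Windows API. It only imitates two conventions of WideCharToMultiByte: a buffer size of 0 means "return the required size", and a buffer that is too small makes the call return 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encodes nunits UTF-16 code units as UTF-8.
   cap == 0: returns the number of bytes required, writes nothing.
   cap != 0: returns bytes written, or 0 if dst is too small. */
size_t utf16_to_utf8(const uint16_t *src, size_t nunits, char *dst, size_t cap)
{
    size_t out = 0;
    assert(src != NULL || nunits == 0);
    for (size_t i = 0; i < nunits; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < nunits &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* valid surrogate pair: combine into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* unpaired surrogate: substitute U+FFFD */
        }

        unsigned char buf[4];
        size_t n;
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            n = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 4;
        }

        if (cap != 0) {
            if (out + n > cap)
                return 0;             /* buffer too small */
            memcpy(dst + out, buf, n);
        }
        out += n;
    }
    return out;
}
```

For the single code unit 0x00C4 ("Ä"), a buffer sized to the number of UTF-16 units (1 byte) fails, while asking for the required size first gives 2 and the conversion then succeeds. Sizing the buffer from a size query rather than from cchWideChar is what the mygets-style code in the first answer does.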