Reading UTF-8 from stdin using fgets() on Windows

Windows uses UTF-16. Most likely the Windows console doesn't support UTF-8.

Use UTF-16 along with the wide string functions (wcsxxx instead of strxxx). You can then use WideCharToMultiByte to convert UTF-16 to UTF-8. Example:

    #include <stdio.h>
    #include <string.h>
    #include <io.h>    //for _setmode
    #include <fcntl.h> //for _O_U16TEXT

    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);
        _setmode(_fileno(stdin), _O_U16TEXT);
        wchar_t s[64];
        fgetws(s, 64, stdin);
        _putws(s);
        return 0;
    }

Note that you can't use ANSI print functions after calling _setmode(_fileno(stdout), _O_U16TEXT); the mode has to be reset first. You may try something like the functions below, which reset the text mode.

    #include <stdio.h>
    #include <stdlib.h>  //for malloc, free
    #include <io.h>      //for _setmode
    #include <fcntl.h>   //for _O_U16TEXT
    #include <windows.h> //for WideCharToMultiByte, MultiByteToWideChar

    char* mygets(int wlen)
    {
        char* str = NULL;
        wchar_t* wstr = malloc(wlen * sizeof(wchar_t));
        if (!wstr) return NULL;

        //may require fflush here, see _setmode documentation
        int save = _setmode(_fileno(stdin), _O_U16TEXT);
        if (fgetws(wstr, wlen, stdin))
        {
            //make UTF-8:
            int len = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 0, 0, 0, 0);
            if (len && (str = malloc(len)) != NULL)
                WideCharToMultiByte(CP_UTF8, 0, wstr, -1, str, len, 0, 0);
        }
        //always restore the previous mode and release the wide buffer,
        //including on the failure paths
        _setmode(_fileno(stdin), save);
        free(wstr);
        return str;
    }

    void myputs(const char* str)
    {
        //make UTF-16
        int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, 0, 0);
        if (!wlen) return;
        wchar_t* wstr = malloc(wlen * sizeof(wchar_t));
        if (!wstr) return;
        MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);

        //may require fflush here, see _setmode documentation
        int save = _setmode(_fileno(stdout), _O_U16TEXT);
        _putws(wstr);
        _setmode(_fileno(stdout), save);
        free(wstr);
    }

    int main()
    {
        char* utf8 = mygets(100);
        if (utf8)
        {
            myputs(utf8);
            free(utf8);
        }
        return 0;
    }


Almost all native Windows string handling (with very rare exceptions) is done in Unicode (UTF-16), so we should use the Unicode functions everywhere. Using the ANSI variants is very bad practice. If you use the Unicode functions in your example, everything works correctly. With the ANSI functions it does not work because of a Windows bug! I can cover this in full detail (researched on Windows 8.1):

1) The console server process has two global variables:

    UINT gInputCodePage, gOutputCodePage;

They can be read and written with GetConsoleCP/SetConsoleCP and GetConsoleOutputCP/SetConsoleOutputCP. They are used as the first argument to WideCharToMultiByte/MultiByteToWideChar when a conversion is needed. If you use only the Unicode functions, they are never used.

2.a) When you write Unicode text to the console, it is written as is, without any conversion. On the server side this is done in the SB_DoSrvWriteConsole function.

2.b) When you write ANSI text to the console, SB_DoSrvWriteConsole is also called, but with one additional step: MultiByteToWideChar(gOutputCodePage, ...) converts your text to Unicode first. There is one subtlety here: in the MultiByteToWideChar call, cchWideChar == cbMultiByte. If we use only the 'English' charset (chars < 0x80), the lengths of the Unicode and multibyte strings in chars are always equal. In other languages the multibyte version usually uses more chars than the Unicode one, but that is not a problem here: the output buffer is simply larger than needed, which is fine. So your printf will in general work correctly. One note only: if you hard-code a multibyte string in source code, most likely it will be in CP_ACP form, and converting it to Unicode with CP_UTF8 gives an incorrect result. So this depends on the encoding in which your source file is saved on disk :)
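To make the source-encoding note concrete, here is a minimal portable sketch. It assumes a system whose ANSI code page (CP_ACP) is Windows-1252; the helper name looks_like_utf8 and the two byte arrays are hypothetical, introduced only for illustration. It shows that the same literal "Ä" is a different byte sequence depending on how the file was saved, and that the Windows-1252 byte is not well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are structurally
   well-formed UTF-8, else 0. Simplified check: it does not reject
   overlong encodings or encoded surrogates. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra; /* number of continuation bytes expected */
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;                   /* stray continuation or bad lead byte */
        if (extra > n - i - 1) return 0; /* sequence is cut off */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* "Ä" as stored when the source file is saved in Windows-1252 (assumed CP_ACP) */
const unsigned char a_umlaut_cp1252[] = { 0xC4 };
/* "Ä" as stored when the source file is saved in UTF-8 */
const unsigned char a_umlaut_utf8[] = { 0xC3, 0x84 };
```

With these definitions, looks_like_utf8(a_umlaut_utf8, 2) returns 1 and looks_like_utf8(a_umlaut_cp1252, 1) returns 0, which is why treating a CP_ACP literal as CP_UTF8 produces garbage.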

3.a) When you read from the console with the Unicode functions, you get the Unicode text exactly as is. There is no problem here. If needed, you can then convert it to multibyte yourself.

3.b) When you read from the console with the ANSI functions, the server first converts the Unicode string to ANSI and then returns the ANSI form to you. This is done by the function

    int ConvertToOem(UINT CodePage /*=gInputCodePage*/, PCWSTR lpWideCharStr, int cchWideChar, PSTR lpMultiByteStr, int cbMultiByte)
    {
        if (CodePage == g_OEMCP)
        {
            ULONG BytesInOemString;
            return 0 > RtlUnicodeToOemN(lpMultiByteStr, cbMultiByte, &BytesInOemString, lpWideCharStr, cchWideChar * sizeof(WCHAR)) ? 0 : BytesInOemString;
        }
        return WideCharToMultiByte(CodePage, 0, lpWideCharStr, cchWideChar, lpMultiByteStr, cbMultiByte, 0, 0);
    }

But let us look more closely at how ConvertToOem is called: here again cbMultiByte == cchWideChar, and this is 100% a bug! A multibyte string can be longer than the Unicode one (in chars, of course). For example, "Ä" is 1 Unicode char and 2 UTF-8 chars. As a result, WideCharToMultiByte returns 0 (ERROR_INSUFFICIENT_BUFFER).
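The size mismatch can be checked with a small portable sketch. The helpers utf8_bytes_needed and utf8_worst_case are hypothetical names, not part of any Windows API, and counting a lone surrogate as U+FFFD (3 bytes) is an assumption of this sketch. It counts how many UTF-8 bytes a UTF-16 string needs, so the result can be compared with the number of UTF-16 code units the console uses as its buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed to encode n UTF-16
   code units. Assumption: an unpaired surrogate is counted as U+FFFD,
   which takes 3 bytes. */
size_t utf8_bytes_needed(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;
        } else if (c < 0x800) {
            bytes += 2;
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4; /* valid surrogate pair: one code point, 4 bytes */
            i++;        /* skip the low surrogate */
        } else {
            bytes += 3; /* rest of the BMP, or a lone surrogate */
        }
    }
    return bytes;
}

/* Safe upper bound for a UTF-8 buffer: one UTF-16 code unit never needs
   more than 3 bytes (a surrogate pair is 2 units and 4 bytes). */
size_t utf8_worst_case(size_t n)
{
    return n * 3;
}
```

For "Ä" (the single code unit 0x00C4) utf8_bytes_needed returns 2, while a buffer sized as cbMultiByte == cchWideChar has room for only 1 byte. Sizing the buffer with a bound like utf8_worst_case would avoid the failure.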
