What is the encoding of argv?


Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things, which have resolved my question:

  1. As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Using argv is not recommended for modern Windows applications with Unicode support, because code pages are deprecated.

  2. On Unixes, the argv has no fixed encoding:

    a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.

    b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale to decide how to encode IME input, whereas OS X uses Terminal.app's encoding preference.)

This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever representation is used internally for a String (UTF-16 in Java's case). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.

The solution to this problem adopted by Python 3 is the surrogateescape scheme (http://www.python.org/dev/peps/pep-0383/), which represents each undecodable byte in argv as a special Unicode code point (a lone surrogate). When that code point is encoded back to a byte stream, it just becomes the original byte again. This allows for roundtripping data from argv that is not valid in the current encoding (i.e. a filename named in something other than the current locale) through the native Python string type and back to bytes with no loss of information.

As you can see, the situation is pretty messy :-)


I can only speak about Windows for now. On Windows, code pages are only meant for legacy applications and are not used by the system or by modern applications. Windows uses UTF-16 (and has done so for ages) for everything: text display, file names, the terminal, the system API. Conversions between UTF-16 and the legacy code pages are only performed at the highest possible level, directly at the interface between the system and the application (technically, the older API functions are implemented twice: one function FunctionW that does the real work and expects UTF-16 strings, and one compatibility function FunctionA that simply converts input strings from the current (thread) code page to UTF-16, calls FunctionW, and converts the results back).

Tab-completion should always yield UTF-16 strings (it definitely does when using a TrueType font) because the console itself uses only UTF-16. The tab-completed UTF-16 file name is handed over to the application. If that application is a legacy application (i.e., it uses main instead of wmain/GetCommandLineW etc.), then the Microsoft C runtime (probably) uses GetCommandLineA to have the system convert the command line. So basically I think what you're saying about Windows is correct (only that there is probably no conversion involved while tab-completing): the argv array will always contain the arguments in the code page of the current application, because the information about which code page the original program used has been irreversibly lost during the intermediate UTF-16 stage.

The conclusion is, as always on Windows: avoid the legacy code pages and use the UTF-16 API wherever you can. If you have to use main instead of wmain (e.g., to be platform-independent), use GetCommandLineW instead of the argv array.


The output from your test app needed some modifications to make sense: you need the hex codes, and you need to get rid of the negative values. Otherwise you can't print things like UTF-8 special characters in a readable form.

First, the modified program:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }
    int len = 0;
    for (unsigned char *c = (unsigned char *)argv[1]; *c; c++, len++) {
        printf("%x ", *c);
    }
    printf("\nLength: %d\n", len);
    return 0;
}

Then, on my Ubuntu box, which uses UTF-8, I get this output:

$> gcc -std=c99 argc.c -o argc
$> ./argc 1ü
31 c3 bc
Length: 3

And here you can see that in my case ü is encoded as 2 bytes (c3 bc), and that the 1 is a single byte (31). More or less exactly what you would expect from a UTF-8 encoding.

And this matches what is in the LANG environment variable:

$> env | grep LANG
LANG=en_US.utf8

Hope this clarifies the Linux case a little.

/Good luck