file.encoding has no effect, LC_ALL environment variable does it

java linux file encoding internationalization

Note: I think I have finally nailed this down, though I cannot confirm that it is right. This is what I found from reading the code and running some tests, and I don't have more time to look into it. If anyone is interested, please check it and tell me whether this answer is right or wrong. I would be glad :)

The reference I used was from this tarball available at OpenJDK's site: openjdk-6-src-b25-01_may_2012.tar.gz

Java natively translates every string to the platform's local encoding in this function: jdk/src/share/native/common/jni_util.c - JNU_GetStringPlatformChars(). The system property sun.jnu.encoding determines the platform's encoding.
The value of sun.jnu.encoding is set in jdk/src/solaris/native/java/lang/java_props_md.c - GetJavaProperties(), using the setlocale() function of libc. The environment variable LC_ALL determines the value of sun.jnu.encoding. A value given on the command line with the -Dsun.jnu.encoding option to Java is ignored.
The call to File.exists() is implemented in jdk/src/share/classes/java/io/File.java and returns
return ((fs.getBooleanAttributes(this) & FileSystem.BA_EXISTS) != 0);
getBooleanAttributes() is implemented natively (I am skipping several steps of browsing through many files) in jdk/src/share/native/java/io/UnixFileSystem_md.c, in the function Java_java_io_UnixFileSystem_getBooleanAttributes0(). Here the macro WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path) converts the path string to the platform's encoding.
So a conversion to the wrong encoding sends the wrong C string (char array) to the subsequent stat() call, which then reports that the file cannot be found.
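
To check this on your own machine, here is a minimal sketch that prints the properties involved and approximates the bytes that would reach stat(). The class name PlatformEncodingCheck, its helper methods, and the sample file name are made up for illustration. Encoding the path with String.getBytes() is only my approximation of what JNU_GetStringPlatformChars() does, not the JDK's actual native code.

```java
import java.io.File;
import java.nio.charset.Charset;

public class PlatformEncodingCheck {

    // Charset named by sun.jnu.encoding, or the default charset if the
    // property is missing or names a charset this JVM does not support.
    static Charset jnuCharset() {
        String name = System.getProperty("sun.jnu.encoding");
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through.
            }
        }
        return Charset.defaultCharset();
    }

    // Approximation (an assumption, not the JDK's native code) of the
    // bytes handed to stat() for this path.
    static byte[] platformBytes(String path) {
        return path.getBytes(jnuCharset());
    }

    // Space-separated lowercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample name containing one non-ASCII character.
        String path = args.length > 0 ? args[0] : "caf\u00e9.txt";
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("path bytes       = " + hex(platformBytes(path)));
        System.out.println("exists           = " + new File(path).exists());
    }
}
```

Running it once with LC_ALL set to a UTF-8 locale and once with LC_ALL=C should show whether sun.jnu.encoding and the path bytes change with the environment on your setup.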

LESSON: LC_ALL is very important

I'm not sure where you read about file.encoding. I don't see it mentioned with the other standard properties as documented with System.getProperties. But judging from my experiments, it seems that this value influences the encoding of file content, not file names. System.out in particular will not print non-ASCII characters if file.encoding is POSIX.
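
As a small illustration of why non-ASCII characters disappear from output, here is a sketch that pushes a string through US-ASCII. I am assuming that a POSIX locale ends up with an ASCII default charset, which is what my experiments suggest but may differ between JDK versions. The class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

public class AsciiReplacementDemo {

    // Encode the text as US-ASCII and decode it again. Characters that
    // US-ASCII cannot represent are replaced by '?' during encoding.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The sample string ends in U+00E9, which is not an ASCII character.
        System.out.println(throughAscii("caf\u00e9"));
    }
}
```

This prints caf? because the last character cannot be encoded in US-ASCII. The same kind of replacement would explain what you see on System.out when the output encoding is ASCII.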

On the other hand, the Linux way to decide which encoding applies to file names is the LC_CTYPE facet of the current locale setting. I see no reason why Java should override this. As many other platforms (Windows in particular) always use Unicode for file names, not bytes, there is little point in exposing the byte-level details of the file system to a Java application.

Please see bug 4163515 at java.com. It explains that:

- file.encoding is specific to the Sun (now Oracle) implementation of the JVM; others may not support it.
- It should be considered read-only.
- To change it, you have to modify the environment in which the JVM runs (which is what you did with LC_ALL).

Also note that even if changing file.encoding "works" on your platform, you should not do it: it does not change the default encoding used by the Oracle JVM in general, only in some subsystems. As the bug shows, the default encoding used by the String constructors that take byte arrays is unaffected by this setting.
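
A related check you can run yourself: the sketch below sets file.encoding at runtime with System.setProperty() and compares Charset.defaultCharset() before and after. The class and method names are made up. It only covers the runtime case and says nothing about passing -Dfile.encoding on the command line.

```java
import java.nio.charset.Charset;

public class FileEncodingReadOnly {

    // Returns true if Charset.defaultCharset() is the same before and
    // after file.encoding is changed at runtime. The original property
    // value is restored afterwards.
    static boolean defaultCharsetUnchangedAfterSet(String newValue) {
        Charset before = Charset.defaultCharset();
        String old = System.getProperty("file.encoding");
        System.setProperty("file.encoding", newValue);
        try {
            return Charset.defaultCharset().equals(before);
        } finally {
            if (old != null) {
                System.setProperty("file.encoding", old);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        // Pick a value that differs from the current default charset.
        String other = "UTF-8".equals(Charset.defaultCharset().name())
                ? "ISO-8859-1" : "UTF-8";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("unchanged after setting file.encoding=" + other + ": "
                + defaultCharsetUnchangedAfterSet(other));
    }
}
```

The default charset is fixed once the JVM has initialized it, so the property change is not picked up. This fits the advice to treat file.encoding as read-only.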
