Why is executing Java code in comments with certain Unicode characters allowed?

java unicode comments

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3, this allows any ASCII-based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.
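As a minimal sketch of what "anywhere in the file" means in practice, the following class uses escapes inside an identifier and in place of the quotes of a string literal. The class name EscapeDemo and its members are hypothetical and only serve as illustration; the point is that the compiler decodes the escapes before it recognizes any tokens.

```java
class EscapeDemo {
    // The field name below is partly written as an escape; the compiler sees "answer".
    static int \u0061nswer = 42;

    static String greeting() {
        // Both escapes decode to double quotes, so this is an ordinary string literal.
        return \u0022hi\u0022;
    }

    public static void main(String[] args) {
        System.out.println(answer + " " + greeting()); // prints "42 hi"
    }
}
```

Code like this is legal but unreadable, which is exactly why the advice further down is to avoid escapes unless a character cannot be represented any other way.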

There are many gotchas on this theme, and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?

\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.

Source: Java: Executing code in comments?!

Since this hasn’t been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was that it allows lossless translations of Java source code between different character encodings. Today, there is widespread Unicode support, and this doesn’t look like a problem, but back then it wasn’t easy for a developer from a Western country to receive some source code containing Asian characters from an Asian colleague, make some changes (including compiling and testing it), and send the result back, all without damaging anything.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process, and the interesting point is that the translation can be done by a tool that doesn’t need to know anything about Java source code syntax, because the translation rule does not depend on it. This works because the translation back to the actual Unicode characters inside the compiler happens independently of the Java source code syntax as well. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.

This is the reason for another weird feature which hasn’t even been mentioned: the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escape sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn’t change, but when converting in the other direction, the tool should just remove one u from such sequences and replace only sequences containing a single u with their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess no one ever used that feature…
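The round trip described above can be sketched roughly as follows. This is only an illustration, not an existing tool: the class UnicodeEscapeRoundTrip and its methods toAscii and fromAscii are hypothetical names. It assumes the input is valid Java source, so that every eligible backslash followed by u starts a well-formed escape with four hex digits, and it assumes that a non-ASCII character is never directly preceded by an odd run of backslashes. It follows the JLS rule that a backslash only starts a Unicode escape when it is preceded by an even number of backslashes.

```java
final class UnicodeEscapeRoundTrip {

    /** Escapes non-ASCII characters and adds one extra u to every existing escape. */
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // length of the current run of consecutive backslashes
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                backslashes++;
                out.append(c);
                // Only a backslash preceded by an even number of backslashes starts an escape.
                boolean eligible = backslashes % 2 == 1;
                if (eligible && i + 1 < src.length() && src.charAt(i + 1) == 'u') {
                    out.append('u'); // existing escape: mark it with an additional u
                }
            } else {
                backslashes = 0;
                if (c > 127) {
                    out.append(String.format("\\u%04x", (int) c)); // new single-u escape
                } else {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }

    /** Reverses toAscii: single-u escapes become characters, longer ones lose one u. */
    static String fromAscii(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c != '\\') {
                backslashes = 0;
                out.append(c);
                i++;
                continue;
            }
            backslashes++;
            boolean eligible = backslashes % 2 == 1;
            if (!eligible || i + 1 >= src.length() || src.charAt(i + 1) != 'u') {
                out.append(c); // ordinary backslash, copy it unchanged
                i++;
                continue;
            }
            int j = i + 1;
            while (j < src.length() && src.charAt(j) == 'u') {
                j++; // skip over all u characters of this escape
            }
            int uCount = j - (i + 1);
            if (uCount > 1) {
                // The escape was present before toAscii ran: keep it, minus one u.
                out.append('\\');
                for (int k = 1; k < uCount; k++) {
                    out.append('u');
                }
                out.append(src, j, j + 4);
            } else {
                // The escape was produced by toAscii: restore the original character.
                out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
            }
            i = j + 4;
            backslashes = 0;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 and \\u00e9"; // one real e-acute, one pre-existing escape
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original)); // prints true
    }
}
```

In this example the ASCII form contains \u00e9 for the real character and \uu00e9 for the escape that was already there, so converting back can tell the two apart. Neither method looks at comments, literals, or any other Java syntax, which is the property the answer describes.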

I'm going to completely ineffectually add the point, just because I can't help myself and I haven't seen it made yet, that the question is invalid since it contains a hidden premise which is wrong, namely that the code is in a comment!

In Java source code \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually corresponds to syntactically is:

public static void main(String... args) {
   // The comment below is no typo.
   //
   System.out.println("Hello World!");
}

IMHO the most correct answer is therefore: the code executes because it isn't in a comment; it's on the next line. "Executing code in comments" is not allowed in Java, just like you would expect.
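A small self-checking sketch of this point, with hypothetical names (CommentDemo, calls, run): the statement after the escape looks as if it were part of the comment, but the escape is a line terminator, so the statement is compiled and executed like any other.

```java
class CommentDemo {
    static int calls = 0;

    static void run() {
        // Looks like one comment line, but the escape ends it: \u000d calls++;
    }

    public static void main(String[] args) {
        run();
        System.out.println(calls); // prints 1
    }
}
```

If the increment really were inside the comment, the counter would still be 0 after run() returns.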

Much of the confusion stems from the fact that syntax highlighters and IDEs aren't sophisticated enough to take this situation into account. They either don't process the Unicode escapes at all, or they do it after parsing the code instead of before, as javac does.
