Why does one long string take MORE space than lots of small strings? Why does one long string take MORE space than lots of small strings? arrays arrays

Why does one long string take MORE space than lots of small strings?


To expand on August's answer, I figured I should explain exactly how this is compiled and what takes up space at the binary level.

To simplify, I'll use the following example with shorter strings. The important part is that there are some duplicate strings.

public class StringArray{    private static final String[] stringArray = {        "AAA",        "AAB",        "AAA",        "AAC",        "AAA"    };}public class LongString{    private static final String longString =        "AAA"        + "AAB"        + "AAA"        + "AAC"        + "AAA"    ;       }

Now there are three important things to understand when compiling this code.

  1. Constant string concatenation is done at compile time. In fact, this is a special case of compile time simplification of constant expressions. You can find the exact rules for what is considered a constant expression in the Java Language Specification.
  2. Array brace initializers are syntactic sugar. The code is equivalent to creating an array and assigning the elements one by one. (Note that this is specific to Java bytecode. Dalvik (i.e. Android) does have special shorthand instructions for array initialization)
  3. Inline initializers are syntactic sugar. The code is equivalent to manually initializing the fields in the static initializer method.

Edit: One minor detail is that in the special case of static final fields initialized to a constant expression inline, the field is initialized using the ConstantValue attribute instead of in the static initializer and all uses are inlined. So in the case of LongString, #3 will actually result in different bytecode, but since the constant pool entries for the string are the same, the filesize taken up by the strings won't change.

Put them together, and the above code is equivalent to the following.

public class StringArray {    private static final String[] stringArray;    static {        String[] temp = new String[5];        temp[0] = "AAA";                temp[1] = "AAB";                temp[2] = "AAA";                temp[3] = "AAC";                temp[4] = "AAA";            stringArray = temp;        } }public class LongString {    private static final String longString;    static {        longString = "AAAAABAAAAACAAA";    } }

Now this desugared code still shows the duplicate strings multiple times in the array example. To understand the classfile size behavior, you have to understand what it is compiled to.

When you access a constant string, the bytecode contains a load constant instruction (ldc or ldc_w), which is a one byte opcode followed by an index into the class's constant pool. The constant pool is a seperate section of the classfile where a list of constants is stored. Obviously, the compiler will only store each constant once.

So the bytecode for StringArray looks something like this (stripping away a few details that aren't relevant here). Note that there are only 3 unique strings stored in the constant pool. (Actually there are a lot more constant pool entries relating to other parts of the classfile, but they're not important here).

.class super StringArray.field static final private stringArray [Ljava/lang/String;.const [1] = String 'AAA'.const [2] = String 'AAB'.const [3] = String 'AAC'.method static <clinit> : ()V    iconst_5    anewarray java/lang/String    astore_0    aload_0    iconst_0    ldc [1]    aastore    aload_0    iconst_1    ldc [2]    aastore    aload_0    iconst_2    ldc [1]    aastore    aload_0    iconst_3    ldc [3]    aastore    aload_0    iconst_4    ldc [1]    aastore    aload_0    putstatic StringArray stringArray [Ljava/lang/String;    return.end method

Whereas LongString looks something like

.class super LongString.field static final private longString Ljava/lang/String;.const [1] = String 'AAAAABAAAAACAAA'.method static <clinit> : ()V    ldc [1]    putstatic LongString longString Ljava/lang/String;    return.end method

So with the first version, duplicate strings can be stored only once, whereas with the second version, the entire string has to be stored. So is the first version always better? Not so fast. It has the advantage of not storing duplicated strings, but as you may have noticed, there is a big per element overhead in the array initialization. Which one is better depends on how long your strings are and how many duplicates you have.

P.S. At the binary level, constant strings are encoded with a modified UTF8 encoding. The upshot is that characters 1-127 are one byte each but null characters are two bytes. So you can save some space by offsetting everything 1.


You have 18 duplicate Strings in your array (StringArray class). Since each array entry is a separate String literal, there will only be constant pool entries for unique Strings. However, in your second example (LongString class), all the duplicates are kept and concatenated at compile-time.

It's a bit hard to look at the class file you posted, because of the size of the Strings. I used a shorter version, but it still has the same concept:

public class LongString{    private static final String longString =         "a" + "a" + "a" + "a" + "a" + "a" + "a" + "a" + "b";}

public class StringArray{    private static final String[] stringArray = {        "a", "a", "a", "a", "a", "a", "a", "a",        "b"    };}

If we open up the class file LongString.class, you can see the entire String:

LongString.class

But if you look at the class file StringArray.class, duplicates have been removed:

enter image description here

The space between a and b is 01 00 01, which marks a CONSTANT_Utf8_info entry of length 1.

In your case, reusing Strings would save 2,304 characters.