Why is address 0x400000 chosen as a start of text segment in x86_64 ABI?



Bottom line: some technical limitations that amd64 has in using large addresses suggest dedicating the lower 2GiB of address space to code and data for efficiency. Thus the stack has been relocated out of this range.


In the i386 ABI¹:

  • The stack is located before the code, growing downwards from just under 0x8048000, which provides "a little over 128 MB for the stack and about 2 GB for text and data" (p. 3-22).
  • Dynamic segments start at 0x80000000 (2GiB).
  • The kernel occupies the "reserved area" at the top, which the spec allows to be up to 1GiB, i.e. starting no lower than 0xC0000000 (p. 3-21), and that is where it typically starts.
  • The main program is not required to be position-independent.
  • An implementation is not required to catch null pointer access (p. 3-21), but it's reasonable to expect that the part of the stack space beyond 128MiB (the extra 288KiB) will be reserved for that purpose.
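The figures in the list above can be checked with a little arithmetic. This is only a sketch: the constant and function names are mine, not the spec's, and the addresses are the ones quoted from the i386 ABI above.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Addresses quoted from the i386 ABI layout described above.
I386_TEXT_BASE = 0x8048000       # text starts here; the stack grows down from just below
I386_DYNAMIC_BASE = 0x80000000   # dynamic segments
I386_RESERVED_BASE = 0xC0000000  # lowest allowed start of the reserved (kernel) area


def split_mib_kib(addr):
    """Split an address into whole MiB plus the leftover in KiB."""
    mib, rest = divmod(addr, MIB)
    return mib, rest // KIB


# Room below the text base (stack), between text base and dynamic segments
# (text and data), and from the reserved area to the top of the 32-bit space.
stack_room = split_mib_kib(I386_TEXT_BASE)
text_and_data_bytes = I386_DYNAMIC_BASE - I386_TEXT_BASE
reserved_bytes = (1 << 32) - I386_RESERVED_BASE

print(stack_room)                  # (128, 288): 128 MiB plus 288 KiB
print(text_and_data_bytes / GIB)   # a bit under 2 GiB
print(reserved_bytes // GIB)       # 1
```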

amd64 (whose ABI is formulated as an amendment to the i386 one (p. 9)) has a vastly bigger (48-bit) address space, but most instructions only accept 32-bit immediate operands (which include direct addresses and offsets in jump instructions). Handling larger values takes more work and produces less efficient code, especially once instruction interdependency is taken into consideration. The authors summarize the measures to work around these limitations by introducing a few "code models" that they recommend using to "allow the compiler to generate better code" (p. 33).

  • Specifically, the first of them, the "Small code model", suggests using addresses "in the range from 0 to 2^31 - 2^24 - 1 or from 0x00000000 to 0x7effffff", which allows some very efficient relative references and array iteration. This is 1.98GiB, which is more than enough for many programs.
  • The "Medium code model" is based on the previous one. It splits the data into a "fast" part under the above boundary and a "slower" remaining part that requires a special instruction to access, while the code remains under the boundary.
  • Only the "large" model makes no assumptions about sizes, requiring the compiler "to use the movabs instruction, as in the medium code model, even for dealing with addresses inside the text section. Additionally, indirect branches are needed when branching to addresses whose offset from the current instruction pointer is unknown." The authors go on to suggest splitting the code base into multiple shared libraries, since these measures are not needed for relative references with offsets that are known to be within bounds (as outlined in "Small position independent code model").
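To make the small code model's bound concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and the "fits in a sign-extended 32-bit immediate" test is my simplification of the encoding constraint described above, not text from the ABI.

```python
# Upper bound of the small code model, as quoted from the ABI above.
SMALL_MODEL_LIMIT = 2**31 - 2**24 - 1


def fits_simm32(value):
    """True if value is representable as a sign-extended 32-bit immediate."""
    return -(1 << 31) <= value < (1 << 31)


def in_small_model(addr):
    """True if addr lies inside the small code model's address range."""
    return 0 <= addr <= SMALL_MODEL_LIMIT


print(hex(SMALL_MODEL_LIMIT))            # 0x7effffff
print(in_small_model(0x400000))          # True: the default text base qualifies
print(fits_simm32(0x80000000))           # False: 2 GiB no longer fits
# The bound leaves 2^24 bytes of headroom below 2^31, so an address at the
# limit plus an offset of up to 2^24 is still encodable.
print(fits_simm32(SMALL_MODEL_LIMIT + 2**24))   # True
```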

Thus the stack was moved to just under the shared library space (0x80000000000, 8TiB): its addresses are never immediate operands but are always referenced either indirectly or with lea/mov from another reference, so only the relative-offset limitations apply.


The above explains why the loading address was moved to a lower address. Now, why was it moved to exactly 0x400000 (4MiB)? Here I came up empty, so, summarizing what I've read in the ABI specs, I can only guess that it felt "just right":

  • It's large enough to catch any likely incorrect structure offset, allowing for larger data units that amd64 operates on, yet small enough to not waste much of the valuable starting 2GiB of address space.
  • It's equal to the largest practical page size to date and is a multiple of all other virtual memory unit sizes one can think of.
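Both guesses can be checked numerically. The sketch below uses x86 page sizes that I know of (the table and helper names are my own); note that 1 GiB pages, which later x86-64 CPUs support, are the one size 0x400000 is not a multiple of.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

TEXT_BASE = 0x400000

# x86 page sizes, listed here as an illustration rather than an exhaustive table.
X86_PAGE_SIZES = {
    "4 KiB": 4 * KIB,   # base page size
    "2 MiB": 2 * MIB,   # large page with PAE / long mode
    "4 MiB": 4 * MIB,   # large page on 32-bit x86 without PAE
    "1 GiB": 1 * GIB,   # large page on later x86-64 CPUs
}


def aligned_page_sizes(addr):
    """Names of the page sizes that addr is a multiple of."""
    return [name for name, size in X86_PAGE_SIZES.items() if addr % size == 0]


def fraction_of_low_2gib(addr):
    """Share of the low 2 GiB that lies below addr."""
    return addr / (2 * GIB)


print(aligned_page_sizes(TEXT_BASE))     # ['4 KiB', '2 MiB', '4 MiB']
print(fraction_of_low_2gib(TEXT_BASE))   # 0.001953125, i.e. 1/512
```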

¹ Note that actual 32-bit x86 Linux systems have been deviating from this layout more and more over time. But we're talking about the ABI spec here, since the amd64 one is formally based on it rather than on any derived layout (see its paragraph for the citation).


Static code/data at low addresses, stack at high addresses, is the traditional model. x86-64 follows that; i386 was the unusual one. (With "the heap" in the middle, even though that's not a real thing in asm; there's .data/.bss above .text, brk adding more space just past .bss, and mmap picking random addresses in between.)

The i386 layout left room to put the stack below code, but modern Linux didn't do that anyway. You still get stack addresses like 0xffffe000 in 32-bit code (e.g. under a 64-bit kernel). I'm not sure where a modern build of a 32-bit kernel would put user-space stacks. Of course that's just for the main thread's stack; stacks for new threads have to be allocated manually, usually with mmap.


Why 0x400000 (4 MiB) specifically for the ld default base address?

High enough to avoid mmap_min_addr (default 64k) and leave a gap so that a NULL deref is still likely to fault noisily instead of silently reading code, even if it's something like ptr[i] with some large i. Otherwise, near the bottom of virtual address space is a good place.
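As a rough illustration of how large i can get in ptr[i] before a NULL-based access stops faulting, here is a sketch. It assumes nothing else is mapped below the executable's base, which is not guaranteed; the function name and the 64k figure (taken from the text above; the real value is a sysctl) are illustrative only.

```python
MMAP_MIN_ADDR = 64 * 1024   # default quoted above; the actual value is configurable
TEXT_BASE = 0x400000        # default ld base address discussed here


def null_index_faults(index, elem_size, first_mapped=TEXT_BASE):
    """True if NULL[index] lands below the first mapped byte of the executable.

    Assumes nothing else has been mapped in the gap below first_mapped.
    """
    addr = index * elem_size
    return 0 <= addr < first_mapped


# With 8-byte elements, indices up to (but not including) 4 MiB / 8 still fault.
largest_faulting_index = TEXT_BASE // 8 - 1

print(null_index_faults(0, 8))                            # True: plain NULL deref
print(null_index_faults(100_000, 8))                      # True: 800,000 < 4 MiB
print(null_index_faults(largest_faulting_index + 1, 8))   # False: reaches 0x400000
```

If the base were only just above mmap_min_addr, the same check with first_mapped=MMAP_MIN_ADDR shows the guard shrinking to 8192 eight-byte elements.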

Also to optimize the page tables: they're a sparse radix tree (diagram in this answer). Ideally the pages in use share as many higher levels of the tree as possible, so higher levels of the tree have mostly "not present" entries. Less for the kernel to allocate & manage, and the HW page-table walker can internally cache higher level entries (PDE cache) to speed up TLB misses in 4k pages when they're in the same 2M, 1G, or 512G region. And the page-walker(s) accesses memory through cache, so smaller page tables also mean less cache footprint from those accesses.

0x400000 = 4MiB. It's the start of a 2MiB group of pages near the start of the low 1GiB of virtual address space. So an executable with larger code and/or static data that needs multiple pages will have them all in the same subtree of the page tables, touching as few different 1G and 2M regions as possible.

Well, almost as few 1G regions as possible: starting at 0x40000000 (1 GiB) would have put it at the very start of a 1GiB region, not skipping the first two 2MiB largepages of it. But that only matters if your static data size is just below 1GiB; otherwise you either still fit in the first 1GiB hugepage region or extend into the 2nd one anyway.
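The page-table argument can be sketched by splitting an address into its x86-64 4-level paging indices (9 bits per level above a 12-bit page offset) and counting how many tables a mapped range touches. The function names are mine, and the count assumes 4k pages throughout.

```python
MIB, GIB = 1 << 20, 1 << 30


def page_table_indices(vaddr):
    """Split a virtual address into x86-64 4-level paging indices."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # each entry covers 512 GiB
        "pdpt": (vaddr >> 30) & 0x1FF,   # each entry covers 1 GiB
        "pd": (vaddr >> 21) & 0x1FF,     # each entry covers 2 MiB
        "pt": (vaddr >> 12) & 0x1FF,     # each entry covers 4 KiB
    }


def tables_touched(start, size):
    """Count the distinct tables needed to map [start, start + size) with 4k pages."""
    end = start + size - 1
    return {
        "pt_tables": (end >> 21) - (start >> 21) + 1,    # one PT per 2 MiB region
        "pd_tables": (end >> 30) - (start >> 30) + 1,    # one PD per 1 GiB region
        "pdpt_tables": (end >> 39) - (start >> 39) + 1,  # one PDPT per 512 GiB region
    }


print(page_table_indices(0x400000))
# {'pml4': 0, 'pdpt': 0, 'pd': 2, 'pt': 0}: the start of the third 2 MiB region

# A 4 MiB image stays within two page tables when it starts 2 MiB-aligned.
print(tables_touched(0x400000, 4 * MIB)["pt_tables"])   # 2
print(tables_touched(0x500000, 4 * MIB)["pt_tables"])   # 3

# The corner case from the last paragraph: an image just under 1 GiB spills
# into a second 1 GiB region at 0x400000 but not at 0x40000000.
almost_1g = GIB - 2 * MIB
print(tables_touched(0x400000, almost_1g)["pd_tables"])     # 2
print(tables_touched(0x40000000, almost_1g)["pd_tables"])   # 1
```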