
format string vulnerability - printf


I think that the paper provides its printf() examples in a somewhat confusing way because the examples use string literals for format strings, and those don't generally permit the type of vulnerability being described. The format string vulnerability as described here depends on the format string being provided by user input.
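To make the distinction concrete, here is a minimal sketch of the safe pattern: the format string is a fixed literal and the untrusted text is passed only as an argument. The helper name `render_untrusted` is hypothetical, not something from the paper.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy untrusted text into `out` as plain data.
 * The format is a fixed literal, so conversion specifiers that appear
 * inside `untrusted` are never interpreted.
 */
int render_untrusted(char *out, size_t out_size, const char *untrusted)
{
    return snprintf(out, out_size, "%s", untrusted);
}
```

Passing `"%08x.%08x|%s|"` to this helper yields exactly those characters in `out`. The vulnerable equivalent would be `snprintf(out, out_size, untrusted)`, where the same text is interpreted as a format.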

So the example:

printf ("\x10\x01\x48\x08_%08x.%08x.%08x.%08x.%08x|%s|");

Might better be presented as:

/*
 * in a real program, some user input source would be copied
 * into the `outstring` buffer
 */
char outstring[80] = "\x10\x01\x48\x08_%08x.%08x.%08x.%08x.%08x|%s|";
printf(outstring);

Since the outstring array is an automatic, the compiler will likely put it on the stack. After copying the user input to the outstring array, it'll look like the following as 'words' on the stack (assuming little endian):

outstring[0c]               // etc...
outstring[08] 0x30252e78    // from "x.%0"
outstring[04] 0x3830255f    // from "_%08"
outstring[00] 0x08480110    // from "\x10\x01\x48\x08"
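The word values in that listing can be checked with a small sketch that assembles four consecutive bytes the way a little-endian CPU would read them. The helper `le_word` is a hypothetical name introduced only for this illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine four consecutive bytes into the 32-bit
 * value a little-endian CPU would load from that address.
 */
uint32_t le_word(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Applied to the start of the example string, offsets 0, 4 and 8 give 0x08480110, 0x3830255f and 0x30252e78, matching the listing.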

The compiler will put other items on the stack as it sees fit (other local variables, saved registers, whatever).

When the printf() call is about to be made, the stack might look like:

outstring[0c]               // etc...
outstring[08] 0x30252e78    // from "x.%0"
outstring[04] 0x3830255f    // from "_%08"
outstring[00] 0x08480110    // from "\x10\x01\x48\x08"
var1
var2
var3
saved ECX
saved EDI

Note that I'm completely making those entries up - each compiler will use the stack in different ways, so a format string vulnerability has to be custom crafted for a particular exact scenario. In other words, you won't always use 5 dummy format specifiers like in this example - as the attacker you'd need to figure out how many dummies the particular vulnerability would need.
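Because the number of dummies varies, a test string for your own program is usually generated rather than typed by hand. The following is a minimal sketch under that assumption; `build_probe` is a hypothetical helper, and the "%08x." unit is simply the one used in this example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write `count` copies of "%08x." into `buf`.
 * Returns the resulting length, or -1 if the buffer is too small.
 */
int build_probe(char *buf, size_t size, int count)
{
    size_t used = 0;

    if (size == 0)
        return -1;
    buf[0] = '\0';

    for (int i = 0; i < count; ++i) {
        /* "%%" in the literal emits a single '%' into the probe string */
        int n = snprintf(buf + used, size - used, "%%08x.");
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

With `count` set to 3 the buffer holds `%08x.%08x.%08x.`; varying `count` is the trial-and-error step.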

Now to call printf(), the argument (the address of outstring) is pushed on to the stack and printf() is called, so the argument area of the stack looks like:

outstring[0c]               // etc...
outstring[08] 0x30252e78    // from "x.%0"
outstring[04] 0x3830255f    // from "_%08"
outstring[00] 0x08480110    // from "\x10\x01\x48\x08"
var1
var2
var3
saved ECX
saved EDI
&outstring   // the one real argument to `printf()`

However, printf() doesn't really know anything about how many arguments have been placed on the stack for it - it goes by the format specifiers it finds in the format string (the one argument it's 'sure' to get). So printf() gets the format string argument and starts processing it. The 1st "%08x" corresponds to the 'saved EDI' in my example, the next "%08x" prints the 'saved ECX', and so on. The "%08x" format specifiers just eat up data on the stack until printf() gets back to the string the attacker was able to input. Determining how many of those are needed is something an attacker would do by a kind of trial and error (probably by a test run that has a whole slew of "%08x" formats until he can 'see' where the format string starts).
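The same mechanism can be seen in a toy variadic function. This is only a sketch, not how any real printf() is implemented: the hypothetical `sum_by_spec` consumes one `int` for every 'd' in its first argument and has no other way of knowing how many arguments were passed.

```c
#include <stdarg.h>

/*
 * Hypothetical toy, not the real printf(): every 'd' in `spec` pulls one
 * int from the variadic arguments. The function trusts `spec` completely.
 */
long sum_by_spec(const char *spec, ...)
{
    va_list ap;
    long total = 0;

    va_start(ap, spec);
    for (; *spec != '\0'; ++spec) {
        if (*spec == 'd')
            total += va_arg(ap, int);   /* no check that an argument exists */
    }
    va_end(ap);
    return total;
}
```

`sum_by_spec("dd", 40, 2)` returns 42. A call such as `sum_by_spec("ddd", 1)` is undefined behaviour, for the same reason the printf() call in the question is: the function reads arguments that were never supplied.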

Anyway, when printf() gets to processing the "%s" format specifier, it has consumed all the stack entries up to where the outstring buffer resides. The "%s" specifier treats its stack entry as a pointer, and the string that the user has put into that buffer has been carefully crafted to have a binary representation of 0x08480110, so printf() will print out whatever is at that address as an ASCIIZ string.


You have 6 format specifiers (5 lots of %08x and one of %s), but you do not provide values for those format specifiers. You immediately fall into the realm of undefined behaviour - anything could happen and there is no wrong answer.
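The count of six can be reproduced with a simplified scanner. This is a sketch under stated assumptions: `count_conversions` is a hypothetical helper that handles "%%" and skips flags, width, precision and length modifiers, but ignores `*` widths and positional arguments.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified counter of the arguments a printf-style
 * format would consume. Does not handle '*' or "%n$" positional forms.
 */
size_t count_conversions(const char *fmt)
{
    size_t count = 0;

    while (*fmt != '\0') {
        if (*fmt++ != '%')
            continue;
        if (*fmt == '%') {              /* "%%" is a literal percent sign */
            ++fmt;
            continue;
        }
        /* skip flags, width, precision and length modifiers */
        fmt += strspn(fmt, "-+ #0123456789.hlLjzt");
        if (*fmt != '\0') {             /* the conversion character itself */
            ++count;
            ++fmt;
        }
    }
    return count;
}
```

For the format string in the question it returns 6, while the call supplies zero matching arguments.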

However, in the normal course of events, the values passed to printf() would have been stored on the stack, so the code in printf() reads values off the stack as if the extra values had been passed. The function return address is on the stack, too. There is no guarantee that I can see that the value 0x08480110 will actually be produced. This sort of attack very much depends on the specific program and faulty function call, and you might well get a very different value. The example code is most likely written assuming a 32-bit Intel (little-endian) CPU - rather than a 64-bit or big-endian CPU.


Adapting the code fragment, compiling it into a complete program, ignoring the compilation warnings, using a 32-bit compilation on MacOS X 10.6.7 with GCC 4.2.1 (XCode 3), the following code:

#include <stdio.h>

static void somefunc(void)
{
    printf("AAAAAAAAAAAAAAAA.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.|%s|\n");
}

int main(void)
{
    char buffer[160] =
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz01234";
    somefunc();
    return 0;
}

produces the following result:

 AAAAAAAAAAAAAAAA.0x000000A0.0xBFFFF11C.0x00001EC4.0x00000000.0x00001E22.0xBFFFF1C8.0x00001E5A.|abcdefghijklmnopqrstuvwxyz012345abcdefghijklmnopqrstuvwxyz012345abcdefghijklmnopqrstuvwxyz012345abcdefghijklmnopqrstuvwxyz012345abcdefghijklmnopqrstuvwxyz01234|

As you can see, I eventually 'found' the string in the main program from the printf() statement. When I compiled it in 64-bit mode, I got a core dump instead. Both results are perfectly correct; the program invokes undefined behaviour, so anything the program does is valid. If you're curious, search for 'nasal demons' for more information on undefined behaviour.

And get used to experimenting with these sorts of issues.


Another variation

#include <stdio.h>

static void somefunc(void)
{
    char format[] =
        "AAAAAAAAAAAAAAAA.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X\n"
        ".0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X\n"
        ".0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X.0x%08X\n";
    printf(format, 1);
}

int main(void)
{
    char buffer[160] =
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz012345"
        "abcdefghijklmnopqrstuvwxyz01234";
    somefunc();
    return 0;
}

This produces:

AAAAAAAAAAAAAAAA.0x00000001.0x00000099.0x8FE467B4.0x41000024.0x41414141
.0x41414141.0x41414141.0x2E414141.0x30257830.0x302E5838.0x38302578.0x78302E58
.0x58383025.0x2578302E.0x2E583830.0x30257830.0x2E0A5838.0x30257830.0x302E5838

You might recognize the format string in the hex output - 0x41 is capital A, for example.
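To read the hex words as text, each one can be unpacked low byte first, which is the order the bytes had in little-endian memory. The helper `word_to_text` below is a hypothetical name for a minimal sketch of that decoding; non-printable bytes are shown as '?'.

```c
#include <stdint.h>

/*
 * Hypothetical helper: unpack a 32-bit word into the four characters it
 * held in little-endian memory order. Non-printable bytes become '?'.
 */
void word_to_text(uint32_t word, char out[5])
{
    for (int i = 0; i < 4; ++i) {
        unsigned char byte = (unsigned char)(word >> (8 * i));
        out[i] = (byte >= 0x20 && byte < 0x7f) ? (char)byte : '?';
    }
    out[4] = '\0';
}
```

Decoded this way, 0x41414141 is "AAAA", 0x2E414141 is "AAA." and 0x30257830 is "0x%0", which are consecutive pieces of the format string.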

The 64-bit output from that code is both similar and different:

AAAAAAAAAAAAAAAA.0x00000001.0x00000000.0x00000000.0xFFE0082C.0x00000000
.0x41414141.0x41414141.0x2578302E.0x30257830.0x38302578.0x58383025.0x0A583830
.0x2E583830.0x302E5838.0x78302E58.0x2578302E.0x30257830.0x38302578.0x38302578


You misunderstood the paper.

The text you linked is assuming that the current position on the stack is 0x08480110 (look at the surrounding text). The printf() will dump data from wherever on the stack you happen to be.

The \x10\x01\x48\x08 at the beginning of the format string is merely to print the (assumed) address to stdout in front of the dumped data. In no way do these numbers modify the address from which the data is dumped.