Why does Python code run faster in a function?

python performance profiling benchmarking cpython

Inside a function, the bytecode is:

      2           0 SETUP_LOOP              20 (to 23)
                  3 LOAD_GLOBAL              0 (xrange)
                  6 LOAD_CONST               3 (100000000)
                  9 CALL_FUNCTION            1
                 12 GET_ITER
            >>   13 FOR_ITER                 6 (to 22)
                 16 STORE_FAST               0 (i)

      3          19 JUMP_ABSOLUTE           13
            >>   22 POP_BLOCK
            >>   23 LOAD_CONST               0 (None)
                 26 RETURN_VALUE

At the top level, the bytecode is:

      1           0 SETUP_LOOP              20 (to 23)
                  3 LOAD_NAME                0 (xrange)
                  6 LOAD_CONST               3 (100000000)
                  9 CALL_FUNCTION            1
                 12 GET_ITER
            >>   13 FOR_ITER                 6 (to 22)
                 16 STORE_NAME               1 (i)

      2          19 JUMP_ABSOLUTE           13
            >>   22 POP_BLOCK
            >>   23 LOAD_CONST               2 (None)
                 26 RETURN_VALUE

The only difference inside the loop is the store opcode: STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local, but at the top level it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the top-level code I had to compile it first with the compile builtin.
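As a rough sketch of that procedure, the snippet below disassembles the same loop once as a function and once as top-level code compiled from a string. The listings above appear to come from Python 2 (xrange, SETUP_LOOP); this sketch assumes Python 3 and uses range instead, so the opcodes around the loop will look different, but the store opcode for i should still differ in the same way. The names loop_in_function, TOP_LEVEL_SOURCE and opnames are made up for illustration.

```python
import dis


def loop_in_function():
    # Inside a function, i is a local variable.
    for i in range(10):
        pass


# The same loop as module-level source; compile() turns it into a code
# object that dis can inspect without running it.
TOP_LEVEL_SOURCE = "for i in range(10):\n    pass\n"
top_level_code = compile(TOP_LEVEL_SOURCE, "<top-level>", "exec")


def opnames(code_or_func):
    """Return the opcode names of a function or code object, in order."""
    return [ins.opname for ins in dis.get_instructions(code_or_func)]


function_ops = opnames(loop_in_function)
top_level_ops = opnames(top_level_code)

# Show only the store opcodes, which is where the two versions differ.
print([op for op in function_ops if op.startswith("STORE_")])
print([op for op in top_level_ops if op.startswith("STORE_")])
```

Calling dis.dis(loop_in_function) or dis.dis(top_level_code) instead prints the full listing in the format shown above.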


You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython compiles source code to bytecode, which the interpreter then runs. When a function is compiled, its local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can't dynamically add local variables to a function. Retrieving a local variable is then literally a pointer lookup into that array and a refcount increase on the PyObject, which is trivial.
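The name-to-index assignment can be observed from Python through the function's code object. This is only a small sketch: the function add and the slots dict are hypothetical names used for illustration, and the dict is built here just for display (CPython itself does not keep such a dict for locals).

```python
def add(a, b):
    total = a + b
    return total


code = add.__code__

# co_varnames lists the local names in slot order (arguments first);
# the position of a name is the index used by the fast-locals opcodes.
slots = {name: index for index, name in enumerate(code.co_varnames)}

print(code.co_nlocals, slots)  # 3 {'a': 0, 'b': 1, 'total': 2}
```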

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.
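The effect of the global statement on the emitted opcode can be checked with dis. The sketch below assumes Python 3; counter, assign_local, assign_global and store_ops are illustrative names, not anything from the original answer.

```python
import dis

counter = 0


def assign_local():
    # No global statement: the assignment makes counter a local here,
    # and the module-level counter is left untouched.
    counter = 1


def assign_global():
    global counter  # tell the compiler not to treat counter as a local
    counter = 1


def store_ops(func):
    """Return the names of the store opcodes the compiler emitted for func."""
    return [
        ins.opname
        for ins in dis.get_instructions(func)
        if ins.opname.startswith("STORE_")
    ]


print(store_ops(assign_local), assign_local.__code__.co_varnames)
print(store_ops(assign_global), assign_global.__code__.co_varnames)
```

The first function should report a fast-local store and list counter among its local names, while the second should report STORE_GLOBAL and have no locals at all.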

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
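One way to see the effect for yourself is to time the same loop as a function and as top-level code. This is a minimal sketch, assuming Python 3 and a much smaller loop count than in the question; the names LOOP_SOURCE, loop_in_function and run_top_level are hypothetical. On CPython the function version is typically expected to come out ahead, but the ratio depends heavily on the interpreter version and machine, so no numbers are claimed here.

```python
import timeit

LOOP_SOURCE = "for i in range(200_000):\n    pass\n"


def loop_in_function():
    # i is a local variable, stored in the function's fast-locals array.
    for i in range(200_000):
        pass


# Compile the same loop as module-level code, where i is stored by name.
top_level_code = compile(LOOP_SOURCE, "<top-level>", "exec")


def run_top_level():
    # A fresh dict serves as both globals and locals for the executed code.
    exec(top_level_code, {})


# Take the best of several repeats to reduce noise from other processes.
function_seconds = min(timeit.repeat(loop_in_function, number=5, repeat=3))
top_level_seconds = min(timeit.repeat(run_top_level, number=5, repeat=3))

print(f"in a function: {function_seconds:.4f} s")
print(f"at top level:  {top_level_seconds:.4f} s")
```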


Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here's the bytecode for the function's loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will "predict" that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation loop, which adds overhead to every iteration of the loop.

To give some more technical detail about this optimization, here's a quote from the ceval.c file (the "engine" of Python's virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.
Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor's own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor's internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

    case FOR_ITER:                         // the FOR_ITER opcode case
        v = TOP();
        x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
        if (x != NULL) {
            PUSH(x);                       // put x on top of the stack
            PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
            PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
            continue;
        }
        // error-checking and more code for when the iterator ends normally

The PREDICT macro expands to if (*next_instr == op) goto PRED_##op, i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

    PREDICTED_WITH_ARG(STORE_FAST);
    case STORE_FAST:
        v = POP();                     // pop x back off the stack
        SETLOCAL(oparg, v);            // set it as the new local variable
        goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.
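To make the saving concrete, here is a toy model of the idea in Python. It is not how CPython is implemented: it only counts how many times a three-opcode loop body goes through the main dispatch, with FOR_ITER handling a following STORE_FAST inline (the "prediction") and never doing so for STORE_NAME. The function run_loop and everything inside it are invented for this sketch.

```python
FOR_ITER = "FOR_ITER"
STORE_FAST = "STORE_FAST"
STORE_NAME = "STORE_NAME"
JUMP_ABSOLUTE = "JUMP_ABSOLUTE"


def run_loop(store_op, iterations):
    """Run a toy 'for i in range(iterations): pass' loop.

    Returns the number of trips through the main dispatch.
    """
    program = [FOR_ITER, store_op, JUMP_ABSOLUTE]
    iterator = iter(range(iterations))
    stack = []
    variables = {}
    pc = 0
    dispatches = 0
    while True:
        op = program[pc]
        dispatches += 1  # one trip through the "switch"
        if op == FOR_ITER:
            try:
                stack.append(next(iterator))
            except StopIteration:
                return dispatches  # iterator exhausted: the loop ends
            pc += 1
            # Toy PREDICT(STORE_FAST): peek at the next opcode and, if it
            # matches, run its body here without another dispatch.
            if program[pc] == STORE_FAST:
                variables["i"] = stack.pop()
                pc += 1
        elif op in (STORE_FAST, STORE_NAME):
            variables["i"] = stack.pop()
            pc += 1
        elif op == JUMP_ABSOLUTE:
            pc = 0  # back to FOR_ITER


print(run_loop(STORE_FAST, 10))  # 21: two dispatches per iteration, plus the final FOR_ITER
print(run_loop(STORE_NAME, 10))  # 31: three dispatches per iteration, plus the final FOR_ITER
```

In this model the predicted pair costs one dispatch instead of two, which mirrors the "two opcodes run as if they were a single new opcode" description in the quoted comment.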

The Python wiki page has more information about how CPython's virtual machine works.
