Why are dict lookups always better than list lookups?

python performance optimization big-o python-internals

I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).

Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.

Your list lookups need to compute == a lot more than your dict lookups do.

Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
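To make the int-hashing behaviour concrete, here is a small check you can run yourself. It assumes CPython 3, where the reduction of large integers is exposed through `sys.hash_info.modulus`; the variable names are just for illustration.

```python
import sys

# Small non-negative ints hash to themselves.
small_hash = hash(5)

# -1 is the one small exception: CPython reserves -1 as an error code
# internally, so hash(-1) is reported as -2.
minus_one_hash = hash(-1)

# Large ints are reduced modulo sys.hash_info.modulus (CPython 3 assumption).
big = sys.hash_info.modulus + 7
big_hash = hash(big)

print(small_hash, minus_one_hash, big_hash)
```

None of the keys in the benchmark are anywhere near that modulus, which is why hashing costs next to nothing here.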

Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.

Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
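If you want to see the effect without relying on the small integer cache, the following sketch counts `==` calls directly. `Key` and `count_eq` are hypothetical helpers written for this illustration, and the counts assume CPython's lookup behaviour (identity check first, then hash comparison for dicts, then `==`).

```python
class Key:
    """Hypothetical key type that counts every call to __eq__."""

    eq_calls = 0  # class-wide counter, reset before each measurement

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def count_eq(container, target):
    """Return (found, number of __eq__ calls) for `target in container`."""
    Key.eq_calls = 0
    found = target in container
    return found, Key.eq_calls


keys = [Key(i) for i in range(1000)]
as_dict = dict.fromkeys(keys)

same_object = keys[-1]   # the very object stored in both containers
equal_copy = Key(999)    # equal to the stored key, but a different object

list_same = count_eq(keys, same_object)
list_copy = count_eq(keys, equal_copy)
dict_same = count_eq(as_dict, same_object)
dict_copy = count_eq(as_dict, equal_copy)

print(list_same, list_copy, dict_same, dict_copy)
```

On CPython I would expect the list to need 999 and 1000 `==` calls respectively, while the dict needs 0 for the identical object and 1 for the equal copy. That jump from 0 to 1 is the same thing the spike at 1024 entries shows for ints.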


The short answer is that lists use linear search, which is O(n), and dicts use hash lookup, which is O(1) on average.

In addition, dict searches can skip an equality test either 1) when hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies-equality optimization.
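A quick way to observe the identity shortcut is with NaN, which is never equal to itself. This is only a demonstration of the rule, assuming CPython's behaviour for `in` on lists and dicts; the variable names are illustrative.

```python
nan = float("nan")

# NaN compares unequal to everything, including itself.
equal_to_itself = nan == nan

# Membership still succeeds for the identical object, because the
# identity check runs before == is ever tried.
same_in_list = nan in [nan]
same_in_dict = nan in {nan: "value"}

# A different NaN object gets no identity match, and == fails too.
other_in_list = float("nan") in [nan]
other_in_dict = float("nan") in {nan: "value"}

print(equal_to_itself, same_in_list, same_in_dict, other_in_list, other_in_dict)
```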

Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94

Roughly the logic for searching lists is:

    for element in s:
        if element is target:
            # fast check for identity implies equality
            return True
        if element == target:
            # slower check for actual equality
            return True
    return False

For dicts the logic is roughly:

    h = hash(target)
    for i in probe_sequence(h, len(table)):
        element = key_table[i]
        if element is UNUSED:
            raise KeyError(target)
        if element is target:
            # fast path for identity implies equality
            return value_table[i]
        if h != h_table[i]:
            # unequal hashes implies unequal keys
            continue
        if element == target:
            # slower check for actual equality
            return value_table[i]

Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).

If your timing focuses on integers, there are some other effects at play as well. The hash of an int is the int itself, so hashing is very fast. It also means that if you're storing consecutive integers, there tend to be no collisions at all.
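Here is a simplified model of why consecutive ints don't collide. It assumes a table whose size is a power of two and looks only at the first probe, computed as the hash masked by `table_size - 1`; `first_slot` and the table size of 16 are illustrative choices, and the real probe sequence does more work after a collision.

```python
def first_slot(key, table_size):
    """First probe index, assuming table_size is a power of two."""
    return hash(key) & (table_size - 1)


table_size = 16  # assumed size, large enough to hold ten keys

# Consecutive ints hash to themselves, so each one gets its own slot.
consecutive_slots = [first_slot(k, table_size) for k in range(10)]

# Keys that differ by a multiple of the table size share a first slot.
colliding_slots = [first_slot(k, table_size) for k in (3, 19, 35)]

print(consecutive_slots)
print(colliding_slots)
```

The first list has ten distinct slots, so every lookup hits on the first probe. The second shows what it takes to force a collision with ints: keys spaced exactly a table size apart.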


You say "accessing a few elements in an array is faster than computing a hash".

A simple hashing rule for strings might be just a sum (with a modulo in the end). This is a branchless operation that can compare favorably with character comparisons, especially when there is a long match on the prefix.
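As a sketch of the kind of rule I mean (not what Python actually uses for `str`), here is a sum-with-modulo hash. The function name and the modulus of 1024 are made up for the example, and in pure Python this only illustrates the arithmetic; the branchless argument applies to compiled code.

```python
def sum_hash(text, modulus=1024):
    """Toy string hash: sum of code points, reduced modulo `modulus`."""
    total = 0
    for ch in text:
        total += ord(ch)  # one addition per character, no comparisons
    return total % modulus


abc_hash = sum_hash("abc")        # 97 + 98 + 99 = 294
listen_hash = sum_hash("listen")
silent_hash = sum_hash("silent")  # same letters, so the same sum

print(abc_hash, listen_hash, silent_hash)
```

The anagram pair shows the obvious weakness of such a rule: it is cheap, but it collides easily, which is why real implementations use something stronger.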
