Optimising Python dictionary access code


node_after_b == node_a will try to call node_after_b.__eq__(node_a):

>>> class B(object):
...     def __eq__(self, other):
...         print "B.__eq__()"
...         return False
...
>>> class A(object):
...     def __eq__(self, other):
...         print "A.__eq__()"
...         return False
...
>>> a = A()
>>> b = B()
>>> a == b
A.__eq__()
False
>>> b == a
B.__eq__()
False
>>>

Try to override Node.__eq__() with an optimized version before resorting to C.
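For example, here is a minimal sketch of such an override (the Node class and its id attribute are assumptions, not the OP's actual code): compare identity first, then a single identifying field.

class Node(object):
    __slots__ = ('id',)               # assumption: nodes carry a single hashable id

    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        # Cheapest test first: an object is always equal to itself.
        if self is other:
            return True
        # Then compare only the identifying attribute.
        return self.id == other.id

    def __ne__(self, other):          # Python 2 does not derive != from ==
        return not self.__eq__(other)

    def __hash__(self):               # keep Node usable as a dict key / set member
        return hash(self.id)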

UPDATE

I ran this little experiment (Python 2.6.6):

#!/usr/bin/env python
# test.py
class A(object):
    def __init__(self, id):
        self.id = id

class B(A):
    def __eq__(self, other):
        return self.id == other.id
@profile
def main():
    list_a = []
    list_b = []
    for x in range(100000):
        list_a.append(A(x))
        list_b.append(B(x))

    ob_a = A(1)
    ob_b = B(1)
    for ob in list_a:
        if ob == ob_a:
            x = True
        if ob is ob_a:
            x = True
        if ob.id == ob_a.id:
            x = True
        if ob.id == 1:
            x = True
    for ob in list_b:
        if ob == ob_b:
            x = True
        if ob is ob_b:
            x = True
        if ob.id == ob_b.id:
            x = True
        if ob.id == 1:
            x = True

if __name__ == '__main__':
    main()

Results:

Timer unit: 1e-06 s

File: test.py
Function: main at line 10
Total time: 5.52964 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    10                                           @profile
    11                                           def main():
    12         1            5      5.0      0.0      list_a = []
    13         1            3      3.0      0.0      list_b = []
    14    100001       360677      3.6      6.5      for x in range(100000):
    15    100000       763593      7.6     13.8          list_a.append(A(x))
    16    100000       924822      9.2     16.7          list_b.append(B(x))
    17
    18         1           14     14.0      0.0      ob_a = A(1)
    19         1            5      5.0      0.0      ob_b = B(1)
    20    100001       500454      5.0      9.1      for ob in list_a:
    21    100000       267252      2.7      4.8          if ob == ob_a:
    22                                                       x = True
    23    100000       259075      2.6      4.7          if ob is ob_a:
    24                                                       x = True
    25    100000       539683      5.4      9.8          if ob.id == ob_a.id:
    26         1            3      3.0      0.0              x = True
    27    100000       271519      2.7      4.9          if ob.id == 1:
    28         1            3      3.0      0.0              x = True
    29    100001       296736      3.0      5.4      for ob in list_b:
    30    100000       472204      4.7      8.5          if ob == ob_b:
    31         1            4      4.0      0.0              x = True
    32    100000       283165      2.8      5.1          if ob is ob_b:
    33                                                       x = True
    34    100000       298839      3.0      5.4          if ob.id == ob_b.id:
    35         1            3      3.0      0.0              x = True
    36    100000       291576      2.9      5.3          if ob.id == 1:
    37         1            3      3.0      0.0              x = True

I was very surprised:

  • "dot" access (ob.property) seems to be very expensive (line 25 versus line 27).
  • there was not much difference between is and '==', at least for simple objects

Then I tried with more complex objects, and the results are consistent with the first experiment.
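To illustrate the first point, here is a minimal sketch (the function and variable names are hypothetical, not the OP's code) of hoisting an invariant attribute lookup out of a hot loop, which removes one "dot" access per iteration:

# Hypothetical hot loop: count nodes whose id matches a target node's id.
def count_matches_slow(nodes, target):
    count = 0
    for node in nodes:
        if node.id == target.id:      # two attribute lookups per iteration
            count += 1
    return count

def count_matches_fast(nodes, target):
    count = 0
    target_id = target.id             # hoist the invariant lookup out of the loop
    for node in nodes:
        if node.id == target_id:      # one attribute lookup per iteration
            count += 1
    return count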

Are you swapping a lot? If your dataset is so large that it does not fit in available RAM, I guess you may experience some kind of I/O contention related to virtual memory fetches.

Are you running Linux? If so, could you post a vmstat of your machine while running your program? Send us the output of something like:

vmstat 10 100
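If you would rather check from inside the process itself, here is a rough sketch using the standard resource module (Unix only; which fields are populated depends on the platform). Major page faults are the ones that hit the disk, so a rising count usually means swapping:

import resource

def report_page_faults(label):
    # ru_majflt counts page faults that required disk I/O (paging/swapping);
    # ru_minflt counts the cheap ones served from memory.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print "%s: major page faults = %d, minor = %d" % (
        label, usage.ru_majflt, usage.ru_minflt)

report_page_faults("before")
# ... run the expensive traversal here ...
report_page_faults("after")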

Good luck!

UPDATE (from comments by OP)

I suggested playing with sys.setcheckinterval and enabling/disabling the GC. The rationale is that, for this particular case (a huge number of instances), the default GC checks are somewhat expensive and their default interval fires far too often.

Yes, I had previously played with sys.setcheckinterval. I changed it to 1000 (from its default of 100), but it didn't make any measurable difference. Disabling garbage collection has helped - thanks. This has been the biggest speedup so far - saving about 20% (171 minutes for the whole run, down to 135 minutes). I'm not sure what the error bars are on that, but it must be a statistically significant improvement. – Adam Nellis Feb 9 at 15:10

My guess:

I think the Python GC is based on reference counting. From time to time it will check the reference counts of your instances; since you are traversing these huge in-memory structures, in your particular case the default GC frequency (1000 cycles?) fires far too often - a huge waste. – Yours Truly Feb 10 at 2:06
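As a rough sketch of what disabling the collector around the hot section can look like (only gc.disable() / gc.enable() come from the discussion above; the workload shown is a made-up placeholder):

import gc

def with_gc_disabled(func, *args):
    # Run func with the cyclic garbage collector switched off.
    # Reference counting still frees most objects immediately;
    # only cycle detection is deferred until we re-enable it.
    gc.disable()
    try:
        return func(*args)
    finally:
        gc.enable()
        gc.collect()    # sweep up any cycles created while it was off

# Hypothetical usage:
def build_many_nodes(n):
    return [object() for _ in range(n)]

nodes = with_gc_disabled(build_many_nodes, 1000000)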


Have you considered Pyrex / Cython?

It compiles Python to C and then to a compiled extension module (.pyd/.so) automatically, so it might speed things up a fair bit without much work.
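As a minimal sketch of how that can be wired up (the module name shortest_path is an assumption; this uses the classic Cython distutils integration):

# setup.py - build shortest_path.pyx into a compiled extension module
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("shortest_path", ["shortest_path.pyx"])],
)

Then

python setup.py build_ext --inplace

produces the extension, which can be imported as usual. Adding cdef type declarations to the hot spots is usually where most of the speedup comes from.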


This would require a fair amount of work, but... you might consider running Floyd-Warshall on a GPU. There has been a lot of work done on making Floyd-Warshall run very efficiently on a GPU. A quick Google search yields:

http://cvit.iiit.ac.in/papers/Pawan07accelerating.pdf

http://my.safaribooksonline.com/book/programming/graphics/9780321545411/gpu-computing-for-protein-structure-prediction/ch43lev1sec2#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODAzMjE1NDU0MTEvNDg3

http://www.gpucomputing.net/?q=node/1203

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter43.html

Even though Floyd-Warshall, as implemented in Python, was slower by an order of magnitude, a good GPU version on a powerful GPU might still significantly outperform your new Python code.

Here's an anecdote. I had a short, simple, compute-intensive piece of code that did something similar to a Hough accumulation. In Python, as optimized as I could get it, it took ~7 s on a speedy i7. I then wrote a completely non-optimized GPU version; it took ~0.002 s on an Nvidia GTX 480. YMMV, but for anything significantly parallel, the GPU is likely to be a long-term winner, and since it's a well-studied algorithm, you should be able to utilize existing highly-tuned code.

For the Python / GPU bridge, I'd recommend PyCUDA or PyOpenCL.
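To give a feel for the shape of such code, here is a rough, untested PyCUDA sketch of the simple (non-blocked) Floyd-Warshall formulation - one kernel launch per intermediate vertex k. The matrix layout, block size, and function names are assumptions; the papers above describe far more heavily tuned blocked variants:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void fw_step(float *dist, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float through_k = dist[i * n + k] + dist[k * n + j];
        if (through_k < dist[i * n + j])
            dist[i * n + j] = through_k;
    }
}
""")
fw_step = mod.get_function("fw_step")

def gpu_floyd_warshall(dist):
    # dist: n x n float32 matrix of edge weights, with numpy.inf for missing edges.
    n = dist.shape[0]
    dist = np.ascontiguousarray(dist, dtype=np.float32)
    dist_gpu = drv.mem_alloc(dist.nbytes)
    drv.memcpy_htod(dist_gpu, dist)
    block = (16, 16, 1)
    grid = ((n + 15) // 16, (n + 15) // 16)
    for k in range(n):                      # one relaxation pass per intermediate vertex
        fw_step(dist_gpu, np.int32(n), np.int32(k), block=block, grid=grid)
    result = np.empty_like(dist)
    drv.memcpy_dtoh(result, dist_gpu)
    return result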