Python slow read performance issue


I will focus on only one of your examples, because the others should behave analogously.

What I think may matter in this situation is the read-ahead feature (or possibly some other technique related to it).

Let's consider the following example:

I created 1000 XML files in directory "1" (named 1.xml to 1000.xml) with the dd command, as you did, and then copied the original directory 1 to directory 2:

    $ mkdir 1
    $ cd 1
    $ for i in {1..1000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done
    $ cd ..
    $ cp -r 1 2
    $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    $ time strace -f -c -o trace.copy2c cp -r 2 2copy
    $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    $ time strace -f -c -o trace.copy1c cp -r 1 1copy
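If you prefer to build the test directory from Python instead of dd, a minimal sketch could look like this. The function name `create_files` and its parameters are my own, not part of your script; it only reproduces the "create in numeric order" step, not the cache dropping or the strace run.

```python
import os


def create_files(directory, count=1000, size=10 * 1024):
    """Create 1.xml .. <count>.xml in numeric order, each with `size` random bytes."""
    os.makedirs(directory, exist_ok=True)
    for i in range(1, count + 1):
        path = os.path.join(directory, f"{i}.xml")
        with open(path, "wb") as f:
            # Same idea as: dd if=/dev/urandom of=$i.xml bs=1K count=10
            f.write(os.urandom(size))
```

Calling `create_files("1")` should give a directory comparable to the one made by the dd loop.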

In the next step I traced the cp command with strace to find out in what order the data is copied.

cp does it in the following order (only the first 4 files are shown, because I saw that the second read from the original directory is more time-consuming than the second read from the copied directory):

    100.xml
    150.xml
    58.xml
    64.xml
    ...

(this is the order in my example)
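The same kind of ordering matters for a Python test script. A small sketch, assuming your script walks the directory with `os.scandir` or `os.listdir`: the names come back in whatever order the filesystem returns them, which is generally neither alphabetical nor numeric. The helper names below are hypothetical.

```python
import os


def listing_order(directory):
    """Return file names in the order the OS hands them back (unsorted)."""
    with os.scandir(directory) as entries:
        return [entry.name for entry in entries]


def numeric_order(names):
    """Sort names like '58.xml' by their numeric stem (the dd creation order here)."""
    return sorted(names, key=lambda name: int(os.path.splitext(name)[0]))
```

Comparing `listing_order("1")` with `numeric_order(listing_order("1"))` shows how far the read order is from the creation order on your filesystem.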

Now take a look at the filesystem blocks used by these files (debugfs output, ext3 filesystem):

Original directory:

    BLOCKS:
    (0-9):63038-63047 100.xml
    (0-9):64091-64100 150.xml
    (0-9):57926-57935 58.xml
    (0-9):60959-60968 64.xml
    ....

Copied directory:

    BLOCKS:
    (0-9):65791-65800 100.xml
    (0-9):65801-65810 150.xml
    (0-9):65811-65820 58.xml
    (0-9):65821-65830 64.xml

....

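As you can see, in the copied directory the blocks are adjacent. This means that while the first file, 100.xml, is being read, read-ahead (done by the disk controller or configured in the system settings) can improve performance, because the blocks of the following files are likely to be fetched along with it.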
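To make the difference concrete, here is a short sketch that takes the debugfs extents listed above, in the order cp reads the files, and computes the gap in blocks between the end of one file and the start of the next. The helper names are mine, and the regular expression only handles the `(a-b):first-last` range form shown here.

```python
import re

# Matches a debugfs extent such as "(0-9):63038-63047".
EXTENT_RE = re.compile(r"\((\d+)-(\d+)\):(\d+)-(\d+)")


def parse_extent(text):
    """Return (first_block, last_block) of a single-extent debugfs entry."""
    match = EXTENT_RE.search(text)
    if match is None:
        raise ValueError(f"not a block range: {text!r}")
    return int(match.group(3)), int(match.group(4))


def gaps_between(extents):
    """Blocks skipped between consecutive files; 0 means physically adjacent."""
    blocks = [parse_extent(e) for e in extents]
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(blocks, blocks[1:])]


original = ["(0-9):63038-63047", "(0-9):64091-64100",
            "(0-9):57926-57935", "(0-9):60959-60968"]
copied = ["(0-9):65791-65800", "(0-9):65801-65810",
          "(0-9):65811-65820", "(0-9):65821-65830"]

print(gaps_between(original))  # [1043, -6175, 3023]
print(gaps_between(copied))    # [0, 0, 0]
```

A negative gap means the next file sits before the current one on disk, so the read has to seek backwards.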

dd creates the files in the order 1.xml to 1000.xml, but cp copies them in a different order (100.xml, 150.xml, 58.xml, 64.xml). So when you execute:

cp -r 1 1copy

to copy this directory to another one, the blocks of the files being copied are not adjacent, so reading those files takes more time.

When you copy a directory that was itself created by cp (so the files were not created by dd), the files are adjacent on disk, so running:

cp -r 2 2copy 

to make a copy of the copy is faster.
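You can check the same effect from Python without cp. The sketch below simply reads every file of a directory in a given order and measures the wall-clock time; `time_reads` is a hypothetical helper, and the comparison is only meaningful if you drop the page cache before each run, as in the commands at the top.

```python
import os
import time


def time_reads(directory, names):
    """Read the given files in order; return (total_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    for name in names:
        with open(os.path.join(directory, name), "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start
```

Running it once with the names in listing order and once in numeric order, on directory 1 and then on directory 2, should show whether the on-disk layout explains the difference you measured.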

Summary: to compare Python and Perl performance you should use the same directory (or two directories that were both created by cp). You can also open the files with the O_DIRECT flag, which bypasses the kernel buffers and reads the data directly from disk.

Please remember that results can differ depending on the kernel, system, disk controller, system settings, filesystem and so on.
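For the O_DIRECT part, a minimal Linux-only sketch in Python follows. O_DIRECT requires the buffer, offset and length to be aligned, so the code reads into an anonymous mmap (which is page-aligned) in fixed-size chunks. The name `read_direct` and the 4096-byte block size are assumptions of mine; some filesystems reject O_DIRECT altogether, in which case `os.open` or the read raises OSError.

```python
import mmap
import os


def read_direct(path, block_size=4096):
    """Read a whole file with O_DIRECT (Linux only), bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # Anonymous mmap memory is page-aligned, which O_DIRECT needs.
        buf = mmap.mmap(-1, block_size)
        chunks = []
        try:
            while True:
                n = os.readv(fd, [buf])
                chunks.append(buf[:n])
                if n < block_size:
                    # A short (or empty) read means end of file.
                    break
        finally:
            buf.close()
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Reading this way is slower for repeated reads, but it takes the page cache out of the measurement, so you do not need to drop caches between runs.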

Additions:

[debugfs]

    [root@dhcppc3 test]# debugfs /dev/sda1
    debugfs 1.39 (29-May-2006)
    debugfs:  cd test
    debugfs:  stat test.xml
    Inode: 24102   Type: regular    Mode:  0644   Flags: 0x0   Generation: 3385884179
    User:     0   Group:     0   Size: 4
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 2
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x543274bf -- Mon Oct  6 06:53:51 2014
    atime: 0x543274be -- Mon Oct  6 06:53:50 2014
    mtime: 0x543274bf -- Mon Oct  6 06:53:51 2014
    BLOCKS:
    (0):29935
    TOTAL: 1
    debugfs: