
Pandas: Why should appending to a DataFrame of floats and ints be slower than if it's full of NaN?


As described in this blog post by the main author of pandas, a pandas DataFrame is internally made up of "blocks". A block is a group of columns all having the same datatype. Each block is stored as a numpy array of its block type. So if you have five int columns and then five float columns, there will be an int block and a float block.
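You can see this grouping for yourself by peeking at the DataFrame's internal BlockManager. Note that `_mgr` (called `_data` in older pandas versions) is a private API, so this is purely illustrative and may change between releases:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# The BlockManager groups same-dtype columns into blocks: here the two
# int columns form one block and the two float columns form another.
for block in df._mgr.blocks:
    print(block.dtype, block.shape)
```

With two int columns and two float columns, this prints exactly two blocks, each holding a 2×3 numpy array.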

Appending to a multi-type DataFrame requires appending to each of the underlying numpy arrays, and appending to a numpy array is slow because it requires allocating a whole new array and copying the data over. So it makes sense that appending to a multi-type DataFrame is slower: if all the columns share one type, only one new numpy array has to be created, but if they have different types, several new arrays have to be created.
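A quick way to confirm that a numpy "append" is really a fresh allocation rather than an in-place resize:

```python
import numpy as np

a = np.arange(1_000_000)
b = np.append(a, [42])  # returns a brand-new array; `a` is untouched

# The result has the new size, and shares no memory with the original,
# so every element was copied.
print(b.shape[0])              # one more than a.shape[0]
print(np.shares_memory(a, b))  # False
```

Doing this in a loop makes the total work quadratic: each "append" recopies everything accumulated so far.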

It is true that keeping the data all one type will speed this up. However, I would say the main conclusion is not "if efficiency is important, keep all your columns the same type". The conclusion is "if efficiency is important, do not append to your arrays/DataFrames at all".

This is just how numpy works. The slowest part of working with numpy arrays is creating them in the first place. They have a fixed size, so when you "append" to one, you are really allocating an entirely new array of the new size and copying the old data into it, which is slow. If you absolutely must append, you can ease the pain somewhat, for example by keeping the dtypes uniform so only one array has to be rebuilt. But ultimately you just have to accept that any time you append to a DataFrame (or to a numpy array in general), you will likely pay a substantial performance hit.
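To illustrate the alternative, the usual pattern is to accumulate rows in a plain Python list and build the DataFrame once at the end. (`DataFrame.append` was removed in pandas 2.0, so the slow version below uses `pd.concat`; the column name and row count are arbitrary.)

```python
import pandas as pd

# Slow: growing the DataFrame one row at a time.
# Every concat allocates new arrays and copies all existing data.
df_slow = pd.DataFrame({"x": [0]})
for i in range(1, 100):
    df_slow = pd.concat([df_slow, pd.DataFrame({"x": [i]})],
                        ignore_index=True)

# Fast: accumulate in a cheap-to-extend list, build the DataFrame once.
rows = [{"x": i} for i in range(100)]
df_fast = pd.DataFrame(rows)
```

Both produce the same result, but the second version allocates each column's array exactly once instead of a hundred times.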