
Pandas: Why should appending to a DataFrame of floats and ints be slower than if it's full of NaN?


As described in this blog post by the main author of pandas, a pandas DataFrame is internally made up of "blocks". A block is a group of columns all having the same datatype. Each block is stored as a numpy array of its block type. So if you have five int columns and then five float columns, there will be an int block and a float block.
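You can see this grouping for yourself by peeking at the DataFrame's internal BlockManager. Note that `_mgr` (called `_data` in older pandas versions) is a private API, so this is purely illustrative and may change between releases:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# The BlockManager groups same-dtype columns into blocks: here the two
# int columns form one block and the two float columns form another.
for block in df._mgr.blocks:
    print(block.dtype, block.shape)
```

With two int columns and two float columns, this prints exactly two blocks, each holding a 2×3 numpy array.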

Appending to a multi-type DataFrame requires appending to each of the underlying numpy arrays, and appending to a numpy array is slow because it requires allocating a whole new array and copying the data over. So it makes sense that appending to a multi-type DataFrame is slower: if all the columns share one type, only one new numpy array has to be created, but if they have different types, several new arrays have to be created.
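A quick way to confirm that a numpy "append" is really a fresh allocation rather than an in-place resize:

```python
import numpy as np

a = np.arange(1_000_000)
b = np.append(a, [42])  # returns a brand-new array; `a` is untouched

# The result has the new size, and shares no memory with the original,
# so every element was copied.
print(b.shape[0])              # one more than a.shape[0]
print(np.shares_memory(a, b))  # False
```

Doing this in a loop makes the total work quadratic: each "append" recopies everything accumulated so far.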

It is true that keeping the data all one type will speed this up. However, I would say the main conclusion is not "if efficiency is important, keep all your columns the same type". The conclusion is "if efficiency is important, do not append to your arrays/DataFrames at all".

This is just how numpy works. The slowest part of working with numpy arrays is creating them in the first place. They have a fixed size, so when you "append" to one, you are really allocating an entirely new array of the new size and copying the old data into it, which is slow. If you absolutely must append, you can ease the pain somewhat, for example by keeping the dtypes uniform so only one array has to be rebuilt. But ultimately you just have to accept that any time you append to a DataFrame (or to a numpy array in general), you will likely pay a substantial performance hit.
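To illustrate the alternative, the usual pattern is to accumulate rows in a plain Python list and build the DataFrame once at the end. (`DataFrame.append` was removed in pandas 2.0, so the slow version below uses `pd.concat`; the column name and row count are arbitrary.)

```python
import pandas as pd

# Slow: growing the DataFrame one row at a time.
# Every concat allocates new arrays and copies all existing data.
df_slow = pd.DataFrame({"x": [0]})
for i in range(1, 100):
    df_slow = pd.concat([df_slow, pd.DataFrame({"x": [i]})],
                        ignore_index=True)

# Fast: accumulate in a cheap-to-extend list, build the DataFrame once.
rows = [{"x": i} for i in range(100)]
df_fast = pd.DataFrame(rows)
```

Both produce the same result, but the second version allocates each column's array exactly once instead of a hundred times.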