What is the most efficient way to loop through dataframes with pandas?

python pandas performance dataframe for-loop

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():
    # do some logic here

Or, if you want it faster, use itertuples().

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
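As a minimal sketch of the difference between the two calls, here they are on a small made-up frame (the column names `price` and `qty` are illustrative and not from the question). `iterrows()` yields `(index, Series)` pairs, while `itertuples()` yields namedtuples whose first field is `Index`:

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({"price": [10.0, 12.5, 9.0], "qty": [3, 1, 4]})

# iterrows(): each row arrives as a pandas Series, looked up by column label.
# Because a Series has a single dtype, the integer qty is upcast to float here.
totals_iterrows = []
for index, row in df.iterrows():
    totals_iterrows.append(row["price"] * row["qty"])

# itertuples(): each row arrives as a namedtuple; columns are attributes
# and the index is available as row.Index.
totals_itertuples = []
for row in df.itertuples():
    totals_itertuples.append(row.price * row.qty)

# The vectorized equivalent needs no Python-level loop at all.
totals_vectorized = (df["price"] * df["qty"]).tolist()
```

All three lists hold the same values; only the amount of per-row overhead differs.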


Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:]/close[:-1] - 1

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
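To make the comparison concrete, here is a small runnable sketch with made-up prices (the values in `close` are an assumption for illustration). Dividing the two slices gives the ratio of each value to the previous one; subtracting 1 turns that ratio into a fractional change. The loop version is included only to show that both produce the same numbers:

```python
import numpy as np

# Hypothetical closing prices, made up for illustration.
close = np.array([100.0, 102.0, 99.96, 104.958])

# Vectorized: divide every price by the previous day's price in one statement.
ratio = close[1:] / close[:-1]
# Subtracting 1 converts the day-over-day ratio into a fractional change.
pct_change = ratio - 1

# Loop version, for comparison only: same arithmetic, one element at a time.
pct_change_loop = []
for i in range(1, len(close)):
    pct_change_loop.append(close[i] / close[i - 1] - 1)

same_result = np.allclose(pct_change, pct_change_loop)
```

On an array this small the two are indistinguishable in speed; the gap only shows up as the array grows.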


As mentioned above, pandas objects are most efficient when you process the whole array at once. However, for those who, like me, really do need to loop through a pandas DataFrame, I found at least three ways to do it. I ran a short test to see which of the three takes the least time.

import time

import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []  # elapsed seconds for each method, in order

# 1. iterrows()
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

# 2. itertuples()
C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

# 3. zip over the columns
C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me.
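For a somewhat more robust measurement, the standard-library `timeit` module can repeat each candidate several times. The sketch below is one possible setup, not the method used for the numbers above; the helper names `with_itertuples` and `with_zip` and the `repeat`/`number` values are arbitrary choices:

```python
import timeit

import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})


def with_itertuples():
    # Positions 1 and 2 are the columns; position 0 is the index.
    return [(ir[1], ir[2]) for ir in t.itertuples()]


def with_zip():
    return list(zip(t['a'], t['b']))


# repeat=5, number=10: five independent measurements of ten calls each.
# Taking the minimum reduces the influence of background noise.
results = {}
for func in (with_itertuples, with_zip):
    timings = timeit.repeat(func, repeat=5, number=10)
    results[func.__name__] = min(timings) / 10  # seconds per call

for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s per call")
```

The absolute numbers depend on the machine and the pandas version, so only the relative ordering is worth comparing.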

Here are some pros and cons IMHO:

.iterrows(): returns the index and the row items in separate variables, but is significantly slower
.itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
zip: quickest, but gives no access to the index of the row
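If the index is needed with zip, or unwanted with itertuples, both can be adjusted. A small sketch, assuming a made-up five-row frame with a string index so the index is easy to tell apart from the data:

```python
import pandas as pd

# Hypothetical frame with a string index, made up for illustration.
t = pd.DataFrame({'a': range(0, 5), 'b': range(10, 15)}, index=list("vwxyz"))

# itertuples(): the index is field 0 and is also reachable by name as .Index;
# the columns are reachable as attributes.
first = next(t.itertuples())

# itertuples(index=False) drops the index, so the columns start at position 0.
no_index = [tuple(row) for row in t.itertuples(index=False)]

# zip can still carry the index if it is zipped in explicitly.
with_index = list(zip(t.index, t['a'], t['b']))
```

Here `first.Index` is `'v'`, `no_index[0]` is `(0, 10)`, and `with_index[0]` is `('v', 0, 10)`.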

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (run on a MacBook Pro: 2.4 GHz 8-core Intel Core i9, 32 GB of 2667 MHz DDR4):

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output
Python 3.7.9 (default, Oct 13 2020, 10:58:24) [Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600
