What is the most efficient way to loop through dataframes with pandas? What is the most efficient way to loop through dataframes with pandas? python python

What is the most efficient way to loop through dataframes with pandas?


The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():    # do some logic here

Or, if you want it faster use itertuples()

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.


Pandas is based on NumPy arrays.The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:]/close[:-1]

This computes the entire array of percent changes as one statement, instead of

pct_change = []for row in close:    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, andthink about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.


Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})B = []C = []A = time.time()for i,r in t.iterrows():    C.append((r['a'], r['b']))B.append(time.time()-A)C = []A = time.time()for ir in t.itertuples():    C.append((ir[1], ir[2]))    B.append(time.time()-A)C = []A = time.time()for r in zip(t['a'], t['b']):    C.append((r[0], r[1]))B.append(time.time()-A)print B

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me.

Here are some pros and cons IMHO:

  • .iterrows(): return index and row items in separate variables, but significantly slower
  • .itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index
  • zip: quickest, but no access to index of the row

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4)

import sysimport tqdmimport timeimport pandas as pdB = []t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})for _ in tqdm.tqdm(range(10)):    C = []    A = time.time()    for i,r in t.iterrows():        C.append((r['a'], r['b']))    B.append({"method": "iterrows", "time": time.time()-A})    C = []    A = time.time()    for ir in t.itertuples():        C.append((ir[1], ir[2]))    B.append({"method": "itertuples", "time": time.time()-A})    C = []    A = time.time()    for r in zip(t['a'], t['b']):        C.append((r[0], r[1]))    B.append({"method": "zip", "time": time.time()-A})    C = []    A = time.time()    for r in zip(*t.to_dict("list").values()):        C.append((r[0], r[1]))    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})    C = []    A = time.time()    for r in t.to_dict("records"):        C.append((r["a"], r["b"]))    B.append({"method": "to_dict('records')", "time": time.time()-A})    A = time.time()    t.agg(tuple, axis=1).tolist()    B.append({"method": "agg", "time": time.time()-A})    A = time.time()    t.apply(tuple, axis=1).tolist()    B.append({"method": "apply", "time": time.time()-A})print(f'Python {sys.version} on {sys.platform}')print(f"Pandas version {pd.__version__}")print(    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean"))## OutputPython 3.7.9 (default, Oct 13 2020, 10:58:24) [Clang 12.0.0 (clang-1200.0.32.2)] on darwinPandas version 1.1.4                           mean       stdmethod                                   zip + to_dict('list')  0.002353  0.000168zip                    0.003381  0.000250itertuples             0.007659  0.000728to_dict('records')     0.025838  0.001458agg                    0.066391  0.007044apply                  0.067753  0.006997iterrows               0.647215  0.019600