# What is the most efficient way to loop through dataframes with pandas?

The newest versions of pandas now include a built-in function for iterating over rows.

`for index, row in df.iterrows(): # do some logic here`

Or, if you want it faster use `itertuples()`

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

Pandas is based on NumPy arrays.The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if `close`

is a 1-d array, and you want the day-over-day percent change,

`pct_change = close[1:]/close[:-1]`

This computes the entire array of percent changes as one statement, instead of

`pct_change = []for row in close: pct_change.append(...)`

So try to avoid the Python loop `for i, row in enumerate(...)`

entirely, andthink about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.

Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

`t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})B = []C = []A = time.time()for i,r in t.iterrows(): C.append((r['a'], r['b']))B.append(time.time()-A)C = []A = time.time()for ir in t.itertuples(): C.append((ir[1], ir[2])) B.append(time.time()-A)C = []A = time.time()for r in zip(t['a'], t['b']): C.append((r[0], r[1]))B.append(time.time()-A)print B`

Result:

`[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]`

This is probably not the best way to measure the time consumption but it's quick for me.

Here are some pros and cons IMHO:

- .iterrows(): return index and row items in separate variables, but significantly slower
- .itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index
- zip: quickest, but no access to index of the row

## EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4)

`import sysimport tqdmimport timeimport pandas as pdB = []t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})for _ in tqdm.tqdm(range(10)): C = [] A = time.time() for i,r in t.iterrows(): C.append((r['a'], r['b'])) B.append({"method": "iterrows", "time": time.time()-A}) C = [] A = time.time() for ir in t.itertuples(): C.append((ir[1], ir[2])) B.append({"method": "itertuples", "time": time.time()-A}) C = [] A = time.time() for r in zip(t['a'], t['b']): C.append((r[0], r[1])) B.append({"method": "zip", "time": time.time()-A}) C = [] A = time.time() for r in zip(*t.to_dict("list").values()): C.append((r[0], r[1])) B.append({"method": "zip + to_dict('list')", "time": time.time()-A}) C = [] A = time.time() for r in t.to_dict("records"): C.append((r["a"], r["b"])) B.append({"method": "to_dict('records')", "time": time.time()-A}) A = time.time() t.agg(tuple, axis=1).tolist() B.append({"method": "agg", "time": time.time()-A}) A = time.time() t.apply(tuple, axis=1).tolist() B.append({"method": "apply", "time": time.time()-A})print(f'Python {sys.version} on {sys.platform}')print(f"Pandas version {pd.__version__}")print( pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean"))## OutputPython 3.7.9 (default, Oct 13 2020, 10:58:24) [Clang 12.0.0 (clang-1200.0.32.2)] on darwinPandas version 1.1.4 mean stdmethod zip + to_dict('list') 0.002353 0.000168zip 0.003381 0.000250itertuples 0.007659 0.000728to_dict('records') 0.025838 0.001458agg 0.066391 0.007044apply 0.067753 0.006997iterrows 0.647215 0.019600`