Progress bar for pandas .corr() method


Note: This is not really a practical answer because of the increased computation time. From what I have measured, the slowdown seems to be dramatic for small dataframes (up to a factor of 40), whereas for large dataframes it is around a factor of 2 to 3.

Maybe someone can find a more efficient implementation of the custom function calc_corr_coefs.

I managed to show the progress with Python's tqdm module; however, this required using the df.progress_apply() function that tqdm adds to pandas. Here is some sample code:

import time

from tqdm import tqdm
import numpy as np
import pandas as pd


def calc_corr_coefs(s: pd.Series, df_all: pd.DataFrame) -> pd.Series:
    """
    calculates the correlation coefficient between one series and all columns in the dataframe
    :param s:       pd.Series; the column from which you want to calculate the correlation with all other columns
    :param df_all:  pd.DataFrame; the complete dataframe
    :return:        a series with all the correlation coefficients
    """
    corr_coef = {}
    for col in df_all:
        # corr_coef[col] = s.corr(df_all[col])
        corr_coef[col] = np.corrcoef(s.values, df_all[col].values)[0, 1]
    return pd.Series(data=corr_coef)


df = pd.DataFrame(np.random.randint(0, 1000, (10000, 200)))

t0 = time.perf_counter()
# first use the basic df.corr()
df_corr_pd = df.corr()
t1 = time.perf_counter()
print(f'base df.corr(): {t1 - t0} s')

# compare to df.progress_apply()
tqdm.pandas(ncols=100)
df_corr_cust = df.progress_apply(calc_corr_coefs, axis=0, args=(df,))
t2 = time.perf_counter()
print(f'with progress bar: {t2 - t1} s')
print(f'factor: {(t2 - t1) / (t1 - t0)}')
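One possible direction for reducing the overhead, as a rough sketch rather than a benchmarked solution: standardize all columns once, then compute the correlations for a block of columns at a time with a single matrix product, wrapping the loop over blocks in tqdm. The names corr_with_progress and chunk_size are made up for this illustration. The sketch assumes every column is numeric, contains no NaN and is not constant; unlike df.corr(), it does no pairwise NaN handling. Whether it is actually faster on a given dataframe is an assumption that should be measured.

```python
import numpy as np
import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm (no progress bar then).
    def tqdm(iterable, **kwargs):
        return iterable


def corr_with_progress(df: pd.DataFrame, chunk_size: int = 10) -> pd.DataFrame:
    """Pearson correlation matrix, computed block by block with a progress bar.

    Hypothetical helper. Assumes numeric columns without NaN and with
    non-zero variance.
    """
    x = df.to_numpy(dtype=float)
    n_rows, n_cols = x.shape

    # Standardize every column once (population standard deviation, ddof=0).
    z = (x - x.mean(axis=0)) / x.std(axis=0)

    out = np.empty((n_cols, n_cols))
    for start in tqdm(range(0, n_cols, chunk_size), ncols=100):
        stop = start + chunk_size
        # Correlation of this block of columns with all columns:
        # r = z_block^T @ z / n_rows
        out[start:stop] = z[:, start:stop].T @ z / n_rows

    return pd.DataFrame(out, index=df.columns, columns=df.columns)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, (10000, 200)))
df_corr_chunked = corr_with_progress(df)
```

A smaller chunk_size updates the bar more often at the cost of more, smaller matrix products; the result can be checked against df.corr() with np.allclose.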

I hope this helps and someone will be able to speed up the implementation.