Reshaping a pandas correlation matrix
You need add reset_index
:
#reset columns and index names df = df.rename_axis(None).rename_axis(None, axis=1)#if pandas version below 0.18.0#df.columns.name = None#df.index.name = Noneprint (df) aaa bbb ccc ddd eeeaaa 1.000000 0.346099 0.131874 -0.150910 0.177589bbb 0.346099 1.000000 0.177308 -0.384893 0.301150ccc 0.131874 0.177308 1.000000 -0.176995 0.258812ddd -0.150910 -0.384893 -0.176995 1.000000 -0.310137eee 0.177589 0.301150 0.258812 -0.310137 1.000000
df1 = df.stack().reset_index()#set column namesdf1.columns = ['a','b','c']print (df1) a b c0 aaa aaa 1.0000001 aaa bbb 0.3460992 aaa ccc 0.1318743 aaa ddd -0.1509104 aaa eee 0.1775895 bbb aaa 0.3460996 bbb bbb 1.0000007 bbb ccc 0.1773088 bbb ddd -0.3848939 bbb eee 0.30115010 ccc aaa 0.13187411 ccc bbb 0.17730812 ccc ccc 1.00000013 ccc ddd -0.17699514 ccc eee 0.25881215 ddd aaa -0.15091016 ddd bbb -0.38489317 ddd ccc -0.17699518 ddd ddd 1.00000019 ddd eee -0.31013720 eee aaa 0.17758921 eee bbb 0.30115022 eee ccc 0.25881223 eee ddd -0.31013724 eee eee 1.000000
Use the code below to (a) reshape the correlation matrix, (b) remove duplicate rows (e.g., {aaa, bbb}
and {bbb, aaa}
), and (c) remove rows that contain the same variable in the first two columns (e.g., {aaa, aaa}
):
# calculate the correlation matrix and reshapedf_corr = df.corr().stack().reset_index()# rename the columnsdf_corr.columns = ['FEATURE_1', 'FEATURE_2', 'CORRELATION']# create a mask to identify rows with duplicate features as mentioned abovemask_dups = (df_corr[['FEATURE_1', 'FEATURE_2']].apply(frozenset, axis=1).duplicated()) | (df_corr['FEATURE_1']==df_corr['FEATURE_2']) # apply the mask to clean the correlation dataframedf_corr = df_corr[~mask_dups]
This will generate an output like this:
FEATURE_1 FEATURE_2 CORRELATION0 aaa bbb 0.3460991 aaa ccc 0.1318742 aaa ddd -0.1509103 aaa eee 0.1775894 bbb ccc 0.1773085 bbb ddd -0.3848936 bbb eee 0.3011507 ccc ddd -0.1769958 ccc eee 0.2588129 ddd eee -0.310137