Correlation among multiple categorical variables (Pandas)


You can use pd.factorize to encode each column as integer codes and then call corr on the result:

    df.apply(lambda x: pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
    Out[32]:
         a    c    d
    a  1.0  1.0  1.0
    c  1.0  1.0  1.0
    d  1.0  1.0  1.0

Data input

    df = pd.DataFrame({'a': ['a', 'b', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']})
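To see why every entry in that correlation matrix is 1.0, it helps to look at what pd.factorize actually returns. The following is a minimal sketch using the same input data; the variable names `codes`, `uniques`, `encoded` and `corr` are chosen here for illustration only. Note that the integer codes are assigned in order of first appearance, so a Pearson correlation on them treats the categories as if they were ordered, which may or may not be appropriate for your data.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']})

# pd.factorize returns a tuple (codes, uniques); codes are integers
# assigned in order of first appearance of each category.
codes, uniques = pd.factorize(df['a'])
print(codes)          # [0 1 2]
print(list(uniques))  # ['a', 'b', 'c']

# Every column encodes to the same codes, so all pairwise
# Pearson correlations come out as 1.0.
encoded = df.apply(lambda x: pd.factorize(x)[0])
corr = encoded.corr(method='pearson', min_periods=1)
print(corr)
```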

Update

    from scipy.stats import chisquare
    df = df.apply(lambda x: pd.factorize(x)[0]) + 1
    pd.DataFrame([chisquare(df[x].values, f_exp=df.values.T, axis=1)[0] for x in df])
    Out[123]:
         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0
    3  0.0  0.0  0.0  0.0

Data input

    df = pd.DataFrame({'a': ['a', 'd', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c'], 'e': ['a', 'b', 'c']})
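The all-zero result in the update follows from the data: after factorizing and adding 1, every column becomes [1, 2, 3], so the observed and expected frequencies passed to chisquare are identical. A small sketch of that behaviour on hand-made arrays (the names `obs`, `same` and `diff` are illustrative, not from the answer):

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([1, 2, 3])

# Observed equals expected: every (O - E)^2 / E term is zero.
same = chisquare(obs, f_exp=np.array([1, 2, 3]))
print(same.statistic)  # 0.0

# Expected frequencies differ (but still sum to 6, like the observed ones):
# (1-2)^2/2 + (2-2)^2/2 + (3-2)^2/2 = 1.0
diff = chisquare(obs, f_exp=np.array([2, 2, 2]))
print(diff.statistic, diff.pvalue)
```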


It turns out the only solution I found is to iterate through all the factor*factor pairs.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    # every ordered pair of columns, including each column with itself
    factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

    chi2, p_values = [], []

    for f in factors_paired:
        if f[0] != f[1]:
            chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
            chi2.append(chitest[0])      # test statistic
            p_values.append(chitest[1])  # p-value
        else:  # for same factor pair
            chi2.append(0)
            p_values.append(0)

    n = len(df.columns)  # 23 in my data
    chi2 = np.array(chi2).reshape((n, n))  # shape it as a matrix
    chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values)  # then a df for convenience
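The raw chi-square statistic grows with the number of rows, so the matrix above is not on a 0 to 1 scale like a correlation coefficient. One common normalization is Cramér's V, sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below is an illustration and not part of the original answer: the helper `cramers_v` and the `example` frame are hypothetical, and Yates' continuity correction is turned off (`correction=False`) so that a perfect 2x2 association gives exactly 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical Series (hypothetical helper)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    if k == 0:  # one variable is constant, so association is undefined
        return np.nan
    return np.sqrt(chi2 / (n * k))

# Made-up data: 'b' is fully determined by 'a', 'c' is unrelated to 'a'.
example = pd.DataFrame({
    'a': ['x', 'x', 'y', 'y', 'z', 'z'],
    'b': ['p', 'p', 'q', 'q', 'r', 'r'],
    'c': ['u', 'v', 'u', 'v', 'u', 'v'],
})

cols = example.columns
v = pd.DataFrame(
    [[cramers_v(example[i], example[j]) for j in cols] for i in cols],
    index=cols, columns=cols,
)
print(v)
```

Here the entry for ('a', 'b') comes out as 1 and the entry for ('a', 'c') as 0, which matches how the example columns were constructed.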