Correlation among multiple categorical variables (Pandas)
You can use pd.factorize to encode each column as integer codes and then compute the correlation of those codes:

    df.apply(lambda x: pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
    Out[32]:
         a    c    d
    a  1.0  1.0  1.0
    c  1.0  1.0  1.0
    d  1.0  1.0  1.0

Data input

    import pandas as pd
    df = pd.DataFrame({'a': ['a', 'b', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']})

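To make the factorize step easier to follow, here is a minimal sketch on the same input data. The names `codes`, `uniques`, `encoded` and `corr` are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c'],
                   'c': ['a', 'b', 'c'],
                   'd': ['a', 'b', 'c']})

# pd.factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = pd.factorize(df['a'])

# Encode every column as integer codes, then correlate the codes
encoded = df.apply(lambda col: pd.factorize(col)[0])
corr = encoded.corr(method='pearson', min_periods=1)
```

Because the codes simply follow the order in which labels first appear, the Pearson value on factorized columns depends on that ordering, which is worth keeping in mind for purely nominal categories.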
Update

    from scipy.stats import chisquare
    df = df.apply(lambda x: pd.factorize(x)[0]) + 1
    pd.DataFrame([chisquare(df[x].values, f_exp=df.values.T, axis=1)[0] for x in df])
    Out[123]:
         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0
    3  0.0  0.0  0.0  0.0

    df = pd.DataFrame({'a': ['a', 'd', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c'], 'e': ['a', 'b', 'c']})

Turns out the only solution I found is to iterate through all the factor*factor pairs.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

    chi2, p_values = [], []
    for f in factors_paired:
        if f[0] != f[1]:
            chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
            chi2.append(chitest[0])
            p_values.append(chitest[1])
        else:  # for same factor pair
            chi2.append(0)
            p_values.append(0)

    n = len(df.columns)  # 23 factors in my data
    chi2 = np.array(chi2).reshape((n, n))  # shape it as a matrix
    chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values)  # then a df for convenience
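
The same pairwise idea can be sketched as a small reusable function that visits each unordered pair once and fills both halves of the matrix. This is only a sketch: `pairwise_chi2` is a hypothetical helper name, and the `example` frame is made-up data for illustration. As in the loop above, the diagonal is left at 0.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df):
    """Hypothetical helper: chi-square statistic and p-value for every column pair."""
    cols = list(df.columns)
    n = len(cols)
    chi2 = pd.DataFrame(np.zeros((n, n)), index=cols, columns=cols)
    p_values = pd.DataFrame(np.zeros((n, n)), index=cols, columns=cols)
    # Each unordered pair is tested once; the result is mirrored across the diagonal
    for a, b in combinations(cols, 2):
        stat, p, _, _ = chi2_contingency(pd.crosstab(df[a], df[b]))
        chi2.loc[a, b] = chi2.loc[b, a] = stat
        p_values.loc[a, b] = p_values.loc[b, a] = p
    return chi2, p_values


# Made-up example data, only for illustration
example = pd.DataFrame({
    'x': ['u', 'u', 'v', 'v', 'u', 'v'],
    'y': ['p', 'p', 'q', 'q', 'p', 'q'],
    'z': ['m', 'n', 'm', 'n', 'm', 'n'],
})
chi2, p_values = pairwise_chi2(example)
```

Here `chi2.loc['x', 'y']` holds the statistic for columns `x` and `y`, and `p_values` has the same layout.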