Using pandas, calculate Cramér's coefficient matrix


Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a bias-corrected version.

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorial-categorial association.

    Uses the correction from Bergsma, Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # grand total; works for a crosstab DataFrame or a 2-D array
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

Also note that the confusion matrix can be calculated with a built-in pandas method for categorical columns:

import pandas as pd

confusion_matrix = pd.crosstab(df[column1], df[column2])
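
Putting the two together, the coefficient matrix from the question title can be built by looping over all column pairs. A minimal sketch, assuming df is a DataFrame whose listed columns are categorical and cramers_corrected_stat is the function above; cramers_v_matrix and the example column names are just placeholders:

import itertools
import numpy as np
import pandas as pd

def cramers_v_matrix(df, columns):
    """Symmetric DataFrame of bias-corrected Cramér's V for every pair of columns."""
    result = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for col_a, col_b in itertools.combinations(columns, 2):
        v = cramers_corrected_stat(pd.crosstab(df[col_a], df[col_b]))
        result.loc[col_a, col_b] = v
        result.loc[col_b, col_a] = v
    return result

# e.g. cramers_v_matrix(df, ['nation', 'lang', 'metric'])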


A slightly modified function based on Ziggy Eunicien's answer, with two modifications added: 1) checking whether either variable is constant; 2) passing correction=False to ss.chi2_contingency(conf_matrix, correction=correct) when the confusion matrix is 2x2.

import scipy.stats as ss
import pandas as pd
import numpy as np

def cramers_corrected_stat(x, y):
    """Calculate Cramér's V statistic for categorial-categorial association.

    Uses the correction from Bergsma, Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    result = -1
    if len(x.value_counts()) == 1:
        print("First variable is constant")
    elif len(y.value_counts()) == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)
        # disable Yates' continuity correction for 2x2 tables (scipy applies it only there by default)
        if conf_matrix.shape[0] == 2:
            correct = False
        else:
            correct = True
        chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]
        n = sum(conf_matrix.sum())
        phi2 = chi2 / n
        r, k = conf_matrix.shape
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        result = np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
    return round(result, 6)
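
A quick usage sketch, since this version takes two raw columns rather than a pre-built confusion matrix; the DataFrame and column names below are made up for illustration:

df = pd.DataFrame({
    "nation": ["US", "US", "FR", "FR", "DE", "DE"],
    "lang":   ["en", "en", "fr", "fr", "de", "en"],
})
print(cramers_corrected_stat(df["nation"], df["lang"]))

Note that it returns -1 (and prints a message) when either column is constant, so the caller can filter those cases out.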


Cramér's V statistic lets you measure the association between two categorical features in one data set, so it fits your case.

To calculate the Cramér's V statistic you first need a confusion matrix, so the solution steps are:
1. Filter data for a single metric
2. Calculate confusion matrix
3. Calculate Cramers V statistic

Of course, you can do those steps in the nested loop provided in your post, but in your opening paragraph you mention only metrics as an outer parameter, so I am not sure you need both loops. Below I provide code for steps 2-3, because the filtering is simple and, as I mentioned, I am not sure exactly what you need.
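
For completeness, step 1 could look like the following minimal sketch; the 'metric' column and value are hypothetical, so adjust them to your actual schema:

# Step 1 (sketch): keep only the rows for one metric
data = df[df['metric'] == 'some_metric']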

Step 2. In the code below data is a pandas.DataFrame filtered by whatever you want in step 1.

import numpy as np

confusions = []
for nation in list_of_nations:
    for language in list_of_languges:
        # boolean Series must be combined with '&', not the 'and' keyword
        cond = (data['nation'] == nation) & (data['lang'] == language)
        confusions.append(cond.sum())

confusion_matrix = np.array(confusions).reshape(len(list_of_nations), len(list_of_languges))
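
If list_of_nations and list_of_languges cover exactly the values present in data, the same matrix can be obtained with the pd.crosstab call shown in the answer above; a minimal equivalent sketch:

import pandas as pd

# equivalent to the loop above (row/column order follows the sorted category values)
confusion_matrix = pd.crosstab(data['nation'], data['lang']).to_numpy()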

Step 3. In the code below confusion_matrix is a numpy.ndarray obtained in step 2.

import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

result = cramers_stat(confusion_matrix)

This code was tested on my data set, but I hope it works without changes in your case.