Feature/Variable importance after a PCA analysis

python machine-learning scikit-learn pca feature-selection

First of all, I assume that by features you mean the variables and not the samples/observations. In that case, you could create a biplot function that shows everything in one plot, as in the example below. In this example I am using the iris data.

Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.

Nice article by me here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

Overview:

PART 1: I explain how to check the importance of the features and how to plot a biplot.

PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

PART 1:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0] * 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0] * 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

plt.xlim(-1,1)
plt.ylim(-1,1)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()

# Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

Visualize what's going on using the biplot

Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude means higher importance).

Let's first see how much variance each PC explains.

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

PC1 explains about 73% of the variance and PC2 about 23%. Together, if we keep PC1 and PC2 only, they explain about 96%.
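As a minimal sketch of how the "how many PCs to keep" step could be automated, the cumulative sum of `explained_variance_ratio_` can be compared against a cutoff. The 0.95 threshold and the names `cumulative` and `n_keep` are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same preprocessing as above: standardize the iris data, then fit a full PCA
X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA().fit(X)

# Running total of the variance explained by PC1, PC1+PC2, ...
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Assumed cutoff: keep the smallest number of PCs reaching 95% of the variance
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print("components to keep:", n_keep)
```

The exact ratios may differ slightly between scikit-learn versions, but on the standardized iris data the first two components should clear the assumed cutoff.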

Now, let's find the most important features.

print(abs( pca.components_ ))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

Here, pca.components_ has shape [n_components, n_features]. Thus, looking at PC1 (the first principal component), which is the first row, [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important for PC1.

To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.
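The summary above can be sketched as a ranking of the iris feature names on each component by absolute loading. This is only an illustration under the same preprocessing as PART 1; the names `abs_loadings`, `order` and `ranked` are hypothetical helpers, not part of the original answer.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)

# Absolute loadings, shape [n_components, n_features]
abs_loadings = np.abs(pca.components_)

# For each component (row), feature indices sorted from largest to smallest loading
order = np.argsort(-abs_loadings, axis=1)

# Map the indices back to feature names, one ranked list per PC
ranked = {
    "PC{}".format(i + 1): [iris.feature_names[j] for j in order[i]]
    for i in range(order.shape[0])
}

for pc, names in ranked.items():
    print(pc, names)
```

Reading each list from left to right gives the features in decreasing order of contribution to that component.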

PART 2:

The important features are the ones that influence the components the most and thus have a large absolute value/score on the component.

To get the most important feature on each PC by name and save the result into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs = model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']

# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
0  PC0  e
1  PC1  d

So on the first PC (labeled PC0 in the output), the feature named e is the most important, and on the second PC (labeled PC1), it is d.


The pca library provides this functionality.

pip install pca

A demonstration of how to extract the feature importance is as follows:

# Import libraries
import numpy as np
import pandas as pd
from pca import pca

# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)

# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])

# Initialize
model = pca()

# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

#     PC      feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
# 5  PC6      f6
# 6  PC7      f7
# 7  PC8      f8
# 8  PC9      f9

Plot the explained variance

model.plot()

Make the biplot. The feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2 etc.

ax = model.biplot(n_feat=10, legend=False)

Biplot in 3d. Here we see the nice addition of the expected f3 in the plot in the z-direction.

ax = model.biplot3d(n_feat=10, legend=False)
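The pattern described above (f1 aligned with PC1, f2 with PC2) can also be checked numerically with plain scikit-learn. The following is a rough sketch, assuming unscaled data and a seeded generator; the names `rng`, `pca_raw` and `top_raw` are hypothetical and it does not use the pca library itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Seeded generator so the sketch is reproducible (an assumption, the demo above is unseeded)
rng = np.random.default_rng(0)

# Four independent integer features with decreasing ranges, hence decreasing variance
X = np.column_stack([rng.integers(0, high, 250) for high in (100, 50, 25, 10)])

# Fit PCA on the raw, unscaled data
pca_raw = PCA().fit(X)

# Index of the feature with the largest absolute loading on each component
top_raw = np.abs(pca_raw.components_).argmax(axis=1)

print(top_raw)
print(pca_raw.explained_variance_ratio_)
```

Because the data is not standardized here, the ordering simply follows the raw variance of each column. After scaling (as recommended in PART 1), all features would have unit variance and this ordering would no longer be expected.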


import numpy as np
import pandas as pd
from IPython.display import display  # display() is only available in IPython/Jupyter

# original_num_df is the original numeric dataframe
# pca is the fitted model
def create_importance_dataframe(pca, original_num_df):

    # Change pcs components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)

    # Assign columns
    importance_df.columns = original_num_df.columns

    # Change to absolute values
    importance_df = importance_df.apply(np.abs)

    # Transpose
    importance_df = importance_df.transpose()

    # Change column names again

    ## First get number of pcs
    num_pcs = importance_df.shape[1]

    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]

    ## Now rename
    importance_df.columns = new_columns

    # Return importance df
    return importance_df

# Call function to create importance df
importance_df = create_importance_dataframe(pca, original_num_df)

# Show first few rows
display(importance_df.head())

# Sort depending on PC of interest

## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print(), print('PC1 top 10 features are \n')
display(pc1_top_10_features)

## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print(), print('PC2 top 10 features are \n')
display(pc2_top_10_features)
