Implementing PCA with Numpy

Comparing the two methods

The issue is with not standardizing the data and the extraction of eigen vectors (principal axes). This function compares the two methods.

import numpy as npfrom sklearn import datasetsfrom sklearn.decomposition import PCAimport matplotlib.pyplot as pltfrom mpl_toolkits.mplot3d import Axes3D def pca_comparison(X, n_components, labels):  """X: Standardized dataset, observations on rows     n_components: dimensionality of the reduced space     labels: targets, for visualization  """  # numpy  # -----  # calculate eigen values  X_cov = np.cov(X.T)  e_values, e_vectors = np.linalg.eigh(X_cov)  # Sort eigenvalues and their eigenvectors in descending order  e_ind_order = np.flip(e_values.argsort())  e_values = e_values[e_ind_order]  e_vectors = e_vectors[:, e_ind_order] # note that we have to re-order the columns, not rows  # now we can project the dataset on to the eigen vectors (principal axes)  prin_comp_evd = X @ e_vectors  # sklearn  # -------  pca = PCA(n_components=n_components)  prin_comp_sklearn = pca.fit_transform(X)  # plotting  if n_components == 3:    fig = plt.figure(figsize=(10, 5))    ax = fig.add_subplot(121, projection='3d')    ax.scatter(prin_comp_sklearn[:, 0],                  prin_comp_sklearn[:, 1],                  prin_comp_sklearn[:, 1],                  c=labels)    ax.set_title("sklearn plot")    ax = fig.add_subplot(122, projection='3d')    ax.scatter(prin_comp_evd[:, 0],                  prin_comp_evd[:, 1],                  prin_comp_evd[:, 2],                  c=labels)    ax.set_title("PCA using EVD plot")    fig.suptitle(f"Plots for reducing to {n_components}-D")    plt.show()  elif n_components == 2:    fig, ax = plt.subplots(1, 2, figsize=(10, 5))    ax[0].scatter(prin_comp_sklearn[:, 0],                 prin_comp_sklearn[:, 1],                c=labels)    ax[0].set_title("sklearn plot")    ax[1].scatter(prin_comp_evd[:, 0],                 prin_comp_evd[:, 1],                c=labels)    ax[1].set_title("PCA using EVD plot")    fig.suptitle(f"Plots for reducing to {n_components}-D")    plt.show()  elif n_components == 1:    fig, ax = plt.subplots(1, 2, figsize=(10, 5))    ax[0].scatter(prin_comp_sklearn[:, 0],                 np.zeros_like(prin_comp_sklearn[:, 0]),                c=labels)    ax[0].set_title("sklearn plot")    ax[1].scatter(prin_comp_evd[:, 0],                 np.zeros_like(prin_comp_evd[:, 0]),                c=labels)    ax[1].set_title("PCA using EVD plot")    fig.suptitle(f"Plots for reducing to {n_components}-D")    plt.show()  return prin_comp_sklearn, prin_comp_evd[:, :n_components]

Loading the data set, pre-processing and running the experiment:

dataset = datasets.load_iris()X = dataset.datamean = np.mean(X, axis=0)# this was missing in your implementationstd = np.std(X, axis=0)X_std = (X - mean) / stdfor n in [3, 2, 1]:  pca_comparison(X_std, n, dataset.target)

Results

3D plot is a bit cluttered, but if you look at the 2D and 1D cases, you'll notice the plots are the same if we multiply the first principal component by -1; scikit-learn PCA implementation uses Singular Value Decomposition under the hood, which will give non-unique solutions (see here).

Test:

Using the flip_signs() function from here

def flip_signs(A, B):    """    utility function for resolving the sign ambiguity in SVD    http://stats.stackexchange.com/q/34396/115202    """    signs = np.sign(A) * np.sign(B)    return A, B * signsfor n in [3, 2, 1]:    sklearn_pca, evd = pca_comparison(X_std, n, dataset.target)    assert np.allclose(*flip_signs(sklearn_pca, evd))

Issues in your implementation:

If we look at the data scales in the iris data set, the data are of different scales. This suggests that we should standardize the data (Read here for more)

Quoting a part of the above answer:

Continued by @ttnphns
When would one prefer to do PCA (or factor analysis or other similar type of analysis) on correlations (i.e. on z-standardized variables) instead of doing it on covariances (i.e. on centered variables)?
When the variables are different units of measurement. That's clear
...

In order to obtain the principal axes using e_values, e_vectors = np.linalg.eigh(X_cov), you should extract columns of e_vectors (documentation). You were extracting rows.

CodeHunter

Implementing PCA with Numpy

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last