What's wrong with my PCA?


You decomposed the wrong matrix.

Principal Component Analysis works on the eigenvectors/eigenvalues of the covariance (or correlation) matrix, not on the data matrix itself. The covariance matrix of an m x n data matrix (m observations, each with n features) is an n x n matrix; if you use the correlation matrix instead, it will have ones along the main diagonal.

You can indeed use the cov function, but then you need to normalize the result yourself. It's probably a little easier to use the closely related corrcoef, which returns the correlation matrix directly:

    import numpy as NP
    import numpy.linalg as LA

    # a simulated data set with 8 data points, each point having five features
    data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

    # usually a good idea to mean-center your data first:
    data -= NP.mean(data, axis=0)

    # calculate the correlation matrix (returns an n x n matrix, here 5 x 5)
    C = NP.corrcoef(data, rowvar=0)

    # now get the eigenvalues/eigenvectors of C:
    evals, evecs = LA.eig(C)

To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) linalg module: it is a little easier to work with than svd, and its return values are the eigenvectors and eigenvalues themselves, nothing else. By contrast, as you know, svd doesn't return these directly.
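To show what I mean, here is a minimal sketch of finishing the job from the eig output above: sort the components by decreasing eigenvalue and project the data onto the top k of them (evals, evecs, and data come from the snippet above; k = 2 is just an arbitrary choice for illustration):

    # continuing from the snippet above: order components by decreasing eigenvalue
    idx = NP.argsort(evals)[::-1]
    evals, evecs = evals[idx], evecs[:, idx]

    # project the 8 x 5 data onto the top k components -> 8 x k scores
    k = 2
    projected = NP.dot(data, evecs[:, :k])

(As an aside, since C is symmetric, NumPy's eigh is an even better fit than eig here: it is faster for symmetric matrices and returns the eigenvalues already sorted in ascending order.)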

Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you will always have a square matrix to decompose, regardless of the shape of your data. This is because the matrix you are decomposing in PCA is a covariance (or correlation) matrix: both its rows and its columns correspond to the features of the original data, and each cell holds the covariance (or correlation) of a pair of features. That is also why the correlation matrix has ones down its main diagonal: every feature is perfectly correlated with itself.
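If you want to convince yourself of this, a quick check with made-up random data shows that whichever way around your data matrix is, the matrix cov produces is square:

    import numpy as NP

    tall = NP.random.randn(100, 3)        # 100 observations x 3 features
    wide = tall.T                         # the same data, transposed

    print(NP.cov(tall, rowvar=0).shape)   # (3, 3)
    print(NP.cov(wide, rowvar=1).shape)   # (3, 3)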


The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.

The covariance matrix of a dataset A (mean-centered, with observations as columns) is: (1/(N-1)) AA^T

Now, when you do PCA by using the SVD, you have to divide each entry in your A matrix by sqrt(N-1) (equivalently, divide the squared singular values by N-1) so you get the eigenvalues of the covariance matrix with the correct scale.

In your case, N=150 and you haven't done this division, hence the discrepancy.
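Here is a small numerical check of that scaling, with stand-in random data in place of your actual 4 x 150 matrix (variables as rows, observations as columns; np.cov and its eigenvalues are the reference):

    import numpy as np

    # stand-in for the real data: 4 variables, N = 150 observations as columns
    N = 150
    A = np.random.randn(4, N)
    A -= A.mean(axis=1, keepdims=True)       # mean-center each variable

    C = np.cov(A)                            # 4 x 4 covariance matrix
    eig_vals = np.linalg.eigvalsh(C)         # its eigenvalues, ascending

    # SVD of the rescaled matrix: squared singular values == eigenvalues of C
    s = np.linalg.svd(A / np.sqrt(N - 1), compute_uv=False)
    print(np.allclose(np.sort(s ** 2), eig_vals))   # True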

This is explained in detail here.


(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)

  1. You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.

  2. Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:

    a. the left singular vectors of X,

    b. the principal components of X,

    c. the eigenvectors of X X^T.

    Also, the eigenvalues of X X^T are equal to the squares of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
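    Here is a quick numerical check of (a)-(c) and the eigenvalue relation, on made-up data shaped like yours (4-by-150, columns mean-centered); note that eigenvectors and singular vectors are only determined up to sign, so the comparison uses absolute values:

        import numpy as np

        # made-up observations: each of the 150 columns is one 4-element observation
        X = np.random.randn(4, 150)
        X -= X.mean(axis=1, keepdims=True)

        # left singular vectors / singular values of X
        Q, s, Vt = np.linalg.svd(X, full_matrices=False)

        # eigendecomposition of X X^T (eigh sorts ascending, so flip to descending)
        w, U = np.linalg.eigh(X @ X.T)
        w, U = w[::-1], U[:, ::-1]

        print(np.allclose(w, s ** 2))              # eigenvalues == squared singular values
        print(np.allclose(np.abs(U), np.abs(Q)))   # same vectors, up to sign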