Explain PCA and 2 ways to calculate it

There are lots of resources about Principal Component Analysis (PCA), yet it remains a piece of abstract knowledge that many ML engineers never fully understand. One may be able to drop the Scikit-learn API into an ML pipeline without knowing the underlying maths. This post will walk through the intuition of PCA as if explaining it to a young kid, and then calculate PCA manually so we can learn the insights behind the API.

Intuition

What is PCA

PCA is a method of summarising a wide table using fewer characteristics. For example, say there are 10 kinds of fizzy drinks. Each of them has its own flavour, colour, clarity, sweetness, smoothness, acidity, price, etc. You need to place them on one shelf with 4 levels, keeping similar drinks close together. With so many characteristics, many of them overlapping, which ones do you choose? PCA lets you discover a new way to group the similar drinks together.

It lets you summarise (represent) higher dimensional data in a lower dimensional space (3D –> 2D, 2D –> 1D).

Note, PCA does not simply discard (a) dimension(s) to achieve this (lol). Instead, it constructs new characteristics of the drinks out of the existing ones. For the drinks, a new characteristic might be something like "sweet and colourless at a similar price". PCA will find the best possible such characteristics among all conceivable linear combinations of the existing ones.
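To make "linear combination" concrete, here is a toy sketch. The characteristics and weights below are made up for illustration; they are not numbers PCA would actually produce:

# a drink described by three hand-picked characteristics
sweetness, price, acidity = 7.0, 1.2, 3.0

# a hypothetical new characteristic built as a weighted mix of the old ones;
# a principal component is exactly one such weighted mix, chosen optimally
new_characteristic = 0.8 * sweetness - 0.1 * price + 0.3 * acidity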

Calculating PCA

Let’s use the Iris dataset as an example.

Prepare the dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle

iris = datasets.load_iris()
X, y = shuffle(iris.data, iris.target, random_state=42)

iris.data.shape
>> (150, 4)

iris.data[:3]
>> array([[ 5.1,  3.5,  1.4,  0.2],
          [ 4.9,  3. ,  1.4,  0.2],
          [ 4.7,  3.2,  1.3,  0.2]])
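The four columns are the flower measurements; their names are stored on the dataset object:

iris.feature_names
>> ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']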

Let’s plot it and see what the data looks like.

X, y = shuffle(iris.data, iris.target, random_state=42)
fig, ax = plt.subplots(1, figsize=(8, 6,))
# permute the class labels (0 -> 1, 1 -> 2, 2 -> 0) purely for colouring
y = np.choose(y, [1, 2, 0]).astype('float32')

# plot the first two features; 'spectral' was removed from recent matplotlib,
# 'nipy_spectral' is its replacement
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.nipy_spectral, edgecolor='k')
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.show()

[Figure: scatter plot of the first two Iris features, coloured by class]

We can see that two of the classes are not very well separated (black and grey).

Method 1: Calculate PCA through eigenvalues

Data preprocessing (common to both methods)

  1. Mean normalisation. Subtract each feature's mean so that the data is centred at the origin. This matters for PCA because the covariance computed below measures spread around the mean; without centring, the leading direction found from the data tends to be dominated by its mean rather than by its variance.

  2. Optional feature scaling. If your features have very different scales, say a house dataset with one column for the number of rooms and another for the size in square metres, divide each feature by its standard deviation (or its range) so that no single feature dominates the covariance purely because of its units. A minimal sketch of both steps follows this list.
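A minimal sketch of both preprocessing steps, assuming X is the Iris matrix loaded above (for Iris the scaling step is genuinely optional, since all four features are measured in centimetres):

# mean normalisation: centre every feature at zero
X_centered = X - X.mean(axis=0)

# optional feature scaling: divide each feature by its standard deviation
X_scaled = X_centered / X_centered.std(axis=0)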

We will then perform the following steps:

  1. Compute the covariance matrix, whose entries are the pairwise covariances between features
    \displaystyle \mathrm{cov}\left(X,Y\right)=\frac{\sum_{i=1}^{n}\left(X_i-\overline{X}\right)\left(Y_i-\overline{Y}\right)}{n-1}
  2. Compute the eigenvectors of the covariance matrix
    \displaystyle \Sigma \vec{v}=\lambda \vec{v}
  3. Select the N eigenvectors with the largest eigenvalues
  4. Project the dataset onto the subspace spanned by those eigenvectors
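We can translate these steps directly into NumPy:
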
X, y = shuffle(iris.data, iris.target, random_state=42)

def PCA1(data):
    m, n = data.shape
    # mean centre the data (in place, so the caller's X is centred too,
    # which the projection step below relies on)
    data -= data.mean(axis=0)

    # covariance matrix; dividing by m rather than m - 1 only rescales the
    # eigenvalues and does not change the eigenvectors
    # (or simply np.cov(data, rowvar=False))
    covariance = np.dot(data.T, data.conj()) / m
    # eigh is designed for symmetric matrices; eigenvalues come back in ascending order
    eig_val, eig_vec = np.linalg.eigh(covariance)

    # Sort the eigenvectors by eigenvalue, largest first
    eig_pairs = [
        (np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))
    ]
    eig_pairs.sort(key=lambda x: x[0], reverse=True)
    # keep the top 2 eigenvectors as the columns of an (n, 2) matrix
    return np.hstack((
        eig_pairs[0][1].reshape((n, 1)),
        eig_pairs[1][1].reshape((n, 1))
    ))

eigen_vecs = PCA1(X)
fig, ax = plt.subplots(1, figsize=(8, 6,))
y = np.choose(y, [1, 2, 0])

# Project dataset onto subspace
transformed = np.dot(eigen_vecs.T, X.T).T
ax.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.nipy_spectral, edgecolor='k')

plt.show()

[Figure: Iris data projected onto the first two principal components (eigendecomposition method)]
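How much do we lose by keeping only two components? Each eigenvalue is proportional to the variance captured along its eigenvector, so their relative sizes answer that. A quick check, recomputing the eigenvalues here rather than changing what PCA1 returns:

centered = iris.data - iris.data.mean(axis=0)
eig_val, _ = np.linalg.eigh(np.cov(centered, rowvar=False))
explained = eig_val[::-1] / eig_val.sum()   # eigh returns ascending order, so reverse
print(explained)

For Iris the first two components account for roughly 98% of the total variance, so very little is lost by discarding the other two.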

Method 2: Calculate PCA through SVD

In Professor Andrew Ng’s lecture videos, Singular Value Decomposition is used instead of an eigendecomposition of the covariance matrix because SVD is numerically more robust; there are cases where explicitly forming the covariance matrix leads to numerical problems.
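The two approaches agree because of a standard identity. For mean-centred data X with SVD X = USV^T, the covariance matrix factors as

\displaystyle \Sigma=\frac{X^{T}X}{m}=\frac{VS^{T}SV^{T}}{m}

so the columns of V (the right singular vectors) are exactly the eigenvectors of \Sigma, with eigenvalues \lambda_i = s_i^2 / m. SVD therefore recovers the same principal components without ever forming X^{T}X explicitly.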

X, y = shuffle(iris.data, iris.target, random_state=42)

def PCA2(data):
    """
    PCA via SVD
    """
    m, n = data.shape
    # mean centre the data (in place, as in PCA1)
    data -= data.mean(axis=0)
    # SVD of the centred data itself, no covariance matrix needed;
    # the rows of vt are the principal directions
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    v = vt.T
    # keep the first two right singular vectors as columns
    return v[:, :2]

eigen_vecs = PCA2(X)
fig, ax = plt.subplots(1, figsize=(8, 6,))
y = np.choose(y, [1, 2, 0]).astype('float32')
transformed =  np.dot(X, eigen_vecs)
ax.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.nipy_spectral, edgecolor='k')
plt.show()

[Figure: Iris data projected onto the first two principal components (SVD method)]
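As a sanity check, the two methods should return the same directions, possibly with flipped signs, since an eigenvector is only defined up to sign. A quick comparison (using copies so neither function centres the shared array in place) should print True:

v1 = PCA1(iris.data.copy())
v2 = PCA2(iris.data.copy())
print(np.allclose(np.abs(v1), np.abs(v2)))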

Use the Scikit-learn API

Of course, in practice one won’t write all of that by hand for PCA; it’s already implemented in the scikit-learn API.

from sklearn import decomposition

pca = decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)
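The fitted estimator also exposes explained_variance_ratio_, the fraction of the total variance captured by each kept component, which lines up with what we computed from the eigenvalues earlier:

print(pca.explained_variance_ratio_)
# roughly [0.92, 0.05] for Iris, i.e. the two components keep about 98% of the variance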