The goal of PCA is to find a new set of dimensions that better captures the variability of the data. More specifically, the first dimension is chosen to capture as much of the variability as possible. The second dimension is orthogonal to the first, and, subject to that constraint, captures as much of the remaining variability as possible, and so on.
Statisticians summarize the variability of a collection of multivariate data, i.e., data that has multiple continuous attributes, by computing the covariance matrix of the data.
Given an m × n data matrix D, whose m rows are data objects and whose n columns are attributes, the covariance matrix of D is the n × n matrix S that has entries s_ij defined as s_ij = covariance(d_i, d_j), the covariance of the ith and jth attributes (columns) of the data. If the data matrix D is preprocessed so that the mean of each attribute is 0, then S = D^T D / (m − 1).
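The covariance computation above can be sketched with NumPy; the small data matrix here is a hypothetical example, not from the text:

```python
import numpy as np

# Hypothetical data matrix D: m = 4 objects (rows), n = 2 attributes (columns).
D = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0],
              [1.0, 3.0]])

# Center each attribute so its mean is 0.
Dc = D - D.mean(axis=0)

# With zero-mean attributes, S = Dc^T Dc / (m - 1).
m = Dc.shape[0]
S = Dc.T @ Dc / (m - 1)

# NumPy's covariance routine (rowvar=False treats columns as attributes) agrees.
print(np.allclose(S, np.cov(D, rowvar=False)))  # True
```

Note that `np.cov` centers the data internally, which is why it matches even though it is given the uncentered `D`.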
PCA finds a transformation of the data that satisfies the following properties:
- Each pair of distinct new attributes has covariance 0; i.e., the new attributes are uncorrelated.
- The attributes are ordered with respect to how much of the variance of the data each attribute captures.
- The first attribute captures as much of the variance of the data as possible.
- Subject to the orthogonality requirement, each successive attribute captures as much of the remaining variance as possible.
A transformation of the data that has these properties can be obtained by using eigenvalue analysis of the covariance matrix. Let λ_1, ..., λ_n be the eigenvalues of S.
The eigenvalues are all non-negative and can be ordered such that λ_1 ≥ λ_2 ≥ ... ≥ λ_n. Covariance matrices are examples of what are called positive semidefinite matrices, which, among other properties, have non-negative eigenvalues.
Let U = [u_1, ..., u_n] be the matrix whose columns are the eigenvectors of S. These eigenvectors are ordered so that the ith eigenvector corresponds to the ith largest eigenvalue.
Finally, assume that the data matrix D has been preprocessed so that the mean of each attribute is 0. We can make the following statements:
- The data matrix D' = DU is the set of transformed data that satisfies the conditions posed above.
- Each new attribute is a linear combination of the original attributes. Specifically, the weights of the linear combination for the ith new attribute are the components of the ith eigenvector, u_i.
- The variance of the ith new attribute is λ_i.
- The sum of the variances of the original attributes is equal to the sum of the variances of the new attributes.
- The new attributes are called principal components; i.e., the first new attribute is the first principal component, the second new attribute is the second principal component, and so on.
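The statements above can be checked numerically. The following is a minimal sketch, assuming a synthetic mean-centered data matrix; it computes the eigen-decomposition of the covariance matrix, forms D' = DU, and verifies that the new attributes' variances are the eigenvalues and that total variance is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: m = 200 objects, n = 3 attributes.
D = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
D = D - D.mean(axis=0)                # mean of each attribute is now 0

m = D.shape[0]
S = D.T @ D / (m - 1)                 # covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= ... and matching eigenvectors.
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

Dprime = D @ U                        # transformed data: the principal components

# Variance of the ith new attribute equals lambda_i ...
new_var = Dprime.var(axis=0, ddof=1)
print(np.allclose(new_var, eigvals))                            # True
# ... and the total variance is unchanged by the transformation.
print(np.isclose(new_var.sum(), D.var(axis=0, ddof=1).sum()))   # True
```

Because cov(DU) = U^T S U = diag(λ_1, ..., λ_n), the distinct new attributes also have zero covariance with one another, as required.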
One major approach to dimensionality reduction is projection: given data in an n-dimensional space, we choose a d-dimensional hyperplane and project the data onto it. The hyperplane is chosen by projecting the data onto candidate axes and comparing the variance along each; axes along which the variance is high are preferred, since preserving more variance means less information is lost. The axes of the d-dimensional hyperplane are called components, and they are easy to find: take the SVD of the (mean-centered) training set matrix, D = W Σ V^T, where the columns of V are the component vectors (see the visualization of the SVD). Note, however, that reducing dimensionality mainly reduces training time; it does not necessarily improve a model's performance and can sometimes degrade it.
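The SVD route can be sketched as follows; the data here is again a hypothetical synthetic example. The rows of `Vt` are the component vectors, already ordered by decreasing singular value, and the singular values relate to the covariance eigenvalues by λ_i = s_i² / (m − 1):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical mean-centered training set: m = 100 objects, n = 4 attributes.
D = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
D = D - D.mean(axis=0)

# SVD: D = W @ diag(s) @ Vt; the rows of Vt (columns of V) are the components.
W, s, Vt = np.linalg.svd(D, full_matrices=False)

d = 2                                  # project onto the top-d components
D_reduced = D @ Vt[:d].T               # shape (100, 2)

# Singular values vs. covariance eigenvalues: lambda_i = s_i**2 / (m - 1).
m = D.shape[0]
eigvals = np.linalg.eigvalsh(D.T @ D / (m - 1))[::-1]   # descending order
print(np.allclose(eigvals, s**2 / (m - 1)))             # True
```

This is why the eigen-decomposition of the covariance matrix and the SVD of the centered data matrix yield the same principal components.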