Dimension Reduction
High dimensional data
Dimension reduction
A typical machine learning task involves working with a large number of predictors which can make data analysis challenging.
For example, to compare each of the 784 features in our predicting digits example, we would have to create 306,936 scatterplots.
Creating one single scatterplot of the data is impossible due to the high dimensionality.
Dimension reduction
The general idea of dimension reduction is to reduce the dimension of the dataset while preserving important characteristics, such as the distance between features or observations.
With fewer dimensions, data analysis becomes more feasible.
The general technique behind it all, the singular value decomposition, is also useful in other contexts.
We will describe Principal Component Analysis (PCA).
Motivation: preserving distance
We consider an example with twin heights.
Some pairs are adults, the others are children.
Here we simulate 100 two-dimensional points that represent the number of standard deviations each individual is from the mean height.
Each point is a pair of twins.
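The simulation code is not shown in these slides. A minimal sketch, assuming bivariate normal heights with means of roughly 69 inches for adult pairs and 55 inches for child pairs and a correlation of about 0.9 (the seed and exact parameters are assumptions, so the numbers below will not reproduce the slides exactly):
library(MASS)
set.seed(1983)
n <- 100
Sigma <- matrix(c(9, 9*0.9, 9*0.9, 9), 2, 2)   # variance 9, correlation 0.9 (assumed)
x <- rbind(mvrnorm(n/2, c(69, 69), Sigma),     # adult twin pairs
           mvrnorm(n/2, c(55, 55), Sigma))     # child twin pairs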
Motivation: preserving distance
- We see correlation is high and two clusters of twins:
Motivation: preserving distance
Our features are \(n\) two-dimensional points.
We will pretend that visualizing two dimensions is too challenging and that we want to explore the data through a single histogram.
We want to reduce the dimensions from two to one, but still be able to understand important characteristics of the data.
Motivation: preserving distance
- Start by standardizing data
library(matrixStats)
x <- sweep(x, 2, colMeans(x))
x <- sweep(x, 2, colSds(x), "/")
Motivation: preserving distance
- We highlight the distance between observation 1 and 2 (blue), and observation 2 and 51 (red).
Motivation: preserving distance
- We can compute these distances using
dist
:
d <- dist(x)
as.matrix(d)[1, 2]
[1] 0.5949407
as.matrix(d)[2, 51]
[1] 1.388275
Note that the blue line is shorter.
We want our one dimension summary to approximate these distances.
Motivation: preserving distance
Note that the blue and red lines are almost diagonal.
An intuition is that most of the information about distance is in that direction.
We can rotate the points in a way that preserves the distance between them, while increasing the variability in one dimension and reducing it in the other.
Using this method, we keep more of the information about distances in the first dimension.
Rotations
- We saw that any point \((x_1, x_2)^\top\) can be written as the base and height of a triangle with a hypotenuse going from \((0,0)^\top\) to \((x_1, x_2)^\top\):
\[ x_1 = r \cos\phi, \,\, x_2 = r \sin\phi \]
- with \(r\) the length of the hypotenuse and \(\phi\) the angle between the hypotenuse and the x-axis.
Rotations
- To rotate the point \((x_1, x_2)^\top\) around a circle with center \((0,0)^\top\) and radius \(r\) by an angle \(\theta\) we change the angle to \(\phi + \theta\):
\[ z_1 = r \cos(\phi+ \theta), \,\, z_2 = r \sin(\phi + \theta) \]
Rotations
Rotations
- We can use trigonometric identities to rewrite \((z_1, z_2)\):
\[ \begin{aligned} z_1 &= r \cos(\phi + \theta)\\ &= r \cos \phi \cos\theta - r \sin\phi \sin\theta\\ &= x_1 \cos(\theta) - x_2 \sin(\theta)\\ z_2 &= r \sin(\phi + \theta)\\ &= r \cos\phi \sin\theta + r \sin\phi \cos\theta\\ &= x_1 \sin(\theta) + x_2 \cos(\theta) \end{aligned} \]
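We can check these identities numerically with a quick sketch (the point and angle below are arbitrary choices); both lines return the same pair of values:
x1 <- 1; x2 <- 2                       # an arbitrary point
r <- sqrt(x1^2 + x2^2)                 # length of the hypotenuse
phi <- atan2(x2, x1)                   # angle with the x-axis
theta <- pi/6                          # rotate by 30 degrees (in radians)
c(r*cos(phi + theta), r*sin(phi + theta))                         # polar form
c(x1*cos(theta) - x2*sin(theta), x1*sin(theta) + x2*cos(theta))   # identity form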
Rotations
- Here we rotate all points by \(-45\) degrees:
Rotations
The variability of \(x_1\) and \(x_2\) are similar.
The variability of \(z_1\) is larger than that of \(z_2\).
The distances between points appear to be preserved.
We soon show, mathematically, that distance is preserved.
Linear transformations
- Any time a matrix \(\mathbf{X}\) is multiplied by another matrix \(\mathbf{A}\), we refer to the product
\[\mathbf{Z} = \mathbf{X}\mathbf{A}\]
as a linear transformation of \(\mathbf{X}\).
- We can show that the previously shown rotation is a linear transformation.
Linear transformations
- To see this, note that for any row \(i\), the first entry was:
\[ z_{i,1} = a_{1,1} x_{i,1} + a_{2,1} x_{i,2} \]
- with \(a_{1,1} = \cos\theta\) and \(a_{2,1} = -\sin\theta\).
Linear transformations
- The second entry was also a linear transformation:
\[z_{i,2} = a_{1,2} x_{i,1} + a_{2,2} x_{i,2}\]
- with \(a_{1,2} = \sin\theta\) and \(a_{2,2} = \cos\theta\).
Linear transformations
- We can therefore write these transformations using the following matrix notation:
\[ \begin{pmatrix} z_1\\z_2 \end{pmatrix} = \begin{pmatrix} a_{1,1}&a_{1,2}\\ a_{2,1}&a_{2,2} \end{pmatrix}^\top \begin{pmatrix} x_1\\x_2 \end{pmatrix} \]
Linear transformations
- An advantage of using linear algebra is that we can write the transformation for the entire dataset by saving all observations in an \(n \times 2\) matrix:
\[ \mathbf{X} \equiv \begin{bmatrix} \mathbf{x_1}^\top\\ \vdots\\ \mathbf{x_n}^\top \end{bmatrix} = \begin{bmatrix} x_{1,1}&x_{1,2}\\ \vdots&\vdots\\ x_{n,1}&x_{n,2} \end{bmatrix} \]
Linear transformations
- We can then obtain the rotated values \(\mathbf{z}_i\) for each row \(i\) by applying a linear transformation of \(\mathbf{X}\):
\[ \mathbf{Z} = \mathbf{X} \mathbf{A} \mbox{ with } \mathbf{A} = \, \begin{pmatrix} a_{1,1}&a_{1,2}\\ a_{2,1}&a_{2,2} \end{pmatrix} = \begin{pmatrix} \cos \theta&\sin \theta\\ -\sin \theta&\cos \theta \end{pmatrix} . \]
- The columns of \(\mathbf{A}\) are referred to as directions because if we draw a vector from \((0,0)\) to \((a_{1,j}, a_{2,j})\), it points in the direction of the line that will become the \(j\)-th dimension.
Linear transformations
- If we define:
theta <- -45 * 2*pi/360 # convert to radians
A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
- We can write code implementing a rotation by any angle \(\theta\) using linear algebra:
rotate <- function(x, theta){
  theta <- theta*2*pi/360 # convert to radians
  A <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
  x %*% A
}
Linear transformations
- Another advantage of using linear algebra is that we can convert \(\mathbf{Z}\) back to \(\mathbf{X}\) by multiplying by the inverse \(\mathbf{A}^{-1}\):
\[ \mathbf{Z} \mathbf{A}^{-1} = \mathbf{X} \mathbf{A}\mathbf{A}^{-1} = \mathbf{X} \]
Linear transformations
- In this particular case, we can use trigonometry to show that:
\[ x_{i,1} = b_{1,1} z_{i,1} + b_{2,1} z_{i,2}\\ x_{i,2} = b_{1,2} z_{i,1} + b_{2,2} z_{i,2} \]
- with \(b_{1,1} = \cos\theta\), \(b_{2,1} = \sin\theta\), \(b_{1,2} = -\sin\theta\), and \(b_{2,2} = \cos\theta\).
Linear transformations
- This implies that:
\[ \mathbf{X} = \mathbf{Z} \begin{pmatrix} \cos \theta&-\sin \theta\\ \sin \theta&\cos \theta \end{pmatrix} \]
- This implies that all the information in \(\mathbf{X}\) is included in the rotation \(\mathbf{Z}\).
Linear transformations
- Note that in this case
\[ \begin{pmatrix} \cos \theta&-\sin \theta\\ \sin \theta&\cos \theta \end{pmatrix} = \mathbf{A}^\top \]
which implies
\[ \mathbf{Z} \mathbf{A}^\top = \mathbf{X} \mathbf{A}\mathbf{A}^\top\ = \mathbf{X} \]
and therefore that \(\mathbf{A}^\top\) is the inverse of \(\mathbf{A}\).
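We can confirm this numerically with a quick check, using the matrix A defined earlier:
all.equal(t(A), solve(A))  # A's transpose equals its matrix inverse, up to numerical error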
Remember that we represent the rows of a matrix as column vectors.
This explains why we use \(\mathbf{A}\) when showing the multiplication for the matrix \(\mathbf{Z}=\mathbf{X}\mathbf{A}\), but transpose the operation when showing the transformation for just one observation: \(\mathbf{z}_i = \mathbf{A}^\top\mathbf{x}_i\).
Linear transformations
- To see that distance is preserved, note that the squared distance between two rotated points \(\mathbf{z}_h\) and \(\mathbf{z}_i\) is
\[ \begin{aligned} ||\mathbf{z}_h - \mathbf{z}_i||^2 &= ||\mathbf{A}^\top \mathbf{x}_h - \mathbf{A}^\top \mathbf{x}_i||^2 \\ &= || \mathbf{A}^\top (\mathbf{x}_h - \mathbf{x}_i) ||^2 \\ &= (\mathbf{x}_h - \mathbf{x}_i)^{\top} \mathbf{A} \mathbf{A}^{\top} (\mathbf{x}_h - \mathbf{x}_i) \\ &= (\mathbf{x}_h - \mathbf{x}_i)^{\top} (\mathbf{x}_h - \mathbf{x}_i) \\ &= || \mathbf{x}_h - \mathbf{x}_i||^2 \end{aligned} \]
Linear transformations
- Here is an example for a 30 degree rotation, although it works for any angle:
all.equal(as.matrix(dist(rotate(x, 30))), as.matrix(dist(x)))
[1] TRUE
Orthogonal transformations
We refer to transformations with the property \(\mathbf{A} \mathbf{A}^\top = \mathbf{I}\) as orthogonal transformations.
These are guaranteed to preserve the distance between any two points.
We previously demonstrated our rotation has this property.
We can confirm using R:
A %*% t(A)
[,1] [,2]
[1,] 1.000000e+00 1.014654e-17
[2,] 1.014654e-17 1.000000e+00
Orthogonal transformations
- \(\mathbf{A}\) being orthogonal also guarantees that the total sum of squares (TSS) of \(\mathbf{X}\), defined as \(\sum_{i=1}^n \sum_{j=1}^p x_{i,j}^2\), is equal to the total sum of squares of the rotation \(\mathbf{Z} = \mathbf{X}\mathbf{A}\).
\[ \sum_{i=1}^n ||\mathbf{z}_i||^2 = \sum_{i=1}^n ||\mathbf{A}^\top\mathbf{x}_i||^2 = \sum_{i=1}^n \mathbf{x}_i^\top \mathbf{A}\mathbf{A}^\top \mathbf{x}_i = \sum_{i=1}^n \mathbf{x}_i^\top\mathbf{x}_i = \sum_{i=1}^n||\mathbf{x}_i||^2 \]
Orthogonal transformations
- We can confirm using R:
theta <- -45
z <- rotate(x, theta) # works for any theta
sum(x^2)
[1] 198
sum(z^2)
[1] 198
Orthogonal transformations
This can be interpreted as a consequence of the fact that an orthogonal transformation guarantees that all the information is preserved.
However, although the total is preserved, the sum of squares for the individual columns changes.
Transformations
- Here we compute the proportion of TSS attributed to each column, referred to as the variance explained or variance captured by each column, for \(\mathbf{X}\):
colSums(x^2)/sum(x^2)
[1] 0.5 0.5
- and \(\mathbf{Z}\):
colSums(z^2)/sum(z^2)
[1] 0.98477912 0.01522088
- We now explain how useful this property can be.
Principal Component Analysis
We have established that orthogonal transformations preserve the distance between observations and the total sum of squares.
We have also established that, while the TSS remains the same, the way this total is distributed across the columns can change.
The general idea behind Principal Component Analysis (PCA) is to try to find orthogonal transformations that concentrate the variance explained in the first few columns.
We can then focus on these few columns, effectively reducing the dimension of the problem.
Principal Component Analysis
- In our specific example, we are looking for the rotation that maximizes the variance explained in the first column:
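A sketch of how such a search could be carried out, using the rotate function defined earlier (the grid of angles is an arbitrary choice):
angles <- seq(-90, 0, 1)   # candidate rotation angles in degrees
ve1 <- sapply(angles, function(angle) {
  z <- rotate(x, angle)
  sum(z[, 1]^2)/sum(z^2)   # variance explained by the first column
})
plot(angles, ve1, type = "l", xlab = "Angle (degrees)", ylab = "Variance explained by first column")
angles[which.max(ve1)]     # should be close to -45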
Principal Component Analysis
We find that a -45 degree rotation appears to achieve the maximum, with over 98% of the total variability explained by the first dimension.
We denote this rotation matrix with \(\mathbf{V}\):
theta <- 2*pi*-45/360 # convert to radians
V <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
Principal Component Analysis
- We can rotate the entire dataset using:
\[ \mathbf{Z} = \mathbf{X}\mathbf{V} \]
- In R:
z <- x %*% V
Principal Component Analysis
Principal Component Analysis
Principal Component Analysis
The first dimension of z is referred to as the first principal component (PC).
Because almost all the variation is explained by this first PC, the distance between rows in x can be very well approximated by the distance calculated with just z[,1].
Principal Component Analysis
rafalib::mypar()
plot(dist(x), dist(z[,1]))
abline(0, 1, col = "red")
Principal Component Analysis
- The two groups can be clearly observed with just this one dimension, better than with either of the two original dimensions:
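A minimal sketch of this one-dimensional summary:
hist(z[, 1], breaks = 20, xlab = "First dimension", main = "")  # two clusters should appear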
Principal Component Analysis
- We can visualize these to see how the first component summarizes the data:
Principal Component Analysis
This idea generalizes to dimensions higher than 2.
As done in our two dimensional example, we start by finding the \(p \times 1\) vector \(\mathbf{v}_1\) with \(||\mathbf{v}_1||=1\) that maximizes \(||\mathbf{X} \mathbf{v}_1||\).
The projection \(\mathbf{X} \mathbf{v}_1\) is the first PC.
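In two dimensions we can illustrate this maximization with a brute-force sketch: any unit vector can be written as \((\cos\phi, \sin\phi)^\top\), so we can simply search over the angle \(\phi\). This is only an illustration; prcomp, introduced below, does not compute PCs this way.
f <- function(phi) sum((x %*% c(cos(phi), sin(phi)))^2)  # ||X v||^2 for v = (cos(phi), sin(phi))
phi1 <- optimize(f, c(0, pi), maximum = TRUE)$maximum
v1 <- c(cos(phi1), sin(phi1))
v1   # points in (roughly) the 45 degree direction found earlier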
Principal Component Analysis
- To find the second PC, we subtract the variation explained by the first PC from \(\mathbf{X}\):
\[ \mathbf{r} = \mathbf{X} - \mathbf{X} \mathbf{v}_1 \mathbf{v}_1^\top \]
and then find the vector \(\mathbf{v}_2\) with \(||\mathbf{v}_2||=1\) that maximizes \(||\mathbf{r} \mathbf{v}_2||\).
The projection \(\mathbf{X} \mathbf{v}_2\) is the second PC.
Principal Component Analysis
- We then subtract the variation explained by the first two PCs, and continue this process until we have the entire rotation matrix and matrix of principal components, respectively:
\[ \mathbf{V} = \begin{bmatrix} \mathbf{v}_1&\dots&\mathbf{v}_p \end{bmatrix}, \mathbf{Z} = \mathbf{X}\mathbf{V} \]
- The idea of distance preservation extends to higher dimensions.
Principal Component Analysis
For a multidimensional matrix with \(p\) columns, we can find an orthogonal transformation \(\mathbf{V}\) that preserves the distance between rows, but with the variance explained by the columns in decreasing order.
If the variances of the columns \(\mathbf{Z}_j\), \(j>k\) are very small, these dimensions have little to contribute to the distance calculation and we can approximate the distance between any two points with just \(k\) dimensions.
If \(k\) is much smaller than \(p\), then we can achieve a very efficient summary of our data.
Notice that the solution to this maximization problem is not unique because \(||\mathbf{X} \mathbf{v}|| = ||-\mathbf{X} \mathbf{v}||\).
Also, note that if we multiply a column of \(\mathbf{Z}\) by \(-1\), we still represent \(\mathbf{X}\) as \(\mathbf{Z}\mathbf{V}^\top\), as long as we also multiply the corresponding column of \(\mathbf{V}\) by \(-1\).
This implies that we can arbitrarily change the sign of each column of the rotation \(\mathbf{V}\) and principal component matrix \(\mathbf{Z}\).
Principal Component Analysis
- In R, we can find the principal components of any matrix with the function prcomp:
pca <- prcomp(x, center = FALSE)
- The default behavior is to center the columns of x before computing the PCs, an operation we don't currently need because our matrix is scaled.
Principal Component Analysis
The object pca includes the rotated data \(\mathbf{Z}\) in pca$x and the rotation \(\mathbf{V}\) in pca$rotation.
We can see that the columns of pca$rotation are indeed the rotation obtained with \(-45\) degrees (remember the sign is arbitrary):
pca$rotation
PC1 PC2
[1,] -0.7071068 0.7071068
[2,] -0.7071068 -0.7071068
Principal Component Analysis
The square root of the variance of each column is stored in the pca$sdev component.
This implies we can compute the variance explained by each PC using:
pca$sdev^2/sum(pca$sdev^2)
[1] 0.98477912 0.01522088
Principal Component Analysis
- The function summary performs this calculation:
summary(pca)
Importance of components:
PC1 PC2
Standard deviation 1.4034 0.17448
Proportion of Variance 0.9848 0.01522
Cumulative Proportion 0.9848 1.00000
Principal Component Analysis
- We also see that we can rotate between x (\(\mathbf{X}\)) and pca$x (\(\mathbf{Z}\)) as explained with the mathematics shown earlier:
all.equal(pca$x, x %*% pca$rotation)
[1] TRUE
all.equal(x, pca$x %*% t(pca$rotation))
[1] TRUE
Iris example
The iris data is a widely used example.
It includes four botanical measurements related to three flower species:
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
head(iris$Species)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
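The feature matrix x used in the computations that follow is not defined in these slides; presumably it contains the four measurements, for example:
x <- as.matrix(iris[, 1:4])  # assumed construction of the feature matrix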
Iris example
- If we visualize the distances, we see the three species:
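One way to produce such a visualization (a sketch; the exact plotting code used for the slides is not shown):
d <- dist(x)
image(as.matrix(d))  # observations are ordered by species, so blocks appear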
Iris example
Our feature matrix has four dimensions.
Three of them are highly correlated:
cor(x)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Iris example
If we apply PCA, we should be able to approximate this distance with just two dimensions, compressing the highly correlated dimensions.
Using the summary function, we can see the variability explained by each PC:
pca <- prcomp(x)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
Iris example
- We are able to approximate the distances with two dimensions:
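A sketch confirming this, comparing the full four-dimensional distances to those computed from the first two PCs:
d_approx <- dist(pca$x[, 1:2])  # distances using only the first two PCs
plot(dist(x), d_approx)
abline(0, 1, col = "red")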
Iris example
- A useful application is that we can now visualize the data with a two-dimensional plot:
library(ggplot2)
data.frame(pca$x[,1:2], Species = iris$Species) |>
ggplot(aes(PC1, PC2, fill = Species)) +
geom_point(cex = 3, pch = 21) +
coord_fixed(ratio = 1)
Iris example
PCA visualized
Iris example
We learn that:
the first PC is a weighted average of sepal length, petal length, and petal width (red in the first column), minus a quantity proportional to sepal width (blue in the first column).
The second PC is a weighted average of petal length and petal width, minus a weighted average of sepal length and sepal width.
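These weights can be read directly from the rotation matrix:
pca$rotation[, 1:2]  # loadings for the first two PCs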
MNIST example
The written digits example has 784 features.
Is there any room for data reduction? We will use PCA to answer this.
We expect pixels close to each other on the grid to be correlated: dimension reduction should be possible.
MNIST example
- Let’s compute the PCs:
pca <- prcomp(mnist$train$images)
MNIST example
And look at the variance explained:
plot(pca$sdev^2/sum(pca$sdev^2), xlab = "PC", ylab = "Variance explained")
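A related sketch: the cumulative proportion of variance explained, which suggests how many PCs are needed to retain most of the variability (the 95% cutoff is only an illustration):
cum_ve <- cumsum(pca$sdev^2)/sum(pca$sdev^2)
plot(cum_ve, xlab = "Number of PCs", ylab = "Cumulative variance explained")
which(cum_ve >= 0.95)[1]  # number of PCs needed to reach 95%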
MNIST example
- First two PCs for a random sample of 500 digits:
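A sketch of how such a plot might be produced (the sample size matches the slide; the seed and plotting details are assumptions):
library(ggplot2)
set.seed(1)
ind <- sample(nrow(pca$x), 500)
data.frame(PC1 = pca$x[ind, 1], PC2 = pca$x[ind, 2],
           label = factor(mnist$train$labels[ind])) |>
  ggplot(aes(PC1, PC2, color = label)) +
  geom_point()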
MNIST example
- We can also see the rotation values on the 28 \(\times\) 28 grid to get an idea of how pixels are being weighted in the transformations that result in the PCs.
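A sketch for displaying one rotation column as an image (the column flip is only to orient the digit grid; the orientation depends on how the pixels are stored):
image(matrix(pca$rotation[, 1], 28, 28)[, 28:1])  # first PC weights on the pixel grid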
First four PCs (first, second, third, and fourth PC shown on the 28 \(\times\) 28 grid)
MNIST example
- We can also confirm that the lower variance PCs appear related to unimportant variability, mainly smudges in the corners.
Last four PCs
200 dimensions approximation
images_hat <- pca$x[,1:200] %*% t(pca$rotation[,1:200])
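To judge the quality of the approximation, we can compare an original digit to its reconstruction. A sketch (since prcomp centered the columns, pca$center must be added back to recover pixel values):
i <- 1  # an arbitrary example index
orig  <- matrix(mnist$train$images[i, ], 28, 28)
recon <- matrix(images_hat[i, ] + pca$center, 28, 28)
par(mfrow = c(1, 2))
image(orig[, 28:1], main = "Original")   # column flip for display orientation
image(recon[, 28:1], main = "200 PCs")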
200 dimensions approximation