2024-11-12
When the number of variables is large and they can all be represented as numbers, it is convenient to store them in a matrix and perform the analysis with linear algebra operations, rather than using the tidyverse with data frames.
The variables for each observation are stored in a row, resulting in a matrix with as many columns as variables.
We refer to the values stored in these rows as the covariates or predictors and, in machine learning, we refer to them as the features.
In linear algebra, we have three types of objects: scalars, vectors, and matrices.
We have already learned about vectors in R, and, although there is no data type for scalars, we can represent them as vectors of length 1.
Today we learn how to work with matrices in R and relate them to linear algebra notation and concepts.
Soon we will describe how we can build computer algorithms to read handwritten digits, which robots then use to sort letters.
To do this, we first need to collect data, which in this case is a high-dimensional dataset that is best stored in a matrix.
For each digitized image, indexed by \(i\), we are provided with 784 variables and a categorical outcome, or label, representing which digit among \(0, 1, 2, 3, 4, 5, 6, 7, 8,\) and \(9\) the image represents.
Let’s load the data using the dslabs package:
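Here is a minimal sketch of loading the data; `read_mnist()` returns a list with a training and a test set, each containing the images and their labels:

```r
library(dslabs)
mnist <- read_mnist()   # list with $train and $test, each with $images and $labels
```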
With the data loaded, here are the tasks we will work through:

1. Visualize the original image. The pixel intensities for each image are provided as a row of the matrix.
2. Do some digits require more ink to write than others?
3. Are some pixels uninformative?
4. Can we remove smudges?
5. Binarize the data.
6. Standardize the digits.
Neither the tidyverse nor data.table is designed to perform these types of mathematical operations.
For this task, it is convenient to use matrices.
To simplify the code below, we will rename the training images and their labels `x` and `y`, respectively:
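Assuming `mnist` was loaded as sketched above:

```r
x <- mnist$train$images   # 60,000 x 784 matrix of pixel intensities
y <- mnist$train$labels   # vector of 60,000 digit labels
```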
The `nrow` function tells us how many rows the matrix has, and `ncol` tells us how many columns:
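For example, with `x` as defined above:

```r
nrow(x)
#> [1] 60000
ncol(x)
#> [1] 784
```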
We learn that our dataset contains 60,000 observations (images) and 784 features (pixels).
The `dim` function returns the rows and columns:
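```r
dim(x)
#> [1] 60000   784
```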
We can create a matrix from a vector with the `matrix` function, specifying the number of rows and columns; by default the matrix is filled column by column, but we can fill by row with the `byrow` argument. The function `as.vector` converts a matrix back into a vector.

Warning: if the product of columns and rows does not match the length of the vector provided in the first argument, `matrix` recycles values.
If the length of the vector is a sub-multiple or multiple of the number of rows, this happens without warning:
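A minimal sketch of these operations with a small arbitrary vector; the silent-recycling example at the end is my own illustration:

```r
mat  <- matrix(1:15, 5, 3)                 # filled column by column
mat2 <- matrix(1:15, 3, 5, byrow = TRUE)   # filled row by row instead
as.vector(mat)                             # back to the vector 1:15
matrix(1:4, 2, 4)                          # 1:4 is recycled with no warning
```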
We can extract subsets of a matrix by using vectors of indices.
For example, we can extract the first 100 pixels from the first 300 observations like this:
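With `x` as defined above:

```r
x[1:300, 1:100]   # first 300 observations, first 100 pixels
```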
Similarly, we can subset any number of columns, while keeping all rows, by leaving the first dimension blank.
Here is the code to extract the first 100 pixels:
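```r
x[, 1:100]   # all rows, first 100 pixels
```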
The third row of the matrix, `x[3,]`, contains the 784 pixel intensities of the third image.
We can assume these were entered in order and convert them back to a \(28 \times 28\) matrix using:
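A minimal sketch; the `image` call previews the visualization described next, with the column order reversed so the digit is not shown upside down:

```r
grid <- matrix(x[3, ], 28, 28)     # 3rd image as a 28 x 28 matrix
image(1:28, 1:28, grid[, 28:1])    # display it, flipping columns for correct orientation
```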
We can then visualize this matrix with the `image` function, as in the sketch above.

A common operation with matrices is to apply the same function to each row or to each column.
For example, we may want to compute row averages and standard deviations.
The `apply` function lets you do this.
The first argument is the matrix, the second is the dimension (1 for rows, 2 for columns), and the third is the function to be applied.
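For example, with `x` as defined above (this is slower than the dedicated functions described next):

```r
avg <- apply(x, 1, mean)   # one average per image (row)
sds <- apply(x, 1, sd)     # one standard deviation per image
```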
Because these operations are so common, special functions are available to perform them.
The function `rowMeans` computes the average of each row, and `rowSds` computes the standard deviation of each row. The functions `colMeans` and `colSds` provide the analogous operations for columns.
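A sketch using base R's `rowMeans` together with `rowSds` from the matrixStats package:

```r
library(matrixStats)
avg <- rowMeans(x)   # average intensity of each image
sds <- rowSds(x)     # standard deviation of each image
```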
For additional fast implementations, consider the functions available in the matrixStats package.
For the second task, related to total pixel darkness, we want to see the average use of ink plotted against digit.
We have already computed this average and can generate a boxplot to answer the question:
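One way to generate the boxplot in base R, assuming `avg` and `y` as defined above:

```r
boxplot(avg ~ y, xlab = "Digit", ylab = "Average pixel intensity")
```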
One of the advantages of matrix operations over tidyverse operations is that we can easily select columns based on summaries of the columns.
Note that logical filters can be used to subset matrices in much the same way they are used to subset vectors:
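Here is a small sketch with an arbitrary matrix; the logical vector flags which columns to keep:

```r
mat <- matrix(1:15, 3, 5)
mat[, c(FALSE, TRUE, TRUE, FALSE, TRUE)]   # keeps columns 2, 3, and 5
```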
We can use these ideas to remove columns associated with pixels that don't change much and thus do not inform digit classification.
We will quantify the variation of each pixel with its standard deviation across all entries, computed with the `colSds` function from the matrixStats package. We could then remove features that have no variation, since these can't help us predict.
So if we wanted to remove uninformative predictors from our matrix, we could write this one line of code:
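A sketch of one possibility; here we keep only pixels whose standard deviation is above zero, although a larger cutoff could be used:

```r
new_x <- x[, colSds(x) > 0]   # drop pixels that never vary
```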
An operation that facilitates efficient coding is changing entries of a matrix based on conditionals applied to that same matrix.
Here is a simple example:
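For instance, with a small arbitrary matrix, we can zero out every entry between 6 and 12:

```r
mat <- matrix(1:15, 5, 3)
mat[mat > 6 & mat < 12] <- 0
mat
```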
The same approach can be used to change all the `NA` entries of a matrix to something else.

For the smudge-removal task, we start by looking at the distribution of all pixel intensities. A histogram of these values shows a clear dichotomy, which is explained by parts of the image with ink and parts without.
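A minimal sketch of such a histogram in base R (the number of bins is arbitrary):

```r
hist(as.vector(x), breaks = 30, main = "", xlab = "Pixel intensity")
# x has roughly 47 million entries, so this may take a moment
```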
If we think that values below, say, 50 are smudges, we can quickly make them zero using:
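A sketch, working on a copy so the original matrix is preserved; the cutoff of 50 follows the text above, while the name `smudge_free` is my own:

```r
smudge_free <- x
smudge_free[smudge_free < 50] <- 0   # values below 50 treated as smudges
```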
The previous histogram seems to suggest that this data is mostly binary.
A pixel either has ink or does not.
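A sketch of one way to binarize, using the midpoint of the 0 to 255 intensity range as the cutoff (the specific cutoff is an assumption):

```r
bin_x <- (x > 255/2) * 1   # 1 = ink, 0 = no ink
```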
For the standardization task, note that in R, if we subtract a vector from a matrix, the first element of the vector is subtracted from the first row, the second element from the second row, and so on. In mathematical notation:

\[
\begin{bmatrix}
X_{1,1} & \dots & X_{1,p} \\
X_{2,1} & \dots & X_{2,p} \\
 & \vdots & \\
X_{n,1} & \dots & X_{n,p}
\end{bmatrix}
-
\begin{bmatrix}
a_1 \\
a_2 \\
\vdots \\
a_n
\end{bmatrix}
=
\begin{bmatrix}
X_{1,1}-a_1 & \dots & X_{1,p}-a_1 \\
X_{2,1}-a_2 & \dots & X_{2,p}-a_2 \\
 & \vdots & \\
X_{n,1}-a_n & \dots & X_{n,p}-a_n
\end{bmatrix}
\]
The same holds true for other arithmetic operations.
The function `sweep` facilitates this type of operation.
It works similarly to `apply`: its second argument selects the dimension to operate over, and it takes each entry of a vector and applies an arithmetic operation to the corresponding row or column.
Subtraction is the default arithmetic operation.
So, for example, to center each row around the average, we can use:
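A sketch; sweeping over the first dimension subtracts each row's mean from that row, equivalent to the plain arithmetic `x - rowMeans(x)`:

```r
sweep(x, 1, rowMeans(x))
```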
Yet this row-wise behavior means the approach does not work for columns: a vector of column means would be recycled down the rows rather than matched to the columns.
For columns, we can sweep over the second dimension:
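A sketch of column-wise centering and, to complete the standardization, division by the column standard deviations via the `FUN` argument:

```r
x_centered <- sweep(x, 2, colMeans(x))                    # subtract each column's mean
x_scaled   <- sweep(x_centered, 2, colSds(x), FUN = "/")  # divide by each column's SD
# Note: columns with zero standard deviation produce NaN; these are the
# uninformative pixels we removed earlier.
```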
In R, if you add, subtract, multiply, or divide two matrices, the operation is done elementwise.
For example, if two matrices of the same dimensions are stored in `x` and `y`, then `x * y` does not result in matrix multiplication.
Instead, the entry in row \(i\) and column \(j\) of this product is the product of the entries in row \(i\) and column \(j\) of `x` and `y`, respectively.
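A small sketch contrasting elementwise multiplication with matrix multiplication, which uses the `%*%` operator (the matrices `a` and `b` are arbitrary examples):

```r
a <- matrix(1:4, 2, 2)
b <- matrix(5:8, 2, 2)
a * b     # elementwise product
a %*% b   # matrix multiplication
```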