Distance

Published

December 2, 2024

Keywords

High dimensional data

Distance

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance.
Many machine learning techniques rely on defining distances between observations.
Clustering algorithms search for observations that are similar.
But what does this mean mathematically?

The norm

A point can be represented in polar coordinates:

The norm

If \(\mathbf{x} = (x_1, x_2)^\top\), \(r\) defines the norm of \(\mathbf{x}\).
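
To make the polar representation explicit, write \(\theta\) for the angle between \(\mathbf{x}\) and the horizontal axis (a symbol introduced here for illustration). The Cartesian coordinates and \(r\) are then related by:

\[ x_1 = r\cos\theta, \quad x_2 = r\sin\theta, \quad \mbox{so that} \quad r = \sqrt{x_1^2 + x_2^2} \]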

The norm

The point of defining the norm is to extend the concept of size to higher dimensions.
Specifically, we write the norm for any vector \(\mathbf{x}\) as:

\[ ||\mathbf{x}|| = \sqrt{x_1^2 + x_2^2 + \dots + x_p^2} \]

It is sometimes convenient to work with the squared norm:

\[ ||\mathbf{x}||^2 = x_1^2 + x_2^2 + \dots + x_p^2 \]

The norm

In matrix notation, we define the squared norm like this:

\[ ||\mathbf{x}||^2 = \mathbf{x}^\top\mathbf{x} \]
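
A minimal sketch of these definitions in R, using a small made-up vector `v` (not part of the MNIST data) so the result is easy to check by hand:

```r
# Hypothetical example vector: 3^2 + 4^2 + 12^2 = 169
v <- c(3, 4, 12)

# Norm from the definition: square root of the sum of squared entries
norm_def <- sqrt(sum(v^2))

# Same quantity via the matrix form t(v) %*% v, which returns a 1 x 1 matrix
norm_mat <- sqrt(drop(t(v) %*% v))

# The Frobenius norm of a one-column matrix is the Euclidean norm of the vector
norm_base <- norm(as.matrix(v), type = "F")

c(norm_def, norm_mat, norm_base)   # all three equal 13
```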

Distance

Distance is the norm of the difference:

Distance

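We can see this using the definition we know for two points \(\mathbf{x}_1 = (x_{1,1}, x_{1,2})^\top\) and \(\mathbf{x}_2 = (x_{2,1}, x_{2,2})^\top\):

\[ \mbox{distance} = \sqrt{(x_{1,1} - x_{2,1})^2 + (x_{1,2} - x_{2,2})^2} \]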

Distance

Using the norm, the definition can be extended to any dimension:

\[ \mbox{distance} = || \mathbf{x}_1 - \mathbf{x}_2|| \]
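
As a quick check of this definition, the sketch below computes the distance between two made-up points `a` and `b` in four dimensions (hypothetical values, not MNIST rows), once from the norm of the difference and once with base R's `dist`:

```r
# Two hypothetical points in p = 4 dimensions
a <- c(1, 0, 2, 5)
b <- c(4, 4, 2, 5)

# Distance as the norm of the difference vector: sqrt(3^2 + 4^2 + 0 + 0)
d_norm <- sqrt(sum((a - b)^2))

# dist() computes Euclidean distances between the rows of a matrix by default
d_dist <- as.numeric(dist(rbind(a, b)))

c(d_norm, d_dist)   # both equal 5
```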

Distance

For example, the distance between the first and second observations is computed using all 784 features:

\[ || \mathbf{x}_1 - \mathbf{x}_2 ||^2 = \sum_{j=1}^{784} (x_{1,j}-x_{2,j })^2 \]

Distance

Define the features and labels:

library(dslabs)  # provides read_mnist()
mnist <- read_mnist()
x <- mnist$train$images  # one row per image, 784 pixel features per row
y <- mnist$train$labels

x_1 <- x[6,] 
x_2 <- x[17,] 
x_3 <- x[16,]

Compute the distances:

c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt()

[1] 2319.867 2331.210 2518.969

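The distances agree with the labels: the two 2s (observations 6 and 17) are closer to each other than either is to the 7: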

y[c(6,17,16)]

[1] 2 2 7

Distance

In R, the function crossprod(x) is convenient for computing norms.
It multiplies t(x) by x:

c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt()

[1] 2319.867 2331.210 2518.969
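
One detail worth seeing on a small case: `crossprod` returns a 1 x 1 matrix rather than a plain number, which is why the results above are combined with `c()` before taking the square root. The vector `v` below is a made-up example:

```r
# Hypothetical vector with squared norm 1 + 4 + 4 = 9
v <- c(1, 2, 2)

cp <- crossprod(v)          # treats v as a column vector
dim(cp)                     # 1 1

# Same result as multiplying t(v) by v explicitly
all.equal(cp, t(v) %*% v)   # TRUE

# drop() removes the matrix dimensions, leaving the norm as a plain number
sqrt(drop(cp))              # 3
```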

Distance

We can also compute all the distances at once:

d <- dist(x[c(6,17,16),]) 
d

         1        2
2 2319.867         
3 2331.210 2518.969

The dist function produces an object of class dist:

class(d)

[1] "dist"

There are several machine learning related functions in R that take objects of class dist as input.

Distance

dist objects are similar to, but not the same as, matrices.
To access the entries using row and column indices, we need to coerce them into a matrix.

as.matrix(d)[2,3]

[1] 2518.969
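
The difference is easier to see on a tiny example. The sketch below uses three made-up points in two dimensions (`pts` is hypothetical, not MNIST data) to show that a dist object stores each pairwise distance once, while the coerced matrix is symmetric with zeros on the diagonal:

```r
# Three hypothetical points, one per row
pts <- rbind(c(0, 0), c(3, 4), c(6, 8))
d_small <- dist(pts)

# Only the 3 pairwise distances are stored, not a full 3 x 3 grid
length(d_small)

# Coercing gives a full symmetric matrix that supports [row, column] indexing
m_small <- as.matrix(d_small)
dim(m_small)
isSymmetric(m_small)
diag(m_small)       # distance from each point to itself is 0
m_small[1, 2]       # 5
m_small[1, 3]       # 10
```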

Distance

The image function allows us to quickly see an image of distances between observations.

d <- dist(x[1:300,]) 
image(as.matrix(d))

Distance

If we order the rows and columns of the distance matrix by label, images of the same digit are grouped into blocks along the diagonal:

image(as.matrix(d)[order(y[1:300]), order(y[1:300])])
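
The effect of reordering can be sketched on simulated data. Everything below is an assumption made for illustration: six one-dimensional points, three near 0 (label "a") and three near 10 (label "b"), stored in interleaved order:

```r
set.seed(3)
# Hypothetical labels and values; same-label points are close to each other
lab <- c("b", "a", "b", "a", "a", "b")
vals <- ifelse(lab == "a", 0, 10) + rnorm(6, sd = 0.1)
m <- as.matrix(dist(vals))

# Reorder rows and columns together so same-label points are adjacent
ord <- order(lab)
m_ord <- m[ord, ord]

# The within-label block on the diagonal holds smaller distances
# than the between-label block next to it
max(m_ord[1:3, 1:3]) < min(m_ord[1:3, 4:6])
```

Applying the same permutation to both rows and columns keeps the matrix symmetric, which is why `order(...)` appears twice in the indexing.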

Spaces

Predictor space is a concept that is often used to describe machine learning algorithms.
We can think of all predictors \((x_{i,1}, \dots, x_{i,p})^\top\) for all observations \(i=1,\dots,n\) as \(n\) \(p\)-dimensional points.
The space is the collection of all possible points that should be considered for the data analysis in question, including points we have not observed yet.
In the case of the handwritten digits, we can think of the predictor space as the set of all points \((x_{1}, \dots, x_{p})^\top\) in which each entry \(x_j, \, j = 1, \dots, p\), is between 0 and 255.

Spaces

Some machine learning algorithms also define subspaces.
Neighborhoods, composed of points that are close to a predetermined center, are a commonly defined type of subspace in machine learning.
We do this by selecting a center \(\mathbf{x}_0\), a maximum distance \(r\), and defining the subspace as the collection of points \(\mathbf{x}\) that satisfy:

\[ || \mathbf{x} - \mathbf{x}_0 || \leq r. \]
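
A minimal sketch of selecting such a neighborhood in R. The data `z`, the center `z_0`, and the threshold `r` are all made-up values chosen for illustration:

```r
set.seed(1)
# Hypothetical data: 100 points in p = 3 dimensions with entries in [0, 255]
z <- matrix(runif(300, min = 0, max = 255), nrow = 100, ncol = 3)
z_0 <- c(128, 128, 128)   # assumed center
r <- 100                  # assumed distance threshold

# Subtract the center from every row, then take the norm of each row
dists <- sqrt(rowSums(sweep(z, 2, z_0)^2))

# Logical index of the points that satisfy ||z - z_0|| <= r
in_neighborhood <- dists <= r
sum(in_neighborhood)      # number of points in the neighborhood
```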

Spaces

We can think of this subspace as a multidimensional ball: every point in it is within distance \(r\) of the center, and the points on its surface are exactly \(r\) away.
Other machine learning algorithms partition the predictor space into non-overlapping regions and then make different predictions for each region using the data in the region.
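
The partitioning idea can be sketched with a single simulated predictor. The data, the interval boundaries, and the choice of the per-region average as the prediction are all assumptions made for illustration, not a specific algorithm from these slides:

```r
set.seed(2)
# Hypothetical one-dimensional predictor in [0, 255] and a numeric outcome
pred <- runif(200, min = 0, max = 255)
outcome <- ifelse(pred < 128, 1, 3) + rnorm(200, sd = 0.1)

# Partition the predictor space into four non-overlapping intervals
breaks <- c(0, 64, 128, 192, 255)
region <- cut(pred, breaks = breaks, include.lowest = TRUE)

# One prediction per region: the average outcome of the points that fall in it
region_means <- tapply(outcome, region, mean)

# Predict for new points by looking up the region each one belongs to
new_pred <- c(10, 100, 200)
new_region <- cut(new_pred, breaks = breaks, include.lowest = TRUE)
predictions <- region_means[as.character(new_region)]
predictions
```

Each new point receives the prediction of the region it falls in, so only the data inside that region influence it.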