Vectorization

2024-09-18

Vectorization

We will be using the murders dataset in the dslabs package.
It includes data on 2010 gun murders for the 50 US states and DC.
We will use it to answer questions such as “Which state has the lowest murder rate in the western part of the US?”

Vectorization

First, some simple examples of vectorization.
Let’s convert the following heights in inches to meters:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

Rather than writing a loop, we use vectorization:

heights*2.54/100

 [1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780
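
For comparison, here is a minimal sketch of the explicit loop that the vectorized expression replaces. The name heights_m is introduced here only for illustration.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Loop version: convert one element at a time
heights_m <- numeric(length(heights))
for (i in seq_along(heights)) {
  heights_m[i] <- heights[i] * 2.54 / 100
}

# The vectorized expression should give the same values
all.equal(heights_m, heights * 2.54 / 100)
```

The loop needs a preallocated result vector and an index variable; the vectorized form needs neither.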

Vectorization

We can subtract a constant from each element of a vector.
This is convenient for computing residuals or deviations from an average:

avg <- mean(heights)
heights - avg

 [1]  0.3 -6.7 -2.7  1.3  1.3  4.3 -1.7  4.3 -1.7  1.3

Vectorization

This means we can compute standard units like this:

s <- sd(heights)
(heights - avg)/s

 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

There is actually a function, scale, that does this. We describe it soon.

Vectorization

If we operate on two vectors, vectorization is componentwise.
Here is an example:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error

 [1] 69.03952 61.84743 66.01834 69.88322 69.69173 73.23231 66.96888 72.94392
 [9] 66.91477 69.96632

Exercise

Add a column to the murders dataset with the murder rate.
Use murders per 100,000 persons as the unit.
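
One possible approach, sketched on a small stand-in data frame so that it runs without dslabs. The name murders_small is hypothetical; its four rows are copied from the murders rows printed elsewhere in these slides. With dslabs loaded, the same arithmetic should work on murders directly.

```r
# Stand-in for the murders data frame (four rows taken from these slides)
murders_small <- data.frame(
  state = c("California", "New York", "Florida", "Texas"),
  population = c(37253956, 19378102, 19687653, 25145561),
  total = c(1257, 517, 669, 805)
)

# Murders per 100,000 persons, computed for all rows at once
murders_small$rate <- murders_small$total / murders_small$population * 10^5
murders_small
```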

Functions that vectorize

Most arithmetic functions work on vectors.

x <- 1:10
sqrt(x)

 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278

log(x)

 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851

2^x

 [1]    2    4    8   16   32   64  128  256  512 1024

Functions that vectorize

scale(heights)

             [,1]
 [1,]  0.08995503
 [2,] -2.00899575
 [3,] -0.80959530
 [4,]  0.38980515
 [5,]  0.38980515
 [6,]  1.28935548
 [7,] -0.50974519
 [8,]  1.28935548
 [9,] -0.50974519
[10,]  0.38980515
attr(,"scaled:center")
[1] 68.7
attr(,"scaled:scale")
[1] 3.335

gives the same values as computing standard units directly:

(heights - mean(heights))/sd(heights)

 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

Functions that vectorize

But scale coerces to a column matrix:

class(scale(heights))

[1] "matrix" "array"
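
If a plain numeric vector is needed, one option is to wrap the result in as.vector. This is a sketch; the names z_matrix and z are introduced here for illustration.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# scale() returns a one-column matrix with extra attributes
z_matrix <- scale(heights)

# as.vector() drops the dimensions and the scaled:* attributes
z <- as.vector(z_matrix)
class(z)
```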

Functions that vectorize

The conditional statement if-else does not vectorize.
Functions such as any and all convert logical vectors to logicals of length one, as needed for if-else.
A particularly useful function is the vectorized version, ifelse.
Here is an example:

a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)

[1]  NA 1.0 0.5  NA 0.2
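
To illustrate the role of any and all, here is a small sketch that reduces a logical vector to a single value before passing it to if. The variable msg is introduced only for this example.

```r
a <- c(0, 1, 2, -4, 5)

# any() and all() reduce a logical vector to a single TRUE or FALSE,
# which is what if expects as its condition
if (any(a <= 0)) {
  msg <- "some entries are not positive"
} else {
  msg <- "all entries are positive"
}
msg

all(a > 0)
```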

Indexing

Vectorization also works for logical relationships:

library(dslabs)
ind <- murders$population < 10^6

A convenient aspect of this is that you can subset a vector using this logical vector for indexing:

murders$state[ind]

[1] "Alaska"               "Delaware"             "District of Columbia"
[4] "Montana"              "North Dakota"         "South Dakota"        
[7] "Vermont"              "Wyoming"

Indexing

You can also use vectorization to apply logical operators:

ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]

[1] "Alaska"  "Montana" "Wyoming"

split

split is a useful function to get indexes using a factor:

inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]

 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"    
 [6] "Idaho"      "Montana"    "Nevada"     "New Mexico" "Oregon"    
[11] "Utah"       "Washington" "Wyoming"
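
To see what split returns, here is a sketch on toy data that does not need dslabs. The vectors state and region below are small hand-made stand-ins for the corresponding murders columns.

```r
# Toy data: five states and the region of each
state  <- c("Alaska", "Maine", "Utah", "Texas", "Ohio")
region <- factor(c("West", "Northeast", "West", "South", "North Central"))

# split() groups the positions 1..5 by the levels of region,
# returning a list with one integer vector per level
inds <- split(seq_along(region), region)
inds$West
state[inds$West]
```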

Functions for subsetting

The functions which and match and the operator %in% are useful for subsetting.
To understand how they work it’s best to use examples.

which

ind <- which(murders$state == "California")
ind

[1] 5

murders[ind,]

       state abb region population total
5 California  CA   West   37253956  1257

match

ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind

[1] 33 10 44

murders[ind,]

      state abb    region population total
33 New York  NY Northeast   19378102   517
10  Florida  FL     South   19687653   669
44    Texas  TX     South   25145561   805

%in%

ind <- which(murders$state %in% c("New York", "Florida", "Texas"))
ind

[1] 10 33 44

murders[ind,]

      state abb    region population total
10  Florida  FL     South   19687653   669
33 New York  NY Northeast   19378102   517
44    Texas  TX     South   25145561   805

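Note that this is similar to using match, but the order is different: match returns positions in the order of the names we supplied, while which returns them in increasing order.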

match versus %in%

c("Boston", "Dakota", "Washington") %in% murders$state

[1] FALSE FALSE  TRUE

match(c("Boston", "Dakota", "Washington"), murders$state)

[1] NA NA 48

match(murders$state, c("Boston", "Dakota", "Washington"))

 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA
[51] NA
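
The three calls above differ in what they return. A small self-contained sketch may make this easier to see; the short vector states is a hypothetical stand-in for murders$state.

```r
states <- c("Oregon", "Utah", "Washington")   # stand-in for murders$state
x <- c("Boston", "Dakota", "Washington")

# %in% answers "is each element of x found in states?" with TRUE/FALSE
x %in% states

# match gives the position in states of each element of x, or NA if absent
match(x, states)

# Swapping the arguments gives, for each element of states, its position in x
match(states, x)
```

In short, %in% returns a logical vector the same length as its left argument, while match returns positions in its second argument, so the order of the arguments matters.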

The apply functions

The apply functions let us use the concept of vectorization with functions that don’t vectorize.
Here is an example of a function that won’t vectorize in a convenient way:

s <- function(n){
   return(sum(1:n))
}

Try it on a vector. Only the first element of ns is used (R also issues a warning):

ns <- c(25, 100, 1000)
s(ns)

[1] 325

The apply functions

We can use sapply, one of the apply functions:

sapply(ns, s)

[1]    325   5050 500500

sapply will work on any vector, including lists.

The apply functions

There are other apply functions:
- lapply - returns a list. Convenient when the function returns something other than a number.
- tapply - applies a function to subsets defined by a second variable.
- mapply - multivariate version of sapply.
- apply - applies a function to the rows or columns of a matrix.
We will learn some of these as we go.
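
As a preview, here is a brief sketch of each of these on toy inputs. The vectors values and group and the matrix m are made up for illustration.

```r
ns <- c(25, 100, 1000)
s <- function(n) sum(1:n)

# lapply: same computation as sapply, but the result is a list
lapply(ns, s)

# tapply: apply a function within groups defined by a second variable
values <- c(1, 2, 3, 4, 5, 6)
group  <- c("a", "b", "a", "b", "a", "b")
tapply(values, group, sum)

# mapply: iterate over several vectors in parallel
mapply(function(x, y) x + 2 * y, 1:3, 4:6)

# apply: operate on the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)
apply(m, 2, sum)
```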