Vectorization

2024-09-18

Vectorization

  • We will be using the murders dataset in the dslabs package.

  • It includes data on 2010 gun murders for the 50 US states and DC.

  • We will use it to answer questions such as “Which state has the lowest murder rate in the Western part of the US?”

Vectorization

  • First, some simple examples of vectorization.

  • Let’s convert the following heights in inches to meters:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
  • Rather than loop over the elements, we use vectorization:
heights*2.54/100
 [1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780
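For comparison, the loop that vectorization replaces might look like the sketch below. The names meters_loop and meters_vec are introduced here only for illustration; both approaches should produce the same values.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Explicit loop: convert one element at a time
meters_loop <- numeric(length(heights))
for (i in seq_along(heights)) {
  meters_loop[i] <- heights[i] * 2.54 / 100
}

# Vectorized: a single expression operates on the whole vector
meters_vec <- heights * 2.54 / 100

# Check that the two approaches agree
all.equal(meters_loop, meters_vec)
```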

Vectorization

  • We can subtract a constant from each element of a vector.

  • This is convenient for computing residuals or deviations from an average:

avg <- mean(heights)
heights - avg 
 [1]  0.3 -6.7 -2.7  1.3  1.3  4.3 -1.7  4.3 -1.7  1.3

Vectorization

  • This means we can compute standard units like this:
s <- sd(heights)
(heights - avg)/s
 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515
  • There is actually a function, scale, that does this. We describe it soon.

Vectorization

  • If we operate on two vectors, vectorization is componentwise.

  • Here is an example:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
 [1] 69.03952 61.84743 66.01834 69.88322 69.69173 73.23231 66.96888 72.94392
 [9] 66.91477 69.96632

Exercise

  • Add a column to the murders dataset with the murder rate.

  • Use murders per 100,000 persons as the unit.
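One possible approach is sketched below on a small stand-in data frame, so that it runs without the dslabs package. The data frame toy and the column name rate are assumptions made for illustration; the four rows reuse population and total values shown elsewhere in these slides.

```r
# Stand-in for the murders data frame (four rows only)
toy <- data.frame(
  state      = c("California", "New York", "Florida", "Texas"),
  population = c(37253956, 19378102, 19687653, 25145561),
  total      = c(1257, 517, 669, 805)
)

# Vectorized division and multiplication: one rate per row
toy$rate <- toy$total / toy$population * 10^5

toy

# With dslabs loaded, the same pattern would presumably be:
# murders$rate <- murders$total / murders$population * 10^5
```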

Functions that vectorize

  • Most arithmetic functions work on vectors.
x <- 1:10
sqrt(x)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
log(x)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
2^x
 [1]    2    4    8   16   32   64  128  256  512 1024

Functions that vectorize

scale(heights)
             [,1]
 [1,]  0.08995503
 [2,] -2.00899575
 [3,] -0.80959530
 [4,]  0.38980515
 [5,]  0.38980515
 [6,]  1.28935548
 [7,] -0.50974519
 [8,]  1.28935548
 [9,] -0.50974519
[10,]  0.38980515
attr(,"scaled:center")
[1] 68.7
attr(,"scaled:scale")
[1] 3.335

scale provides the same values as the manual computation:

(heights - mean(heights))/sd(heights)
 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

Functions that vectorize

  • But scale coerces to a column matrix:
class(scale(heights))
[1] "matrix" "array" 
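If a plain vector is needed, one option is to wrap the result in as.vector, which drops the matrix dimensions and the attributes that scale adds. This is a minimal sketch; the names z_matrix and z_vector are used only for illustration.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

z_matrix <- scale(heights)        # 10 x 1 matrix with extra attributes
z_vector <- as.vector(z_matrix)   # plain numeric vector

is.matrix(z_matrix)
is.matrix(z_vector)

# Same values as the manual standard-units computation
all.equal(z_vector, (heights - mean(heights)) / sd(heights))
```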

Functions that vectorize

  • The conditional construct if-else does not vectorize: its condition must be a single logical value.

  • Functions such as any and all convert vectors to logicals of length one, as needed for if-else.

  • A particularly useful function is ifelse, a vectorized version of if-else.

  • Here is an example:

a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)
[1]  NA 1.0 0.5  NA 0.2
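To illustrate the role of any and all, here is a small sketch in which a whole vector is reduced to a single TRUE or FALSE before it reaches if-else. The variable msg is introduced only for this example.

```r
a <- c(0, 1, 2, -4, 5)

# any() collapses the logical vector a <= 0 into one value,
# which is what if-else expects as its condition
if (any(a <= 0)) {
  msg <- "some entries are not positive"
} else {
  msg <- "all entries are positive"
}
msg

# all() works the same way
all(a > 0)
```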

Indexing

  • Vectorization also works for logical relationships:
library(dslabs)
ind <- murders$population < 10^6
  • A convenient aspect of this is that you can subset a vector using this logical vector for indexing:
murders$state[ind]
[1] "Alaska"               "Delaware"             "District of Columbia"
[4] "Montana"              "North Dakota"         "South Dakota"        
[7] "Vermont"              "Wyoming"             

Indexing

  • You can also use vectorization to apply logical operators:
ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
[1] "Alaska"  "Montana" "Wyoming"

split

  • split is a useful function to get indexes using a factor:
inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]
 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"    
 [6] "Idaho"      "Montana"    "Nevada"     "New Mexico" "Oregon"    
[11] "Utah"       "Washington" "Wyoming"   
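To see what split returns, it may help to try it on a tiny made-up example. The vectors region and state below are hypothetical stand-ins, not the murders data.

```r
# Five hypothetical entries and their regions
state  <- c("A", "B", "C", "D", "E")
region <- factor(c("West", "South", "West", "Northeast", "South"))

# A list with one element per factor level,
# each holding the positions that belong to that level
inds <- split(seq_along(region), region)
inds

# Use one element of the list as an index
state[inds$West]
```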

Functions for subsetting

  • The functions which and match and the operator %in% are useful for subsetting.

  • To understand how they work it’s best to use examples.

which

ind <- which(murders$state == "California")
ind
[1] 5
murders[ind,]
       state abb region population total
5 California  CA   West   37253956  1257

match

ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
[1] 33 10 44
murders[ind,]
      state abb    region population total
33 New York  NY Northeast   19378102   517
10  Florida  FL     South   19687653   669
44    Texas  TX     South   25145561   805

%in%

ind <- which(murders$state %in% c("New York", "Florida", "Texas"))
ind
[1] 10 33 44
murders[ind,]
      state abb    region population total
10  Florida  FL     South   19687653   669
33 New York  NY Northeast   19378102   517
44    Texas  TX     South   25145561   805
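  • Note this is similar to using match.

  • But note the order is different: match follows the order of the names we supplied, while which returns the indexes in increasing order.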

match versus %in%

c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE  TRUE
match(c("Boston", "Dakota", "Washington"), murders$state)
[1] NA NA 48
match(murders$state, c("Boston", "Dakota", "Washington"))
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA
[51] NA
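The two are closely related: x %in% table is TRUE exactly where match(x, table) finds a position. The sketch below checks this using state.name, a vector of the 50 state names that ships with base R. It is used here as a stand-in for murders$state, which also includes DC.

```r
x <- c("Boston", "Dakota", "Washington")

# %in% answers "is it there?"
in_result <- x %in% state.name

# match answers "where is it?", with NA when there is no match
match_result <- !is.na(match(x, state.name))

# Both routes give the same logical vector
identical(in_result, match_result)
```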

The apply functions

  • The apply functions let us use the concept of vectorization with functions that don’t vectorize.

  • Here is an example of a function that won’t vectorize in a convenient way:

s <- function(n){
   return(sum(1:n))
}
  • Try it on a vector; only the first element, 25, is used:
ns <- c(25, 100, 1000)
s(ns)
[1] 325

The apply functions

  • We can use sapply, one of the apply functions:
sapply(ns, s)
[1]    325   5050 500500
  • sapply will work on any vector, including lists.
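For this particular function there is also a closed form, n(n + 1)/2, which vectorizes directly. The sketch below compares it with sapply; the function name s_closed is introduced only for illustration.

```r
s <- function(n){
   return(sum(1:n))
}
ns <- c(25, 100, 1000)

# Element-by-element via sapply
by_sapply <- sapply(ns, s)

# Closed form: plain arithmetic, so it works on the whole vector
s_closed <- function(n) n * (n + 1) / 2
by_formula <- s_closed(ns)

all.equal(by_sapply, by_formula)
```

When no such formula exists, sapply remains the general-purpose tool.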

The apply functions

  • There are other apply functions:

    • lapply - returns a list. Convenient when the function returns something other than a number.

    • tapply - can apply to subsets defined by a second variable.

    • mapply - multivariate version of sapply.

    • apply - applies a function to the rows or columns of a matrix.

  • We will learn some of these as we go.
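As a preview, here is a minimal sketch with one small call to each of these functions. The inputs are toy values chosen for illustration; the totals and regions reuse four rows shown earlier in these slides.

```r
# lapply: always returns a list, here one sequence per input
seqs <- lapply(c(2, 3), function(n) 1:n)

# tapply: apply sum within the groups defined by a second variable
total  <- c(1257, 517, 669, 805)
region <- c("West", "Northeast", "South", "South")
by_region <- tapply(total, region, sum)

# mapply: iterate over two vectors in parallel
range_sums <- mapply(function(from, to) sum(from:to), c(1, 5), c(3, 6))

# apply: 1 means rows, 2 means columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

seqs
by_region
range_sums
row_sums
col_sums
```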