5  Vectorization

A European friend has a great job offer from the USA but is concerned about gun violence.

The murders dataset in the dslabs package includes data on gun murders for the 50 US states and DC. Use it to prepare a report for your friend to help them decide where to live. Note that your friend likes hiking, so they might prefer the West. Your friend also does not like low population density.

library(dslabs)

5.1 Arithmetic

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

Convert to meters:

heights * 2.54 / 100
 [1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780

Difference from the average:

avg <- mean(heights)
heights - avg 
 [1]  0.3 -6.7 -2.7  1.3  1.3  4.3 -1.7  4.3 -1.7  1.3

Exercise: compute the heights in standardized units, that is, subtract the average and divide by the standard deviation.

s <- sd(heights)
(heights - avg) / s
 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515
# can also use scale(heights)
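
As a quick check of the comment above, the sketch below compares the manual calculation with scale. It assumes base R only; the names z_manual and z_scale are introduced here for illustration. Note that scale returns a one-column matrix with extra attributes, so it is converted to a plain vector before comparing.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Manual standardization: subtract the mean, divide by the standard deviation
z_manual <- (heights - mean(heights)) / sd(heights)

# scale() returns a one-column matrix, so convert it to a plain vector
z_scale <- as.vector(scale(heights))

# TRUE if both approaches agree up to floating-point tolerance
all.equal(z_manual, z_scale)
```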

When the operation involves two vectors, R applies it component-wise:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
 [1] 68.76366 61.95831 66.23369 70.01296 70.00050 73.08083 66.91846 73.23657
 [9] 66.98593 69.99302
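
The two vectors above have the same length. As a small sketch of what base R does when they do not, the shorter vector is recycled to match the longer one. The values here are made up for illustration.

```r
# Made-up vectors for illustration only
x <- c(1, 2, 3, 4)
y <- c(10, 20)

# Same length: element i of the result is x[i] + x[i]
x + x

# Different lengths: y is recycled to 10, 20, 10, 20 before adding
x + y
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking lengths when a component-wise result looks surprising.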

Exercise:

Add a column to the murders dataset with the murder rate per 100,000 people.

library(dslabs)
murders$rate <- with(murders, total / population * 10^5)

5.2 Functions that vectorize

Most arithmetic functions work on vectors:

x <- 1:10
sqrt(x)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
log(x)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
2^x
 [1]    2    4    8   16   32   64  128  256  512 1024

Note that the if-else conditional does not vectorize: it expects a single TRUE or FALSE. A particularly useful alternative is ifelse, a vectorized version. Here is an example:

a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)
[1]  NA 1.0 0.5  NA 0.2
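
To see the difference, the sketch below passes the same vector to if and to ifelse. It assumes base R only. Depending on the R version, a condition of length greater than one in if is either an error (recent versions) or a warning that only the first element was used (older versions), so the call is wrapped in tryCatch to handle both cases.

```r
a <- c(0, 1, 2, -4, 5)

# `if` expects a single TRUE or FALSE, so a length-5 condition is a problem
if_result <- tryCatch(
  if (a > 0) "positive" else "not positive",
  error   = function(e) "if failed: condition has length > 1",
  warning = function(w) "if warned: only the first element was used"
)
if_result

# `ifelse` evaluates the condition for every element
ifelse_result <- ifelse(a > 0, "positive", "not positive")
ifelse_result
```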

Other logical functions, such as any and all, take an entire logical vector and summarize it as a single TRUE or FALSE.
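
A short sketch of any and all on the same example vector, assuming base R only. Each call receives the whole logical vector produced by the comparison and returns one value.

```r
a <- c(0, 1, 2, -4, 5)

# Is at least one element negative?
has_negative <- any(a < 0)
has_negative

# Are all elements strictly positive?
all_positive <- all(a > 0)
all_positive
```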

5.3 Indexing

Vectorization also works for logical relationships:

ind <- murders$population < 10^6

You can use the resulting logical vector to subset another vector:

murders$state[ind]
[1] "Alaska"               "Delaware"             "District of Columbia"
[4] "Montana"              "North Dakota"         "South Dakota"        
[7] "Vermont"              "Wyoming"             

You can also use vectorization to apply logical operators:

ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
[1] "Alaska"  "Montana" "Wyoming"
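
Because TRUE and FALSE are treated as 1 and 0 in arithmetic, a logical index can also be counted. The sketch below uses a small made-up data frame called toy (not the murders dataset) so that it runs without any packages.

```r
# Made-up data for illustration only
toy <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c("West", "South", "West", "West"),
  population = c(5e5, 8e5, 2e6, 9e5)
)

# TRUE only where both conditions hold
ind <- toy$population < 10^6 & toy$region == "West"

toy$state[ind]  # states meeting both conditions
sum(ind)        # how many rows meet both conditions
mean(ind)       # what proportion of rows do
```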

5.4 split

The split function is useful for getting the indexes of each group defined by a factor.

inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]
 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"    
 [6] "Idaho"      "Montana"    "Nevada"     "New Mexico" "Oregon"    
[11] "Utah"       "Washington" "Wyoming"   
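
To see what split returns, here is a minimal sketch with a made-up factor g and a made-up vector values (both hypothetical, for illustration). The result is a named list with one element per level, each holding the positions that belong to that level.

```r
# Made-up grouping factor for illustration only
g <- factor(c("b", "a", "b", "c", "a"))

inds <- split(seq_along(g), g)
names(inds)  # one list element per level of g
inds$a       # positions where g is "a"
inds$b       # positions where g is "b"

# The same list can be used to pull matching elements from another vector
values <- c(10, 20, 30, 40, 50)
values[inds$b]
```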

5.5 Functions for subsetting

The functions which and match and the operator %in% are useful for subsetting.

Here are some examples:

ind <- which(murders$state == "California")
ind
[1] 5
murders[ind,]
       state abb region population total     rate
5 California  CA   West   37253956  1257 3.374138
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
[1] 33 10 44
c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE  TRUE
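
One difference worth noting is the order of the result. In the sketch below, with made-up vectors, match returns positions in the order of the values being looked up, while which combined with %in% returns them in the order they appear in the vector being searched. A value with no match gives NA in match.

```r
# Made-up vectors for illustration only
table_vals <- c("apple", "banana", "cherry", "date")
wanted     <- c("cherry", "apple", "fig")

pos_match <- match(wanted, table_vals)      # 3 1 NA: follows the order of `wanted`
pos_which <- which(table_vals %in% wanted)  # 1 3: follows the order of `table_vals`
found     <- wanted %in% table_vals         # TRUE TRUE FALSE

pos_match
pos_which
found
```

When rows must come back in a requested order, match is usually the more direct choice.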

5.6 sapply

You can also apply functions that don’t vectorize, like this one:

s <- function(n){
   return(sum(1:n))
}

Try it on a vector:

ns <- c(25, 100, 1000)
s(ns)
Warning in 1:n: numerical expression has 3 elements: only the first used
[1] 325

We can use sapply to apply the function to each element:

sapply(ns, s)
[1]    325   5050 500500

sapply will work on any vector, including lists.
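
As a sketch of the list case, assuming base R only: sapply applies the function to each element of the list and, when every result has length one, simplifies the output to a vector. The list l and the name sum_to_n are made up for illustration.

```r
# Made-up list for illustration only
l <- list(a = 1:3, b = c(10, 20), c = 5)

# Each element is passed to sum() in turn; names are carried over
list_sums <- sapply(l, sum)
list_sums

# The helper from above, rewritten under a hypothetical name
sum_to_n <- function(n) sum(1:n)
vec_sums <- sapply(c(25, 100, 1000), sum_to_n)
vec_sums
```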

5.7 Exercises

Now we are ready to help your friend. Let’s give them options of places with low murder rates, mountains, and populations that are not too small.

For the following exercises, do not load any packages other than dslabs.

  1. Show the subset of murders with states that have a murder rate below 1 per 100,000. Show all variables.
# Start from a fresh copy of murders, without any added columns
if (exists("murders")) rm(murders)
library(dslabs)

murders$rate <- with(murders, total/population*10^5)
murders[murders$rate < 1,]
           state abb        region population total      rate
12        Hawaii  HI          West    1360301     7 0.5145920
13         Idaho  ID          West    1567582    12 0.7655102
16          Iowa  IA North Central    3046355    21 0.6893484
20         Maine  ME     Northeast    1328361    11 0.8280881
24     Minnesota  MN North Central    5303925    53 0.9992600
30 New Hampshire  NH     Northeast    1316470     5 0.3798036
35  North Dakota  ND North Central     672591     4 0.5947151
38        Oregon  OR          West    3831074    36 0.9396843
42  South Dakota  SD North Central     814180     8 0.9825837
45          Utah  UT          West    2763885    22 0.7959810
46       Vermont  VT     Northeast     625741     2 0.3196211
51       Wyoming  WY          West     563626     5 0.8871131
  2. Show the subset of murders with states in the West that have a murder rate below 1 per 100,000. Don’t show the region variable.
# The column index drops the region variable
murders[murders$rate < 1 & murders$region == "West", names(murders) != "region"]
     state abb population total      rate
12  Hawaii  HI    1360301     7 0.5145920
13   Idaho  ID    1567582    12 0.7655102
38  Oregon  OR    3831074    36 0.9396843
45    Utah  UT    2763885    22 0.7959810
51 Wyoming  WY     563626     5 0.8871131
  3. Show the largest state with a rate less than 1 per 100,000.
dat <- murders[murders$rate < 1,]
dat[which.max(dat$population),]
       state abb        region population total    rate
24 Minnesota  MN North Central    5303925    53 0.99926
  4. Show the state with a population of more than 10 million with the lowest rate.
dat <- murders[murders$population > 10^7,]
dat[which.min(dat$rate),]
      state abb    region population total    rate
33 New York  NY Northeast   19378102   517 2.66796
  5. Compute the rate for each region of the US.
indexes <- split(seq_len(nrow(murders)), murders$region)
sapply(indexes, function(ind) {
  sum(murders$total[ind])/sum(murders$population[ind])*10^5
})
    Northeast         South North Central          West 
     2.655592      3.626558      2.731334      2.656175 
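
The solution above sums totals and populations within each region before dividing, rather than averaging the state rates. The two are not the same whenever populations differ, as this sketch with made-up numbers shows.

```r
# Made-up numbers for illustration only: one small and one large state
total      <- c(1, 300)
population <- c(1e5, 1e7)

# Rate for the combined area: pool the counts first, then divide
pooled_rate <- sum(total) / sum(population) * 10^5
pooled_rate

# Average of the two state rates: gives the small state equal weight
mean_of_rates <- mean(total / population * 10^5)
mean_of_rates
```

The pooled rate is pulled toward the large state, which is what a rate for the whole region should reflect.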

More practice exercises:

  1. Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector have? Hint: use seq and length.

  2. Make this data frame:

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", 
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

Convert the temperatures to Celsius.

  3. Compute the following sum:

\[ S_n = 1 + 1/2^2 + 1/3^2 + \dots + 1/n^2 \]

Show that as \(n\) gets bigger, \(S_n\) gets closer to \(\pi^2/6\).

  4. Use the %in% operator and the predefined object state.abb to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?

  5. Extend the code you used in the previous exercise to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.

  6. Show all variables for New York, California, and Texas, in that order.