library(dslabs)
5 Vectorization
A European friend has a great job offer from the USA but is concerned about gun violence. The murders dataset in the dslabs package includes data on gun murders for the 50 US states and DC. Use it to prepare a report for your friend to help them decide where to live. Note that your friend likes hiking, so they might prefer the West, and does not like low population density.
5.1 Arithmetics
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
Convert to meters:
heights * 2.54 / 100
[1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780
Difference from the average:
avg <- mean(heights)
heights - avg
[1] 0.3 -6.7 -2.7 1.3 1.3 4.3 -1.7 4.3 -1.7 1.3
Exercise: compute the heights in standardized units.
s <- sd(heights)
(heights - avg) / s
[1] 0.08995503 -2.00899575 -0.80959530 0.38980515 0.38980515 1.28935548
[7] -0.50974519 1.28935548 -0.50974519 0.38980515
# can also use scale(heights)
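As a quick check on the comment above, the following sketch compares the manual computation with scale. It uses only base R. The call to as.numeric is there because scale returns a one-column matrix with attributes rather than a plain vector.

```r
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Manual standardization: subtract the mean, divide by the standard deviation
z_manual <- (heights - mean(heights)) / sd(heights)

# scale() returns a one-column matrix with centering/scaling attributes,
# so convert it to a plain numeric vector before comparing
z_scale <- as.numeric(scale(heights))

all.equal(z_manual, z_scale)
```

Standardized values should have mean 0 and standard deviation 1, which is another easy way to check the result.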
When the operation involves two vectors of the same length, it is applied component-wise:
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
[1] 68.76366 61.95831 66.23369 70.01296 70.00050 73.08083 66.91846 73.23657
[9] 66.98593 69.99302
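Because the example above uses random errors, its output changes from run to run. Here is a small deterministic sketch of the same idea, using made-up vectors a and b (not part of the notes). It also shows that multiplying by a single number, as in the conversion to meters, reuses that number for every entry.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Two vectors of equal length: the i-th entries are added together
a + b        # 11 22 33

# A vector and a single number: the number is reused for every entry,
# which is what happens in heights * 2.54 / 100
a * 2        # 2 4 6
```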
Exercise:
Add a column to the murders dataset with the murder rate per 100,000 people.
library(dslabs)
murders$rate <- with(murders, total / population * 10^5)
5.2 Functions that vectorize
Most arithmetic functions work on vectors:
x <- 1:10
sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851
2^x
[1] 2 4 8 16 32 64 128 256 512 1024
Note that the conditional if-else does not vectorize: it expects a single logical value. A particularly useful alternative is ifelse, a vectorized version. Here is an example:
a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)
[1] NA 1.0 0.5 NA 0.2
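To see the difference between the two, here is a minimal sketch that passes the same vector to if and to ifelse. What if does with a condition longer than one depends on the R version (an error in recent versions, a warning in older ones), so the sketch catches both instead of assuming either.

```r
a <- c(0, 1, 2, -4, 5)

# if() expects a single TRUE or FALSE. Depending on the R version, a longer
# condition either raises an error or triggers a warning and uses only the
# first element, so both cases are caught here.
outcome <- tryCatch(
  {
    if (a > 0) "positive" else "not positive"
    "no complaint"
  },
  warning = function(w) "warning",
  error = function(e) "error"
)
outcome

# ifelse() evaluates the condition entry by entry
res <- ifelse(a > 0, "positive", "not positive")
res
```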
Other logical functions, such as any and all, accept a whole vector and reduce it to a single TRUE or FALSE.
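A small sketch of how any and all treat a logical vector, reusing the vector a from the ifelse example: the comparison is computed entry by entry, and each function then returns one value for the whole vector.

```r
a <- c(0, 1, 2, -4, 5)

a > 0        # FALSE TRUE TRUE FALSE TRUE
any(a > 0)   # TRUE: at least one entry is positive
all(a > 0)   # FALSE: not every entry is positive
```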
5.3 Indexing
Vectorization also works for logical relationships:
ind <- murders$population < 10^6
You can subset a vector using these:
murders$state[ind]
[1] "Alaska" "Delaware" "District of Columbia"
[4] "Montana" "North Dakota" "South Dakota"
[7] "Vermont" "Wyoming"
You can also use vectorization to apply logical operators:
ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
[1] "Alaska" "Montana" "Wyoming"
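A logical index is itself a vector with one entry per element, which makes it easy to count or locate the matches. The sketch below uses made-up toy vectors (population and region here are hypothetical, not taken from murders) so that it runs without any package.

```r
# Hypothetical toy data, not taken from the murders dataset
population <- c(500000, 2000000, 800000, 12000000)
region <- c("West", "West", "South", "West")

ind <- population < 10^6 & region == "West"
ind          # TRUE FALSE FALSE FALSE
sum(ind)     # number of entries meeting both conditions
which(ind)   # their positions
```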
5.4 split
split is a useful function for obtaining indexes grouped by the levels of a factor.
inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]
[1] "Alaska" "Arizona" "California" "Colorado" "Hawaii"
[6] "Idaho" "Montana" "Nevada" "New Mexico" "Oregon"
[11] "Utah" "Washington" "Wyoming"
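To look at what split returns, here is a sketch that runs without dslabs. It assumes the built-in base R objects state.name and state.region, which cover the 50 states but not DC, so positions differ from those in murders.

```r
# state.name and state.region ship with base R (datasets package);
# they cover the 50 states and do not include DC
inds <- split(seq_along(state.region), state.region)

names(inds)             # one list entry per region
lengths(inds)           # number of states in each region
state.name[inds$West]   # states whose region is West
```

The result is a named list of integer vectors, one per factor level, which is why inds$West can be used directly as an index.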
5.5 Functions for subsetting
The functions which and match and the operator %in% are useful for subsetting. Here are some examples:
ind <- which(murders$state == "California")
ind
[1] 5
murders[ind,]
       state abb region population total     rate
5 California  CA   West   37253956  1257 3.374138
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
[1] 33 10 44
c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE TRUE
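One practical difference between match and which combined with %in% is the order of the result. The sketch below illustrates this with the built-in state.name vector (an assumption made so the code runs without dslabs; positions differ from murders because DC is not included).

```r
query <- c("New York", "Florida", "Texas")

# match() returns one position per query entry, in the order of the query
ind_match <- match(query, state.name)
state.name[ind_match]

# which() with %in% returns positions in increasing order,
# so the order of the query is lost
ind_in <- which(state.name %in% query)
state.name[ind_in]

# match() gives NA for entries it cannot find
match("Boston", state.name)
```

So match is the natural choice when the order of the request matters, while %in% is convenient for a simple yes/no membership test.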
5.6 sapply
You can use sapply to apply functions that don’t vectorize, such as this one:
s <- function(n){
  return(sum(1:n))
}
Try it on a vector:
ns <- c(25, 100, 1000)
s(ns)
Warning in 1:n: numerical expression has 3 elements: only the first used
[1] 325
We can use sapply to apply s to each element instead:
sapply(ns, s)
[1] 325 5050 500500
sapply will work on any vector, including lists.
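As a brief illustration of the list case, the sketch below applies two functions to a made-up list (lst is hypothetical) whose elements have different lengths. When every call returns a single value, sapply simplifies the result to a named vector.

```r
# Hypothetical list whose elements have different lengths
lst <- list(a = 1:3, b = c(2.5, 4), c = 10)

sapply(lst, length)   # a = 3, b = 2, c = 1
sapply(lst, sum)      # a = 6, b = 6.5, c = 10
```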
5.7 Exercises
Now we are ready to help your friend. Let’s give them options of places with low murder rates, mountains, and populations that are not too small.
For the following exercises, do not load any packages other than dslabs.
- Show the subset of murders containing states with a rate of less than 1 murder per 100,000. Show all variables.
if (exists("murders")) rm(murders)
library(dslabs)
murders$rate <- with(murders, total/population*10^5)
murders[murders$rate < 1,]
           state abb        region population total      rate
12        Hawaii  HI          West    1360301     7 0.5145920
13         Idaho  ID          West    1567582    12 0.7655102
16          Iowa  IA North Central    3046355    21 0.6893484
20         Maine  ME     Northeast    1328361    11 0.8280881
24     Minnesota  MN North Central    5303925    53 0.9992600
30 New Hampshire  NH     Northeast    1316470     5 0.3798036
35  North Dakota  ND North Central     672591     4 0.5947151
38        Oregon  OR          West    3831074    36 0.9396843
42  South Dakota  SD North Central     814180     8 0.9825837
45          Utah  UT          West    2763885    22 0.7959810
46       Vermont  VT     Northeast     625741     2 0.3196211
51       Wyoming  WY          West     563626     5 0.8871131
- Show the subset of murders containing states in the West of the US with a rate of less than 1 murder per 100,000. Don’t show the region variable.
murders[murders$rate < 1 & murders$region == "West", names(murders) != "region"]
     state abb population total      rate
12  Hawaii  HI    1360301     7 0.5145920
13   Idaho  ID    1567582    12 0.7655102
38  Oregon  OR    3831074    36 0.9396843
45    Utah  UT    2763885    22 0.7959810
51 Wyoming  WY     563626     5 0.8871131
- Show the largest state with a rate less than 1 per 100,000.
dat <- murders[murders$rate < 1,]
dat[which.max(dat$population),]
       state abb        region population total    rate
24 Minnesota  MN North Central    5303925    53 0.99926
- Show the state with a population of more than 10 million with the lowest rate.
dat <- murders[murders$population > 10^7,]
dat[which.min(dat$rate),]
      state abb    region population total    rate
33 New York  NY Northeast   19378102   517 2.66796
- Compute the rate for each region of the US.
indexes <- split(1:nrow(murders), murders$region)
sapply(indexes, function(ind) {
  sum(murders$total[ind])/sum(murders$population[ind])*10^5
})
    Northeast         South North Central          West
     2.655592      3.626558      2.731334      2.656175
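The solution above divides total murders by total population within each region, which is not the same as averaging the state rates. The sketch below illustrates the difference on a made-up data frame (toy and its values are hypothetical, chosen only to make the arithmetic easy to follow).

```r
# Hypothetical toy data: two regions with states of very different sizes
toy <- data.frame(
  region     = c("A", "A", "B", "B"),
  population = c(1e6, 9e6, 2e6, 2e6),
  total      = c(10, 450, 40, 60)
)
toy$rate <- toy$total / toy$population * 10^5

indexes <- split(seq_len(nrow(toy)), toy$region)

# Region rate: total murders over total population
pooled <- sapply(indexes, function(ind) {
  sum(toy$total[ind]) / sum(toy$population[ind]) * 10^5
})

# Average of state rates: treats small and large states alike
averaged <- sapply(indexes, function(ind) mean(toy$rate[ind]))

pooled     # A = 4.6, B = 2.5
averaged   # A = 3.0, B = 2.5
```

The two agree for region B, where both states have the same population, and differ for region A, where the larger state dominates the pooled rate.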
More practice exercises:
- Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector have? Hint: use seq and length.
- Make this data frame:
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
Convert the temperatures to Celsius.
- Compute the following sum:
\[ S_n = 1 + 1/2^2 + 1/3^2 + \dots + 1/n^2 \]
Show that as \(n\) gets bigger, \(S_n\) gets closer to \(\pi^2/6\).
- Use the %in% operator and the predefined object state.abb to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
- Extend the code you used in the previous exercise to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.
- Show all variables for New York, California, and Texas, in that order.