Problem set 2

Published: September 18, 2025

For these exercises, do not load any packages other than dslabs.

Make sure to use vectorization whenever possible.
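As a reminder of what vectorization means in R, here is a minimal sketch that contrasts a loop with the equivalent vectorized expression. The vector `heights_in` and its values are hypothetical and are not part of the problem set.

```r
# Hypothetical data, used only to illustrate vectorization
heights_in <- c(60, 65.5, 70, 72)

# Loop version: converts one element at a time
heights_cm_loop <- numeric(length(heights_in))
for (i in seq_along(heights_in)) {
  heights_cm_loop[i] <- heights_in[i] * 2.54
}

# Vectorized version: the arithmetic applies to every element at once
heights_cm <- heights_in * 2.54

# Comparisons and summaries are vectorized too
tall <- heights_cm > 170   # logical vector, one entry per element
n_tall <- sum(tall)        # TRUE counts as 1, FALSE as 0
```

Both versions produce the same result; the vectorized one is shorter and is the style these exercises ask for.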

What is the sum of the first 100 positive integers? Use the functions seq and sum to compute the sum with R for any n.
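One way to check the output of your code is against the well-known closed form for this sum:

\[ \sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad \text{so for } n = 100: \quad \frac{100 \cdot 101}{2} = 5050 \]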

# Your code here

Load the US murders dataset from the dslabs package and use the function str to examine the structure of the murders object. What column names does the data frame use for its five variables? Then show the subset of murders containing the states with a murder rate below 1 per 100,000 people. Show all variables.

library(dslabs)

str(murders)

'data.frame':   51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...

# Your code here

Show the subset of murders containing the states in the West region of the US with a murder rate below 1 per 100,000 people. Don’t show the region variable.

# Your code here

Show the most populous state with a murder rate below 1 per 100,000 people.

# Your code here

Among the states with a population of more than 10 million, show the one with the lowest murder rate.

# Your code here

Compute the murder rate per 100,000 people for each region of the US.

# Your code here

Create a vector of numbers that starts at 6, does not pass 55, and increases in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector contain? Hint: use seq and length.

# Your code here

Make this data frame:

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", 
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

Convert the temperatures to Celsius.
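Assuming the values in temp are in degrees Fahrenheit (the exercise does not state the unit, but the magnitudes suggest it), the standard conversion is shown below, where \(F\) and \(C\) are symbols introduced here for the two scales. For example, 35 degrees Fahrenheit is about 1.67 degrees Celsius.

\[ C = (F - 32) \times \frac{5}{9}, \qquad (35 - 32) \times \frac{5}{9} \approx 1.67 \]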

# Your code here

Write a function euler that computes the following sum for any \(n\):

\[ S_n = 1 + 1/2^2 + 1/3^2 + \dots + 1/n^2 \]

# Your code here

Show that as \(n\) gets bigger we get closer to \(\pi^2/6\) by plotting \(S_n\) versus \(n\) with a horizontal dashed line at \(\pi^2/6\).

# Your code here

Use the %in% operator and the predefined object state.abb to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?

# Your code here

Extend the code you used in the previous exercise to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.
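The pattern in the hint can be sketched on unrelated data. The example below uses the built-in constant month.abb instead of state.abb; the vector `candidates` is hypothetical.

```r
# Hypothetical candidates checked against the built-in month abbreviations
candidates <- c("Jan", "Feb", "Mai", "Dec")

is_month <- candidates %in% month.abb   # TRUE where the entry is a real abbreviation
not_month <- !is_month                  # ! flips TRUE and FALSE
idx <- which(not_month)                 # positions of the TRUE entries
bad_entry <- candidates[idx]            # the entries that failed the check
```

Here `idx` is 3 and `bad_entry` is "Mai".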

# Your code here

In the murders dataset, use %in% to show all variables for New York, California, and Texas, in that order.

# Your code here

Write a function called vandermonde_helper that for any \(x\) and \(n\), returns the vector \((1, x, x^2, x^3, \dots, x^n)\). Show the results for \(x=3\) and \(n=5\).

# Your code here

Create a vector using:

n <- 10000
p <- 0.5
set.seed(2024-9-6)
x <- sample(c(0,1), n, prob = c(1 - p, p), replace = TRUE)

Compute the length of each stretch of 1s and then plot the distribution of these values. Check to see if the distribution follows a geometric distribution as the theory predicts. Do not use a loop!
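For reference, the theoretical prediction can be written out as follows, assuming independent draws and ignoring the truncation at the two ends of the vector. The symbol \(L\) for the length of a stretch is introduced here for illustration. Once a stretch of 1s has started, each further draw extends it with probability \(p\) and ends it with probability \(1 - p\), so

\[ \Pr(L = k) = p^{k-1}(1 - p), \qquad k = 1, 2, \dots \]

With \(p = 0.5\) this gives \(\Pr(L = k) = 0.5^k\). Note that R's dgeom counts the number of failures before the first success, starting at 0, so the same probability is dgeom(k - 1, prob = 1 - p).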

# Your code here

In the murders dataset, create a logical vector that indicates which states have both a murder rate higher than the national average AND a population greater than 5 million. Then use ifelse to create a character vector that labels these states as “High Crime, High Pop”, states with murder rate higher than average but population ≤ 5 million as “High Crime, Low Pop”, and all other states as “Lower Crime”.
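The combination of & and nested ifelse calls can be sketched on unrelated data. The vectors `score` and `hours` and the labels below are hypothetical and only illustrate the pattern.

```r
# Hypothetical data, not taken from the murders dataset
score <- c(95, 72, 40, 88)
hours <- c(2, 10, 5, 12)

# & works element by element and returns a logical vector
high_and_long <- score > 70 & hours > 8

# Nested ifelse: the inner call handles the elements where the outer test is FALSE
label <- ifelse(high_and_long, "High score, long study",
                ifelse(score > 70, "High score, short study", "Lower score"))
```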

# Your code here

Use order, rank, and sort functions on the murder rates to answer the following: What is the murder rate of the state that ranks 10th in terms of murder rate? Show your work by creating the murder rate vector, then using the appropriate function to find the 10th ranked value.
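As a quick reminder of how the three functions differ, here is a small sketch on a hypothetical vector `x` (not the murder rates).

```r
# Hypothetical vector used to contrast sort, order, and rank
x <- c(31, 4, 15, 92)

sorted <- sort(x)      # values in increasing order: 4 15 31 92
idx <- order(x)        # indices that would sort x:  2 3 1 4
ranks <- rank(x)       # rank of each element of x:  3 1 2 4

# Three equivalent ways to get the second largest value
second_a <- sort(x, decreasing = TRUE)[2]
second_b <- x[order(x, decreasing = TRUE)[2]]
second_c <- x[rank(-x) == 2]
```

All three expressions return 31. Note that rank assigns 1 to the smallest value by default, which is why the last line ranks -x.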

# Your code here

Write a function called compute_harmonic_mean that takes a numeric vector and returns the harmonic mean (which is \(n / \sum_{i=1}^{n} 1/x_i\)). The function should return NA if any values are zero or negative. Test your function on the vector c(1, 2, 4, 8) and show that it returns approximately 2.13.
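The expected test value can be verified by hand. For the vector c(1, 2, 4, 8) we have \(n = 4\), so

\[ \frac{4}{1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8}} = \frac{4}{1.875} = \frac{32}{15} \approx 2.13 \]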

# Your code here

Create a function called safe_divide that takes two arguments x and y and returns their ratio x/y, but returns the string “Cannot divide by zero” when y is zero. Use vectorization so that the function works element-wise on vectors. Test it on the vectors x <- c(10, 20, 30) and y <- c(2, 0, 5).

# Your code here

Using the murders dataset, write a function called classify_state_safety that takes a state name as input and returns a classification based on the murder rate per 100,000 people: “Very Safe” (rate < 1), “Safe” (1 ≤ rate < 3), “Moderate” (3 ≤ rate ≤ 5), or “High Risk” (rate > 5). If the state name is not found, return “State not found”. Test your function on “Vermont”, “Texas”, “California”, and “NotAState”. Then use sapply to classify all states, and use the table function to show how many states fall into each safety category.

# Your code here