Problem set 7

Published

November 5, 2025

In this problem set, you will build a statistical model to analyze the 2024 U.S. presidential election using the same polling data that was available to forecasters before the election. You’ll learn the methods used by election forecasters to predict both the popular vote margin and the electoral college outcome for the Harris-Trump race.

Your analysis will involve cleaning and processing polling data, handling the complexities of different poll types and populations, and dealing with missing data for states where few or no polls are conducted. You’ll use both frequentist and Bayesian approaches to quantify uncertainty in your predictions.

This exercise will give you hands-on experience with real-world data science challenges including data cleaning, model selection, uncertainty quantification, and the practical difficulties of election forecasting. At the end, you’ll be able to compare your forecasts to the actual election results to evaluate your model’s performance.

Important: Do not look up the actual 2024 election results until after you complete your analysis. The goal is to understand what forecasters could reasonably predict given the available data.

Read in the data provided here:

url <- "https://raw.githubusercontent.com/datasciencelabs/2025/main/data/president_polls.csv"

Examine the data frame paying particular attention to the poll_id, question_id, population, and answer. Note that some polls have more than one question based on different population types.

library(tidyverse)
library(rvest)
raw_dat <- ### Your code here

Note: The candidate information is stored in the answer column, which contains values like “Harris”, “Trump”, etc.

Polls are based on either likely voters (lv), registered voters (rv), all voters (a), or voters (v). Polls based on ‘voters’ are exit polls. We want to remove these because exit polls are too old or might be biased due to differences in the likelihood of early voting by party. We prefer likely voter (lv) polls because they are more predictive than registered voter (rv) polls, which in turn are more predictive than all voter (a) polls. Remove the exit polls (v) and then redefine population to be a factor ordered from best to worst predictive power: (lv, rv, a). You should also remove hypothetical polls (if any) and convert the date columns to date objects. Name the resulting data frame dat.

dat <- raw_dat |> 
## Your code here
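As a rough illustration of the cleaning steps described above, here is a minimal sketch on a toy table. The toy column names (including a logical `hypothetical` column) and the month/day/year date format are assumptions; check them against the real file before adapting anything.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for raw_dat; columns and date format are assumptions
toy_raw <- tibble(
  poll_id = c(1, 2, 3, 4),
  population = c("rv", "lv", "v", "a"),
  hypothetical = c(FALSE, FALSE, FALSE, TRUE),
  start_date = c("10/1/2024", "10/3/2024", "11/5/2024", "9/15/2024"),
  end_date = c("10/3/2024", "10/5/2024", "11/5/2024", "9/17/2024")
)

toy_dat <- toy_raw |>
  # drop exit polls (v) and hypothetical match-ups
  filter(population != "v", !hypothetical) |>
  mutate(
    # level order encodes preference: lv first, then rv, then a
    population = factor(population, levels = c("lv", "rv", "a")),
    # parse month/day/year strings into Date objects
    across(c(start_date, end_date), mdy)
  )
```

Because the levels are listed from most to least preferred, sorting by population later puts likely voter rows first.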

Some polls asked more than one question. So if you filter to one poll ID in our dataset, you might see more than one question ID associated with the same poll. The most common reason for this is that they asked a head-to-head question (Harris versus Trump) and, in the same poll, a question about all candidates. We want to prioritize the head-to-head questions.

Add a column that tells us, for each question, how many candidates were mentioned in that question.

Add a new column n to dat that provides the number of candidates mentioned for each question. For example, the relevant columns of your final table will look something like this:

`poll_id`	`question_id`	`answer`	`n`
1	1	Harris	2
1	1	Trump	2
1	2	Harris	3
1	2	Trump	3
1	2	Stein	3

dat <- dat |> 
## Your code here
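One possible way to count candidates per question is a grouped mutate, sketched here on a toy table that mirrors the example above. The toy data is illustrative only, and this is one of several approaches; `add_count(question_id)` would give the same column.

```r
library(dplyr)

# Toy table mirroring the example: question 1 is head-to-head,
# question 2 lists three candidates
toy <- tibble(
  poll_id = c(1, 1, 1, 1, 1),
  question_id = c(1, 1, 2, 2, 2),
  answer = c("Harris", "Trump", "Harris", "Trump", "Stein")
)

toy_n <- toy |>
  group_by(question_id) |>
  mutate(n = n()) |>   # number of rows (candidates) in each question
  ungroup()
```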

We are going to focus on the Harris versus Trump comparison. Redefine dat to include only the rows providing information for Harris and Trump. Then pivot the dataset so that the percentages for Harris and Trump are in their own columns. Note that for the pivot to work, you will have to remove the columns whose values differ between the Harris and Trump rows of the same question. To do this, keep only the columns you are pivoting along with poll_id, question_id, state, pollster, start_date, end_date, numeric_grade, sample_size, and the population and n columns, which later steps rely on. Once you accomplish the pivot, add a column called spread with the difference between Harris and Trump.

Note that the values stored in spread are estimates of the popular vote difference:

spread = % of the popular vote for Harris - % of the popular vote for Trump

However, for the calculations in the rest of the problem set to be consistent with the sampling model we have been discussing in class, save spread as a proportion, not a percentage.

dat <- dat |>
## Your code here
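The pivot step can be sketched on a toy table as follows. The percentage column is assumed to be called `pct` here; that name is a guess, so confirm it in the real data.

```r
library(dplyr)
library(tidyr)

# Toy long-format table; `pct` is an assumed column name
toy <- tibble(
  question_id = c(1, 1, 2, 2, 2),
  answer = c("Harris", "Trump", "Harris", "Trump", "Stein"),
  pct = c(49, 47, 46, 45, 2)
)

toy_wide <- toy |>
  filter(answer %in% c("Harris", "Trump")) |>
  # one row per question, one column per candidate
  pivot_wider(names_from = answer, values_from = pct) |>
  # divide by 100 so the spread is a proportion, not a percentage
  mutate(spread = (Harris - Trump) / 100)
```

Any column that differs between the two candidate rows of a question (and is not removed first) would keep the rows from collapsing into one, which is why the set of retained columns matters.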

Note that some polls have multiple questions. We want to keep only one question per poll. We will keep likely voter (lv) polls when available, and prefer registered voter (rv) polls over all voter (a) polls. If more than one question was asked in one poll, take the most targeted question (smallest n). Save the resulting table as dat. Note that after you do this, each row will represent exactly one poll/question, so you can remove n, poll_id and question_id.

dat <- dat |>
## Your code here
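A hedged sketch of the selection rule, on toy data: sort so the preferred row comes first, then keep the first row of each poll. It assumes population is already a factor with levels ordered lv, rv, a; the toy values are made up.

```r
library(dplyr)

toy <- tibble(
  poll_id = c(1, 1, 1, 2, 2),
  question_id = c(10, 11, 12, 20, 21),
  population = factor(c("rv", "lv", "lv", "a", "rv"),
                      levels = c("lv", "rv", "a")),
  n = c(2, 3, 2, 2, 2),
  spread = c(0.01, 0.03, 0.02, -0.01, 0.00)
)

toy_one <- toy |>
  arrange(population, n) |>   # best population first, then fewest candidates
  group_by(poll_id) |>
  slice_head(n = 1) |>        # keep the top-ranked question in each poll
  ungroup() |>
  select(-n, -poll_id, -question_id)
```

In the toy table, poll 1 keeps its head-to-head likely voter question and poll 2 keeps its registered voter question.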

Separate dat into two data frames: one with popular vote polls and one with state level polls. Call them popular_vote and polls respectively.

popular_vote <- ## Your code here
polls <- ## Your code here

For the popular vote, plot the spread reported by each poll against start date for polls starting after July 21, 2024. Rename all the pollsters with fewer than 5 polls during this period as Other. Use color to denote pollster. Make separate plots for likely voters and registered voters. Do not use all voter polls (a). Use geom_smooth with method loess to show a curve going through the points. You can control how adaptive the curve is through the span argument.

popular_vote |> 
  filter(start_date > make_date(2024, 7, 21) & population != "a") |>
## Your code here
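The lumping and smoothing ideas can be sketched on simulated data. Everything in the toy table is invented; `fct_lump_min` from forcats is one way to relabel rare pollsters, and the span value of 0.5 is an arbitrary choice for illustration.

```r
library(dplyr)
library(forcats)
library(ggplot2)

set.seed(1)
# Simulated polls: pollster C has only 3 polls and should become "Other"
toy <- tibble(
  start_date = as.Date("2024-08-01") + 0:59,
  pollster = rep(c("A", "B", "C"), times = c(30, 27, 3)),
  population = rep(c("lv", "rv"), times = 30),
  spread = rnorm(60, mean = 0.02, sd = 0.02)
) |>
  # lump after filtering to the period, so counts refer to that period
  mutate(pollster = fct_lump_min(pollster, min = 5, other_level = "Other"))

p <- ggplot(toy, aes(start_date, spread)) +
  geom_point(aes(color = pollster)) +
  # color is mapped only in geom_point, so one curve is fit per panel
  geom_smooth(method = "loess", span = 0.5, formula = y ~ x) +
  facet_wrap(~population)
```

Smaller span values make the curve follow local changes more closely; larger values give a smoother trend.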

To show the pollster effect, make boxplots for the spread for each popular vote poll. Include only likely voter polls starting after July 21, 2024. Rename all the pollsters with fewer than 5 polls during that time period as Other.

popular_vote |> 
  filter(start_date > make_date(2024, 7, 21) & population == "lv") |>
## Your code here

Compute a prediction and a confidence interval for the popular vote spread. Include the code you used to create your confidence interval below:

## Your code here
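One simple way to form an interval, sketched on made-up spreads, is to treat the polls in a chosen window as independent draws with a common expected value and use a t-based interval for their mean. That independence assumption is a simplification (it ignores pollster effects and any shared polling bias), so treat this as a starting point rather than the intended answer.

```r
# Made-up spreads (as proportions) from polls in the chosen window
toy_spread <- c(0.02, 0.01, 0.03, 0.00, 0.015, 0.025, 0.01, 0.02)

n_polls <- length(toy_spread)
avg <- mean(toy_spread)                  # point prediction
se <- sd(toy_spread) / sqrt(n_polls)     # standard error of the average
# 95% interval using the t distribution with n_polls - 1 degrees of freedom
ci <- avg + c(-1, 1) * qt(0.975, df = n_polls - 1) * se
```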

We now move on to predicting the electoral votes.

To obtain the number of electoral votes for each state we will visit this website:

url <- "https://state.1keydata.com/state-electoral-votes.php"

We can use the rvest package to download and extract the relevant table:

library(rvest)
h <- read_html(url) |>
  html_table() 

ev <- h[[4]]

Wrangle the data in ev to only have two columns state and electoral_votes. Make sure the electoral vote column is numeric. Add the electoral votes for Maine CD-1 (1), Maine CD-2 (1), Nebraska CD-2 (1), and District of Columbia (3) by hand.

## Your code here
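The wrangling pattern might look like the sketch below. The toy column names `State` and `Electoral Votes` are hypothetical, since the scraped table's actual headers are not shown here; inspect ev before renaming.

```r
library(dplyr)

# Hypothetical stand-in for ev; the real column names may differ
toy_ev <- tibble(
  State = c("Alabama", "Alaska"),
  `Electoral Votes` = c("9", "3")
)

toy_ev_clean <- toy_ev |>
  select(state = State, electoral_votes = `Electoral Votes`) |>
  mutate(electoral_votes = as.numeric(electoral_votes)) |>
  # rows missing from the scraped table can be appended by hand
  bind_rows(tibble(state = "District of Columbia", electoral_votes = 3))
```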

The presidential race in some states is a foregone conclusion. Because there is practically no uncertainty in who will win, polls are not taken. We will therefore assume that the party that won in 2020 will win again in 2024 if no polls are being collected for a state.

Download the following sheet:

library(gsheet)
sheet_url <- "https://docs.google.com/spreadsheets/d/1D-edaVHTnZNhVU840EPUhz3Cgd7m39Urx7HM8Pq6Pus/edit?gid=29622862"
raw_res_2020 <- gsheet2tbl(sheet_url)

Tidy the raw_res_2020 dataset so that you have two columns, state and party, with D and R in the party column to indicate who won in 2020. Add Maine CD-1 (D), Maine CD-2 (R), Nebraska CD-2 (D), and District of Columbia (D) by hand. Save the result to res_2020. Hint: use the row_to_names function from the janitor package.

library(janitor)
res_2020 <- raw_res_2020 |>
## Your code here
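To show what row_to_names does, here is a sketch on a toy sheet whose real header sits in the second row. The layout and the `called` column name are hypothetical; the actual sheet may be organized differently.

```r
library(dplyr)
library(janitor)

# Hypothetical raw sheet: a title row, then the true header row, then data
toy_raw <- data.frame(
  X1 = c("2020 results", "state", "Alabama", "California"),
  X2 = c(NA, "called", "R", "D"),
  stringsAsFactors = FALSE
)

toy_res <- toy_raw |>
  # promote row 2 to column names and drop it along with the rows above it
  row_to_names(row_number = 2) |>
  select(state, party = called)
```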

Decide on a period that you will use to compute your prediction. We will use spread as the outcome. Make sure the outcome is saved as a proportion, not a percentage. Create a results data frame with columns state, avg, sd, n and electoral_votes, with one row per state.

Some ideas and recommendations:

If a state has enough polls, consider a short period, such as a week. For states with few polls you might need to increase the interval to increase the number of polls.
Decide which polls to prioritize based on the population and numeric_grade columns.
You might want to weigh them differently, in which case you might also consider using sample_size.
If you use fewer than 5 polls to calculate an average, your estimate of the standard deviation (SD) may be unreliable. With only one poll, you won't be able to estimate the SD at all. In these cases, consider using the SD from similar states to avoid unusual or inaccurate estimates.

results <- polls |> 
## Your code here
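The recommendations above can be sketched on toy data as follows. The poll values are invented, and the fallback rule (the median SD among states with at least 5 polls) is just one assumption about what "similar states" could mean; it is not prescribed by the assignment.

```r
library(dplyr)

# Invented state polls: one well-polled state, two sparsely polled ones
toy_polls <- tibble(
  state = c(rep("Pennsylvania", 6), rep("Nevada", 2), "Iowa"),
  spread = c(0.01, -0.01, 0.02, 0.00, 0.01, -0.02, 0.01, 0.03, -0.06)
)
toy_ev <- tibble(
  state = c("Pennsylvania", "Nevada", "Iowa"),
  electoral_votes = c(19, 6, 6)
)

toy_results <- toy_polls |>
  group_by(state) |>
  summarize(avg = mean(spread), sd = sd(spread), n = n()) |>
  left_join(toy_ev, by = "state")

# Assumed fallback: median SD among states with at least 5 polls
sd_fallback <- toy_results |> filter(n >= 5) |> pull(sd) |> median()

toy_results <- toy_results |>
  mutate(sd = if_else(n < 5, sd_fallback, sd))
```

With a single poll, `sd()` returns NA, so the replacement step also removes missing values from the sd column.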

Note you will not have polls for all states. Assume that lack of polls implies the state is not in play. Use the res_2020 data frame to compute the electoral votes Harris is practically guaranteed to have.

harris_start <- ## Your code here
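The idea can be sketched with an anti-join on toy tables: take the states with no polls, keep those the Democrats won in 2020, and add up their electoral votes. The three small tables below are stand-ins and cover only four states.

```r
library(dplyr)

toy_res_2020 <- tibble(
  state = c("California", "Wyoming", "Vermont", "Pennsylvania"),
  party = c("D", "R", "D", "D")
)
toy_ev <- tibble(
  state = c("California", "Wyoming", "Vermont", "Pennsylvania"),
  electoral_votes = c(54, 3, 3, 19)
)
# Only Pennsylvania has polls in this toy setup
toy_results <- tibble(state = "Pennsylvania", avg = 0.005)

toy_harris_start <- toy_res_2020 |>
  anti_join(toy_results, by = "state") |>   # states without polls
  filter(party == "D") |>                   # assumed to stay Democratic
  left_join(toy_ev, by = "state") |>
  pull(electoral_votes) |>
  sum()
```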

Use a Bayesian approach to compute posterior means and standard deviations for each state in results. Plot the posterior mean versus the observed average with the size of the point proportional to the number of polls.

## Your code here
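One common formulation, sketched here under stated assumptions, is a normal prior on each state's true spread combined with a normal sampling model for the observed average. The prior mean `mu = 0` and prior SD `tau = 0.04` are placeholder values chosen for illustration, not values given in the assignment, and the toy results table is invented.

```r
library(dplyr)
library(ggplot2)

toy_results <- tibble(
  state = c("A", "B", "C"),
  avg = c(0.02, -0.03, 0.06),
  sd = c(0.03, 0.03, 0.03),
  n = c(9, 4, 1)
)

mu <- 0       # assumed prior mean
tau <- 0.04   # assumed prior SD

toy_post <- toy_results |>
  mutate(
    sigma = sd / sqrt(n),                    # standard error of the average
    B = sigma^2 / (sigma^2 + tau^2),         # shrinkage weight toward the prior
    posterior_mean = B * mu + (1 - B) * avg,
    posterior_sd = sqrt(1 / (1 / sigma^2 + 1 / tau^2))
  )

p <- ggplot(toy_post, aes(avg, posterior_mean, size = n)) +
  geom_point() +
  geom_abline()   # identity line: points off it show shrinkage
```

States with fewer polls have a larger standard error, so their averages are pulled more strongly toward the prior mean.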

Compute a prediction and a confidence interval for Harris’ electoral votes. Include the code you used to create your estimate and interval below:

## Your code here
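A Monte Carlo simulation is one way to turn state-level posteriors into an electoral vote prediction, sketched below on invented inputs. The baseline of 200 electoral votes, the posterior values, and the shared bias term with SD 0.02 are all hypothetical; because the draws come from posterior distributions, the resulting interval is strictly a credible interval.

```r
library(dplyr)

set.seed(2024)
# Invented posteriors for three polled states
toy_post <- tibble(
  state = c("A", "B", "C"),
  posterior_mean = c(0.01, -0.02, 0.00),
  posterior_sd = c(0.02, 0.02, 0.03),
  electoral_votes = c(19, 16, 6)
)
toy_harris_start <- 200   # hypothetical votes from states without polls
bias_sd <- 0.02           # hypothetical SD of a polling bias shared by all states

n_sim <- 10000
harris_ev <- replicate(n_sim, {
  bias <- rnorm(1, 0, bias_sd)   # one shared shift per simulated election
  sim <- rnorm(nrow(toy_post), toy_post$posterior_mean,
               toy_post$posterior_sd) + bias
  # Harris wins the electoral votes of states with a positive simulated spread
  toy_harris_start + sum(toy_post$electoral_votes[sim > 0])
})

prediction <- mean(harris_ev)
interval <- quantile(harris_ev, c(0.025, 0.975))
```

The shared bias term makes state outcomes correlated across simulations; leaving it out would treat state polling errors as independent and typically narrows the interval.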