Problem set 7
In this problem set, you will build a statistical model to analyze the 2024 U.S. presidential election using the same polling data that was available to forecasters before the election. You’ll learn the methods used by election forecasters to predict both the popular vote margin and the electoral college outcome for the Harris-Trump race.
Your analysis will involve cleaning and processing polling data, handling the complexities of different poll types and populations, and dealing with missing data for states where few or no polls are conducted. You’ll use both frequentist and Bayesian approaches to quantify uncertainty in your predictions.
This exercise will give you hands-on experience with real-world data science challenges including data cleaning, model selection, uncertainty quantification, and the practical difficulties of election forecasting. At the end, you’ll be able to compare your forecasts to the actual election results to evaluate your model’s performance.
Important: Do not look up the actual 2024 election results until after you complete your analysis. The goal is to understand what forecasters could reasonably predict given the available data.
- Read in the data provided here:
url <- "https://raw.githubusercontent.com/datasciencelabs/2025/main/data/president_polls.csv"
Examine the data frame, paying particular attention to the poll_id, question_id, population, and answer columns. Note that some polls have more than one question based on different population types.
library(tidyverse)
library(rvest)
raw_dat <- ### Your code here
Note: The candidate information is stored in the answer column, which contains values like “Harris”, “Trump”, etc.
- Polls are based on either likely voters (lv), registered voters (rv), all voters (a), or voters (v). Polls based on ‘voters’ are exit polls. We want to remove these because exit polls are too old or might be biased due to differences in the likelihood of early voting by party. We prefer likely voter (lv) polls because they are more predictive than registered voter (rv) polls, which in turn are more predictive than all voter (a) polls. Remove the exit polls (v) and then redefine population to be a factor ordered from best to worst predictive power: (lv, rv, a). You should also remove hypothetical polls (if any) and convert the date columns to date objects. Name the resulting data frame dat.
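The cleaning steps described above can be sketched on a toy table. This is a minimal illustration, not the solution for the real file: the tibble toy_raw, the logical hypothetical column, and the month/day/year date format are all assumptions, so check them against raw_dat before reusing any of it.

```r
library(tidyverse)

# Toy stand-in for raw_dat. The column names and the m/d/y date
# format are assumptions; inspect the real data before relying on them.
toy_raw <- tibble(
  poll_id      = c(1, 2, 3, 4, 5),
  population   = c("rv", "lv", "v", "a", "lv"),
  hypothetical = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  start_date   = c("10/1/24", "10/3/24", "10/5/24", "10/7/24", "10/9/24")
)

toy_dat <- toy_raw |>
  # drop exit polls and hypothetical match-ups
  filter(population != "v", !hypothetical) |>
  mutate(
    # level order encodes preference: lv is best, a is worst
    population = factor(population, levels = c("lv", "rv", "a")),
    # parse month/day/year strings into Date objects
    start_date = mdy(start_date)
  )
```

Because the factor levels are ordered by preference, a later arrange(population) puts the most predictive poll type first.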
dat <- raw_dat |>
## Your code here
- Some polls asked more than one question. So if you filter to one poll ID in our dataset, you might see more than one question ID associated with the same poll. The most common reason for this is that they asked a head-to-head question (Harris versus Trump) and, in the same poll, a question about all candidates. We want to prioritize the head-to-head questions.
Add a new column n to dat that gives, for each question, the number of candidates mentioned in that question. For example, the relevant columns of your final table will look something like this:
| poll_id | question_id | candidate | n |
|---|---|---|---|
| 1 | 1 | Harris | 2 |
| 1 | 1 | Trump | 2 |
| 1 | 2 | Harris | 3 |
| 1 | 2 | Trump | 3 |
| 1 | 2 | Stein | 3 |
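One way to obtain a per-question count like the one in the table is a grouped mutate. The sketch below rebuilds the toy table above; it assumes the candidate name lives in a column called answer, as noted earlier, and that each candidate appears once per question.

```r
library(tidyverse)

# Toy rows mirroring the example table above
toy <- tibble(
  poll_id     = c(1, 1, 1, 1, 1),
  question_id = c(1, 1, 2, 2, 2),
  answer      = c("Harris", "Trump", "Harris", "Trump", "Stein")
)

toy_n <- toy |>
  group_by(question_id) |>
  mutate(n = n()) |>   # number of rows (candidates) in each question
  ungroup()
```

If question IDs could repeat across polls in your data, grouping by both poll_id and question_id would be the safer choice.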
dat <- dat |>
## Your code here
- We are going to focus on the Harris versus Trump comparison. Redefine dat to only include the rows providing information for Harris and Trump. Then pivot the dataset so that the percentages for Harris and Trump are in their own columns. Note that for the pivot to work you will have to remove some columns. The simplest way to do this is to keep only the columns you are pivoting along with poll_id, question_id, state, pollster, start_date, end_date, numeric_grade, sample_size. Keep population and n as well, since later steps use them. Once you accomplish the pivot, add a column called spread with the difference between Harris and Trump.
Note that the values stored in spread are estimates of the popular vote difference:
spread = % of the popular vote for Harris - % of the popular vote for Trump
However, for the calculations in the rest of the problem set to be consistent with the sampling model we have been discussing in class, save spread as a proportion, not a percentage.
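The pivot and the conversion to a proportion can be illustrated on a small example. The column name pct for the reported percentage is an assumption here; use whatever the real data calls it.

```r
library(tidyverse)

# Toy long-format data: one row per question and candidate.
# `pct` is an assumed name for the percentage column.
toy <- tibble(
  question_id = c(1, 1, 2, 2),
  population  = c("lv", "lv", "rv", "rv"),
  answer      = c("Harris", "Trump", "Harris", "Trump"),
  pct         = c(49, 47, 45, 48)
)

toy_wide <- toy |>
  filter(answer %in% c("Harris", "Trump")) |>
  # every column other than answer and pct identifies a row
  pivot_wider(names_from = answer, values_from = pct) |>
  # divide by 100 so that spread is a proportion
  mutate(spread = (Harris - Trump) / 100)
```

Any leftover column that differs between the Harris and Trump rows of the same question would split that question across two rows with missing values, which is why the extra columns have to be dropped first.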
dat <- dat |>
## Your code here
- Note that some polls have multiple questions. We want to keep only one question per poll. We will keep likely voter (lv) polls when available, and prefer registered voter (rv) over all voter (a) polls. If more than one question was asked in one poll, take the most targeted question (smallest n). Save the resulting table as dat. Note that after you do this, each row will represent exactly one poll/question, so you can remove n, poll_id, and question_id.
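A possible pattern for this selection is to sort by preference and keep the first row within each poll. The sketch below uses made-up rows and relies on population already being a factor with levels ordered lv, rv, a.

```r
library(tidyverse)

# Toy data: poll 1 has three questions, poll 2 has two
toy <- tibble(
  poll_id     = c(1, 1, 1, 2, 2),
  question_id = c(11, 12, 13, 21, 22),
  population  = factor(c("rv", "lv", "lv", "a", "rv"),
                       levels = c("lv", "rv", "a")),
  n           = c(2, 5, 2, 2, 2),
  spread      = c(0.01, 0.03, 0.02, -0.01, -0.02)
)

toy_one <- toy |>
  arrange(population, n) |>   # best population first, then fewest candidates
  group_by(poll_id) |>
  slice(1) |>                 # keep the top-ranked question of each poll
  ungroup() |>
  select(-n, -poll_id, -question_id)
```

Here poll 1 keeps its head-to-head likely voter question, and poll 2 keeps its registered voter question.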
dat <- dat |>
## Your code here
- Separate dat into two data frames: one with popular vote polls and one with state-level polls. Call them popular_vote and polls, respectively.
popular_vote <- ## Your code here
polls <- ## Your code here
- For the popular vote, plot the spread reported by each poll against start date for polls starting after July 21, 2024. Rename all the pollsters with fewer than 5 polls during this period as Other. Use color to denote pollster. Make separate plots for likely voters and registered voters. Do not use all voter polls (a). Use geom_smooth with method loess to show a curve going through the points. You can control how adaptive the curve is through the span argument.
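The lumping and smoothing steps can be tried out on simulated data before touching the real polls. Everything in this sketch is invented for illustration (pollster names, spreads, and the span value), and fct_lump_min from forcats is one option among several for the renaming.

```r
library(tidyverse)

set.seed(1)
# Simulated polls: pollsters A and B have 14 polls each, C has only 2
toy <- tibble(
  start_date = make_date(2024, 8, 1) + 0:29,
  pollster   = rep(c("A", "B", "C"), times = c(14, 14, 2)),
  population = rep(c("lv", "rv"), 15),
  spread     = rnorm(30, 0.02, 0.02)
)

p <- toy |>
  # pollsters with fewer than 5 polls are relabeled "Other"
  mutate(pollster = fct_lump_min(pollster, min = 5, other_level = "Other")) |>
  ggplot(aes(start_date, spread)) +
  geom_point(aes(color = pollster)) +
  # a smaller span gives a wigglier, more adaptive curve
  geom_smooth(method = "loess", span = 0.75, formula = y ~ x) +
  facet_wrap(~population)
```

Putting color inside geom_point rather than in the top-level aes keeps a single smooth curve per panel instead of one per pollster. Lumping after the date filter matters, because the count of 5 refers to polls within the plotted period.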
popular_vote |>
filter(start_date > make_date(2024, 7, 21) & population != "a") |>
## Your code here
- To show the pollster effect, make boxplots of the popular vote spread for each pollster. Include only likely voter polls starting after July 21, 2024. Rename all the pollsters with fewer than 5 polls during that time period as Other.
popular_vote |>
filter(start_date > make_date(2024, 7, 21) & population == "lv") |>
## Your code here
- Compute a prediction and a confidence interval for the popular vote spread. Include the code you used to create your confidence interval below:
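As a reference point, here is a minimal sketch of a normal-approximation interval built from a handful of made-up spreads. It treats the polls as independent draws around the true spread and ignores any bias shared by all polls, so the interval it produces is probably too narrow for real forecasting.

```r
library(tidyverse)

# Made-up spreads from five polls (proportions, not percentages)
toy <- tibble(spread = c(0.01, 0.03, 0.02, 0.00, 0.04))

est <- toy |>
  summarize(
    avg = mean(spread),
    se  = sd(spread) / sqrt(n())   # standard error of the average
  ) |>
  mutate(
    lower = avg - qnorm(0.975) * se,
    upper = avg + qnorm(0.975) * se
  )
```

With so few polls, replacing qnorm(0.975) with qt(0.975, df = n - 1) would give a somewhat wider interval.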
## Your code here
We now move on to predicting the electoral votes.
- To obtain the number of electoral votes for each state we will visit this website:
url <- "https://state.1keydata.com/state-electoral-votes.php"
We can use the rvest package to download and extract the relevant table:
library(rvest)
h <- read_html(url) |>
html_table()
ev <- h[[4]]
Wrangle the data in ev to have only two columns, state and electoral_votes. Make sure the electoral vote column is numeric. Add the electoral votes for Maine CD-1 (1), Maine CD-2 (1), Nebraska CD-2 (1), and District of Columbia (3) by hand.
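The shape of this wrangling step can be sketched on a stand-in table. The original column names State and Electoral Votes below are hypothetical, as are the three rows; look at the scraped ev to see what the real table contains.

```r
library(tidyverse)

# Stand-in for the scraped table; column names and rows are hypothetical
toy_ev <- tibble(
  State             = c("Maine", "Nebraska", "Vermont"),
  `Electoral Votes` = c("4", "5", "3")
)

# Rows added by hand, using the values given in the text above
extra <- tibble(
  state = c("Maine CD-1", "Maine CD-2", "Nebraska CD-2",
            "District of Columbia"),
  electoral_votes = c(1, 1, 1, 3)
)

ev_clean <- toy_ev |>
  transmute(
    state = State,
    electoral_votes = as.numeric(`Electoral Votes`)  # character to numeric
  ) |>
  bind_rows(extra)
```

A useful sanity check on the real table is that the electoral votes should total 538 once the district rows are accounted for.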
## Your code here
- The presidential race in some states is a foregone conclusion. Because there is practically no uncertainty in who will win, polls are not taken. We will therefore assume that the party that won in 2020 will win again in 2024 if no polls are being collected for a state.
Download the following sheet:
library(gsheet)
sheet_url <- "https://docs.google.com/spreadsheets/d/1D-edaVHTnZNhVU840EPUhz3Cgd7m39Urx7HM8Pq6Pus/edit?gid=29622862"
raw_res_2020 <- gsheet2tbl(sheet_url)
Tidy the raw_res_2020 dataset so that you have two columns, state and party, with D and R in the party column to indicate who won in 2020. Add Maine CD-1 (D), Maine CD-2 (R), Nebraska CD-2 (D), and District of Columbia (D) by hand. Save the result to res_2020. Hint: use the row_to_names function from the janitor package.
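To see what row_to_names does, consider a small table whose real header ended up in the first data row. The layout, the column names dem_pct and rep_pct, and the values below are invented for illustration; the actual sheet may be organized differently.

```r
library(tidyverse)
library(janitor)

# Invented example: the true header sits in row 1 of the data
toy_raw <- tibble(
  X1 = c("state", "State A", "State B"),
  X2 = c("dem_pct", "40", "60"),
  X3 = c("rep_pct", "58", "38")
)

toy_res <- toy_raw |>
  row_to_names(row_number = 1) |>   # promote row 1 to column names
  mutate(
    across(c(dem_pct, rep_pct), as.numeric),
    party = if_else(dem_pct > rep_pct, "D", "R")
  ) |>
  select(state, party)
```

By default row_to_names also drops the row it promoted, so only the two state rows remain.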
library(janitor)
res_2020 <- raw_res_2020 |>
## Your code here
- Decide on a period that you will use to compute your prediction. We will use spread as the outcome. Make sure the outcome is saved as a proportion, not a percentage. Create a results data frame with columns state, avg, sd, n, and electoral_votes, with one row per state.
Some ideas and recommendations:
- If a state has enough polls, consider a short period, such as a week. For states with few polls you might need to increase the interval to increase the number of polls.
- Decide which polls to prioritize based on the population and numeric_grade columns.
- You might want to weight them differently, in which case you might also consider using sample_size.
- If you use fewer than 5 polls to calculate an average, your estimate of the standard deviation (SD) may be unreliable. With only one poll, you won’t be able to estimate the SD at all. In these cases, consider using the SD from similar states to avoid unusual or inaccurate estimates.
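The last recommendation can be sketched as follows. This is one simple option, not the required approach: it uses unweighted averages and replaces the SD of any state with fewer than 5 polls by the median SD of the better-polled states. The toy states, spreads, and electoral votes are made up.

```r
library(tidyverse)

# Made-up polls: State A has five polls, State B only one
toy_polls <- tibble(
  state  = c(rep("State A", 5), "State B"),
  spread = c(0.01, 0.03, 0.02, 0.00, 0.04, -0.05)
)
toy_ev <- tibble(state = c("State A", "State B"),
                 electoral_votes = c(10, 6))

toy_results <- toy_polls |>
  group_by(state) |>
  summarize(avg = mean(spread), sd = sd(spread), n = n()) |>
  # sd() is NA for a single poll; borrow the median SD of states with >= 5 polls
  mutate(sd = if_else(n < 5, median(sd[n >= 5]), sd)) |>
  left_join(toy_ev, by = "state")
```

Restricting the median to states you consider similar, or weighting polls by sample_size or numeric_grade, are natural refinements of this baseline.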
results <- polls |>
## Your code here
- Note that you will not have polls for all states. Assume that lack of polls implies the state is not in play. Use the res_2020 data frame to compute the electoral votes Harris is practically guaranteed to have.
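The logic here is a set difference followed by a sum: take the states with no polls, keep those won by the Democrat in 2020, and add up their electoral votes. A toy version with invented states, where an anti_join finds the unpolled states:

```r
library(tidyverse)

# Invented inputs: only State A has polls
toy_results  <- tibble(state = "State A", avg = 0.01)
toy_res_2020 <- tibble(state = c("State A", "State B", "State C"),
                       party = c("D", "D", "R"))
toy_ev       <- tibble(state = c("State A", "State B", "State C"),
                       electoral_votes = c(10, 6, 4))

toy_start <- toy_res_2020 |>
  anti_join(toy_results, by = "state") |>  # states without any polls
  filter(party == "D") |>                  # assumed to stay Democratic
  left_join(toy_ev, by = "state") |>
  summarize(total = sum(electoral_votes)) |>
  pull(total)
```

In this toy case only State B qualifies, so the starting total is its 6 electoral votes. State names must match exactly across the three tables for the joins to work.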
harris_start <- ## Your code here
- Use a Bayesian approach to compute posterior means and standard deviations for each state in results. Plot the posterior mean versus the observed average with the size of the point proportional to the number of polls.
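A minimal sketch of the calculation, assuming a normal prior on each state's spread and a normal sampling model for the poll average. The prior mean mu = 0 and prior SD tau = 0.04 are placeholder assumptions, not recommended values, and the toy results are invented.

```r
library(tidyverse)

mu  <- 0      # assumed prior mean for the spread
tau <- 0.04   # assumed prior standard deviation

toy_results <- tibble(
  state = c("State A", "State B"),
  avg   = c(0.02, -0.06),
  sd    = c(0.03, 0.03),
  n     = c(9, 1)
)

toy_post <- toy_results |>
  mutate(
    sigma = sd / sqrt(n),                    # standard error of the average
    B = sigma^2 / (sigma^2 + tau^2),         # shrinkage weight on the prior
    posterior_mean = B * mu + (1 - B) * avg,
    posterior_se = sqrt(1 / (1 / sigma^2 + 1 / tau^2))
  )

# Posterior mean against observed average, point size showing number of polls
p <- ggplot(toy_post, aes(avg, posterior_mean, size = n)) +
  geom_point() +
  geom_abline()   # identity line: points off it have been shrunk
```

States with few polls have a larger sigma and therefore a larger B, so their estimates are pulled more strongly toward the prior mean.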
## Your code here
- Compute a prediction and a confidence interval for Harris’ electoral votes. Include the code you used to create your estimate and interval below:
## Your code here
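One way to turn state-level posteriors into an electoral vote prediction is a Monte Carlo simulation. The sketch below is only an outline under strong assumptions: the three states, their posterior summaries, and the starting total of 250 are invented, and state outcomes are simulated independently with no shared polling bias, which probably understates the uncertainty.

```r
library(tidyverse)

set.seed(2024)
# Invented posterior summaries for three contested states
toy_post <- tibble(
  state           = c("State A", "State B", "State C"),
  posterior_mean  = c(0.02, -0.01, 0.005),
  posterior_se    = c(0.02, 0.02, 0.03),
  electoral_votes = c(10, 6, 4)
)
toy_start <- 250   # hypothetical votes from states without polls
n_sim <- 10000

harris_ev <- replicate(n_sim, {
  # draw one spread per state from its posterior distribution
  sim <- rnorm(nrow(toy_post), toy_post$posterior_mean, toy_post$posterior_se)
  # Harris wins a state's votes when her simulated spread is positive
  toy_start + sum(toy_post$electoral_votes[sim > 0])
})

estimate <- mean(harris_ev)
interval <- quantile(harris_ev, c(0.025, 0.975))
```

The proportion mean(harris_ev >= 270) gives the simulated probability of reaching an electoral college majority under these assumptions.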