This homework is due Tuesday March 8, 2016 at 8PM EST. When complete, submit your code in an R Markdown file and the knitted HTML via GitHub.

Motivation

In 2012 Nate Silver, and other data scientists, predicted the outcome of each state correctly. They did this by aggregating data from many polls to create more precise estimates than what one single poll can provide.

In this homework, we will try to predict the results of the democratic and republican primaries by studying the performance of polls in elections that already occurred and then aggregating results.

Problem 1

The first step in our analysis will be to wrangle the data in a way that will simplify the analysis. Ultimately, we want a table of results with each poll represented by a row and including results for each candidate as well as information about the poll such as name and date.

Problem 1A

Install and load the pollstR package. This package provides functions to access data in the Huffington Post’s database. Read the help file for the pollstr_polls() function and write a function that reads all the polls related to the republican primaries. Name the object race2016. Hint: Visit this webpage to select the right topic and make sure to change the max_pages argument.

##Your code here

library(pollstR)
race2016 <- pollstr_polls(topic = "2016-president-gop-primary", 
                       max_pages = Inf)

Problem 1B

Examine and familiarize yourself with the race2016 object. Note that the questions component has a table with election results. Look at the topic component of the questions component. Create a new table with only the results from the 2016-president-gop-primary and only state (or territory) polls, no national polls. Hint: create a new object called results with the table of results and use dplyr. How many rows are we left with?

##Your code here

library(dplyr)
results <- race2016$questions
results <- filter(results, topic=="2016-president-gop-primary" & 
                    state != "US")
results %>% summarize(n())

##    n()
## 1 4989

Problem 1C

In Problem 1B, we created a table called results with over 4000 rows. Does this mean that we have data for 4000 polls? How many polls did we actually have? Hint: look at the id column and use the group_by command.

##Your code here

## Each poll has an `id`
## We can see the number of groups at top
results %>% group_by(id)

## Source: local data frame [4,989 x 14]
## Groups: id [414]
## 
##                                         question
##                                            (chr)
## 1   2016 Florida Republican Presidential Primary
## 2   2016 Florida Republican Presidential Primary
## 3   2016 Florida Republican Presidential Primary
## 4   2016 Florida Republican Presidential Primary
## 5  2016 Maryland Republican Presidential Primary
## 6  2016 Maryland Republican Presidential Primary
## 7  2016 Maryland Republican Presidential Primary
## 8  2016 Maryland Republican Presidential Primary
## 9  2016 Maryland Republican Presidential Primary
## 10 2016 Maryland Republican Presidential Primary
## ..                                           ...
## Variables not shown: chart (chr), topic (chr), state (chr), subpopulation
##   (chr), observations (int), margin_of_error (dbl), choice (chr), value
##   (int), first_name (chr), last_name (chr), party (chr), incumbent (lgl),
##   id (int)

Problem 1D

Look at the first row of your results table. What date was this poll conducted? Hint: Use the polls component of the race2016 object to find the date.

##Your code here

## this does not have a date
slice(results,1)

##                                       question
## 1 2016 Florida Republican Presidential Primary
##                                          chart                      topic
## 1 2016-florida-presidential-republican-primary 2016-president-gop-primary
##   state              subpopulation observations margin_of_error choice
## 1    FL Likely Voters - Republican          904             3.3   Cruz
##   value first_name last_name party incumbent    id
## 1    14        Ted      Cruz   Rep     FALSE 24051

## we can see that the id is 
the_id <- slice(results,1)$id
the_id

## [1] 24051

## and then find the date for that id in the polls table
race2016$polls %>% filter(id==the_id) %>% select(start_date:end_date)

##   start_date   end_date
## 1 2016-03-07 2016-03-08

Problem 1E

Now examine the candidates in the “choices” column included in results table. Hint: use the table() function. Note that there are several choices that not going to be informative. For example, we have candidates that have dropped out. We also have entries such as No one, No One and No Preference. Filter the results table to include only Rubio and Trump.

##Your code here

##Look at all the choices?
table(results$choice)

## 
##                 Bolton                   Bush                 Carson 
##                      4                    327                    364 
##               Christie                Clinton                   Cruz 
##                    287                      2                    419 
##             Don't Know     Don't know/Refused                Ehrlich 
##                      9                      1                      2 
##                Everson                Fiorina                Gilmore 
##                      1                    248                    118 
##                 Graham                   Gray               Huckabee 
##                    145                      2                    256 
##               Huntsman               Jeb Bush                 Jindal 
##                      1                      1                    139 
##                 Kasich                   King               Martinez 
##                    370                      2                      3 
##              No Answer                 No one                 No One 
##                      6                      3                      2 
##          No Preference                   None      None of the above 
##                      4                     12                      2 
##               Not Sure             Not voting               O'Malley 
##                      2                      2                      1 
##                  Other        Other/Undecided                  Palin 
##                    169                      7                      5 
##                 Pataki                   Paul                  Pence 
##                    123                    273                      3 
##                  Perry                Portman                Refused 
##                     85                      1                     17 
##         Refused (Vol.)                   Rice                 Romney 
##                      3                      2                      1 
##                  Rubio                   Ryan                Sanders 
##                    423                     29                      2 
##               Santorum            Scarborough           Someone else 
##                    236                      2                      3 
##  Someone else/Not sure Someone else/undecided Someone else/Undecided 
##                     20                      2                      1 
##                  Trump            Uncommitted              Undecided 
##                    371                      4                    341 
##       Undecided (Vol.)       Undecided/ Other        Undecided/Other 
##                      3                      1                      1 
##                 Unsure                 Walker          Wouldn't vote 
##                      1                    111                     12 
##          Wouldn't Vote 
##                      2

candidates <- c("Rubio","Trump")
results <- filter(results, choice %in% candidates)

Problem 1F

In our results table, we have one row for each candidate in each poll. Transform the results table to have one row for each poll and columns for each Rubio and Trump. Next, create a column called diff with the difference between Trump and Rubio. Hint: Remove the first_name and last_name columns then use the tidyr function spread().

##Your code here

library(tidyr)
results <- select(results, -first_name, -last_name) %>%
    distinct() %>%
    spread(choice, value) %>% 
    mutate(diff = Trump - Rubio)

Problem 1G

For each poll in the results table, we want to know the start date and the end date of the poll along with the pollster name and the type of poll it was. Hint: This information is in the polls component of race2016. You can select the relevant columns then use the id column to join the tables. One of the join functions in tidyr will do the trick.

##Your code here

results <- race2016$polls %>% 
                select(id, pollster, method, start_date, end_date) %>% 
                right_join(results, by="id")

Problem 1H

Study the type of values in the pollster column. Notice that you have many different values but that certain names commonly appear in these values. For example, consider the name “NBC” in the pollster column. NBC here is the Survey House. Use a join function again to add the survey house to the results table. Rename the column house. Hint: race2016$survey_house has the information you need.

##Your code here

results <- race2016$survey_houses %>% 
                select(id, name) %>% 
                right_join(results, by="id") %>% 
                rename(house=name)

Problem 2

We now have a table with all the information we need. We will now use the results from Iowa, New Hampshire, Nevada and South Carolina to determine how to create a prediction for upcoming primaries.

Problem 2A

Use an internet search to determine the results for the Iowa, New Hampshire, Nevada and South Carolina primaries for the top three candidates. Create a table called actual with this information. Also, create a column with the actual election difference. Use a join function to add this information to our results table.

##Your code here

##In case we need others we include them
Cruz <- c(27.0, 11.6, 22.3, 21.5)
Trump <-    c(24.3, 35.3, 32.4, 46.1)
Rubio   <-  c(23.1, 10.5, 22.4, 24.0)
Kasich <- c(1.9, 15.8, 3.6, 7.6)
Carson <- c(9.3, 2.3, 4.8, 7.2)

actual <- data.frame( cbind(Trump, Rubio, Cruz, Kasich, Carson))
actual <- select(actual, Trump, Rubio)
names(actual) <- paste("actual",names(actual), sep="_")

actual$actual_diff <- actual[,1]-actual[,2]

actual$state <- c("IA", "NH", "NV", "SC")
actual$election_date <- as.Date(c("2016-02-01","2016-02-09","2016-02-20","2016-02-23"))

results <- left_join(results, actual, by="state")

Problem 2B

Create boxplots of the poll results for Trump in Iowa stratified by the pollster survey house for polls having more than 4 total results. Add a horizontal line with the actual results. Hint: Use the group_by, mutate, filter and ungroup functions in dplyr for the filtering step.

##Your code here

library(ggplot2)
theme_set(theme_bw())
results %>% 
  filter(!is.na(Trump) & state == "IA") %>%
  group_by(house) %>% 
  mutate( number = n()) %>% 
  filter( number > 4) %>% ungroup %>%
  ggplot( aes(house, Trump, fill=house)) +
  geom_boxplot() +
  geom_hline(aes( yintercept = actual_Trump)) + 
    facet_wrap(~state)

Problem 2C

Using the poll results for Trump in Iowa, compute the standard deviation for the results from each pollster house for polls having more than 4 total results. Then, study the typical standard deviation sizes used in these polls. Create a new table with two columns: the observed standard deviation and the standard deviations that theory predicts. For the prediction you have several observations. Pick the smallest one. Which is larger, the observed or the theoretical?

##Your code here

mysd <- function(x) sqrt( mean( (x-mean(x))^2))
results %>% 
  filter(!is.na(Trump) & state=="IA") %>%
  group_by(house) %>% 
  mutate( number = n()) %>% 
  filter( number > 4) %>% 
  summarize( "observedSD" = mysd(Trump),
             "theoreticalSD" = 
                 100*sqrt(unique(actual_Trump)/100*(1-unique(actual_Trump)/100))/
               sqrt(min(observations)))

## Source: local data frame [9 x 3]
## 
##                       house observedSD theoreticalSD
##                       (chr)      (dbl)         (dbl)
## 1          Gravis Marketing   4.531072      2.576981
## 2             Loras College   8.447485      1.919998
## 3            Marist College   5.114685      2.319199
## 4       Monmouth University   5.678028      2.144475
## 5             Opinion Savvy   6.800000      1.825499
## 6 Public Policy Polling (D)   4.068852      1.941517
## 7     Quinnipiac University   6.799586      1.790174
## 8              Selzer & Co.   9.463140      2.144475
## 9                    YouGov   4.118252      1.933608

Problem 2D

Now using the data from Problem 2C, plot the individual values against the time the poll was taken (use the end_date). Repeat this for each of the four states. Use color to denote pollster house. Using this plot, explain why the theory does not match the observed results?

##Your code here

results %>% 
  filter(!is.na(Trump) & state%in%c("IA","NH","SC","NV")) %>%
  group_by(house) %>% 
  mutate( number = n()) %>% 
  filter( number > 4) %>% 
  ungroup %>%
  ggplot( aes(end_date, Trump, col=house)) + geom_point() + 
    geom_hline(aes(yintercept=actual_Trump)) + facet_wrap(~state)

## The parameter being estimated is changing with time

Problem 2E

Consider the Trump - Rubio difference. For each poll in IA, NH, SC and NV, compute the error between the prediction and actual election results. Use exploratory data analysis to get an idea of how time and pollster impacts accuracy.

##Your code here

tmp <- results %>% 
            filter(!is.na(diff) & state%in%c("IA","NH","SC","NV")) %>%
            mutate(error=diff - actual_diff) %>%
            group_by(house) %>% 
            mutate( number = n()) %>% 
            filter( number > 9) %>% 
            ungroup 

tmp %>% 
    ggplot( aes(end_date, error, col=house, pch=state)) + geom_point()

tmp %>% 
    ggplot( aes(house, error, fill=house)) + geom_boxplot()

Problem 2F

For polls from IA, NH, and SC, aggregate all polls from within 1 week of the election (use the start_date to determine cutoff) to provide a 95% confidence interval for the difference between Trump and Rubio. Compare the following two approaches: (1) the method that assumes that all variance comes from sampling error and (2) the approach that estimates variance empirically.

##Your code here

# assumes all variance comes from sampling error
results %>% 
    filter(!is.na(diff) & state%in%c("IA","NH","SC") & 
               election_date - start_date <= 7) %>%
    group_by(state) %>% 
    summarize(p_hat = mean(diff),
              sd = 100 * 2 * sqrt(p_hat / 100 * (1 - p_hat / 100)) / sqrt(min(observations)),
              lower = p_hat - qnorm(0.975) * sd,
              upper = p_hat + qnorm(0.975) * sd)

## Source: local data frame [3 x 5]
## 
##   state     p_hat       sd    lower    upper
##   (chr)     (dbl)    (dbl)    (dbl)    (dbl)
## 1    IA  7.888889 3.123099 1.767727 14.01005
## 2    NH 16.777778 3.927920 9.079197 24.47636
## 3    SC 12.000000 3.245561 5.638817 18.36118

# estimates variance empirically
results %>% 
    filter(!is.na(diff) & state%in%c("IA","NH","SC") & 
               election_date - start_date <= 7) %>%
    group_by(state) %>% 
    summarize(p_hat = mean(diff),
              sd = mysd(diff),
              lower = mean(diff) - qnorm(0.975) * mysd(diff)/sqrt(n()),
              upper = mean(diff) + qnorm(0.975) * mysd(diff)/sqrt(n()))

## Source: local data frame [3 x 5]
## 
##   state     p_hat       sd     lower    upper
##   (chr)     (dbl)    (dbl)     (dbl)    (dbl)
## 1    IA  7.888889 7.046582  3.285206 12.49257
## 2    NH 16.777778 3.659926 15.087009 18.46855
## 3    SC 12.000000 4.774935  7.814662 16.18534

Problem 3

Before seeing any polls my prior belief is that Rubio will beat Trump in Florida. If I were to quantify this belief I would say that the distribution of the Trump - Rubio was normal with mean $\mu=-20$ percent and standard deviation $\tau=10$. Let’s call the difference $\theta$. Then

\[ \theta \sim N( \mu, \tau^2) \]

Problem 3A

Under my prior belief, what is the chance that Trump would beat Rubio in Florida.

##Your code here

1 - pnorm(0,-20,10)

## [1] 0.02275013

Problem 3B

Consider the latest 25 Florida polls. Assume the poll results for the difference are normal distributed with mean $\theta$ and standard deviation $\sigma$. Provide an estimate for $\theta$ and an estimate of the standard deviation $\sigma$.

##Your code here

ans <- results %>% 
        filter(state=="FL" & !is.na(diff)) %>% 
        arrange(desc(start_date)) %>% 
        slice(1:25) %>% 
        summarize(theta_hat = mean(diff), sigma_hat = sd(diff))

\[ \hat{\theta} \sim N( \theta, \sigma/ \sqrt{25})\]

Now use the Central Limit Theorem to construct a confidence interval.

##Your code here

ans$theta_hat + c(-1,1)*ans$sigma_hat/sqrt(25)*qnorm(0.975)

## [1] 13.26329 18.97671

Problem 3C

Combine these two results to provide the mean and standard deviation of a posterior distribution for $\theta$.

##Your code here

# Assuming we know the variance(using the value found from data) posterior distribution
# should be normal with the following parameters:
mu_prior <- -20
sd_prior <- 10
N <- 25
B <- (1/sd_prior^2) / (N / ans$sigma_hat^2 + 1/sd_prior^2)
sd_post <- sqrt( (1 / sd_prior^2 + N / ans$sigma_hat^2)^(-1) )
mu_post <- mu_prior * B + ans$theta_hat * (1-B)
mu_post

## [1] 15.36863

sd_post

## [1] 1.442293

Problem 3D

Use the result form Problem 3C to provide your estimate of Trump beating Rubio in Florida.

##Your code here

1 - pnorm(0, mu_post, sd_post)

## [1] 1

Problem 4

Use the poll data as well as the results from Super Tuesday (March 1st) and other election results that happen before the deadline to make predictions for each remaining primary. Then use these results to estimate the probability of Trump winning the republican nomination. Justify your answer with figures, statistical arguments, and Monte Carlo simulations.

It will help to learn about how delegates are assigned. Here is the manual

Homework 3: Is Donald Trump going to win the republican nomination?