This homework is due Sunday March 27, 2016 at 11:59PM EST. When complete, submit your code in an R Markdown file and the knitted HTML via GitHub.

Introduction

Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis about the Oakland Athletics baseball team and its general manager, the person tasked with building the team, Billy Beane. During Billy Bean’s tenure as general manager, ownership cut the budget drastically leaving Billy Bean with one of the lowest payrolls in baseball. Money Ball tells the story of how Billy Bean used analysts to find inefficiencies in the market. Specifically, his team used data science to find low cost players that the data predicted would help the team win. In this lab we will go back to 2002 and try to build a baseball team with a limited budget of $50,000,000. Note that in contrast to that Oakland A’s, the Yankees had a budget of more than double: $125,000,000

We will use the Lahman library as well as the usual dplyr and ggplot2. We also introduce the package broom.

library(Lahman)
library(dplyr)
library(ggplot2)
library(broom)

You can see tables that are available when you load this package by typing

?Lahman

Problem 1 (80% of grade)

Statistics have been used in baseball since its beginnings. Note that Lahman goes back to the 19th century. Batting average, for example, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in (RBI) and stolen bases have been reported and players rewarded for high numbers. However, until Bill James introduced sabermetrics, careful analyses had not been done to determine if these statistics actually help a team win. To simplify the exercise we will focus on scoring runs and ignore pitching and fielding.

Problem 1A

Use the data in the Team table to explore the relationship between stolen bases and runs per game in 1999. Make a plot, fit a regression line, and report the coefficients. If you take the coefficient at face value, how many more runs per game does a team score for every extra SB per game?

## put your code here

Problem 1B

In Problem 1A we observed a positive relationship between scoring runs and stealing bases. However, the estimated coefficient is a random variable. Their is chance involved in scoring run. So how do we know if this observed relationship was not just chance variability?

To examine the variability of this random variable we will consider each year to be a new independent outcome. Use the lm and do functions to fit a linear model to each year since 1961 (when they started playing 162 games per year). Hint: use the function tidy in broom to process the regression in each group so that it can be recombined (see here for examples).

Using this approach what is your estimate of the random variable’s standard error? Is the distribution of the random variable well approximated by a normal distribution? If so, use this to provide a 95% confidence interval for our effect of stolen bases on runs per game. Do you think stolen bases help score runs?

## put your code here

Problem 1C

Even if we didn’t have several years to examine the distribution of our estimate, there is a version of CLT that applies to regression. It turns out that with a large enough sample size, in this case the number of teams, we can construct a confidence interval. Use the function tidy to report a confidence interval for the effect of SB on runs based exclusively on the 1999 data. What are your thoughts now on the effectiveness of recruiting players that can steal bases?

## put your code here

Problem 1D

Back in 2002, bases on balls (BB) did not receive as much attention as other statistics. Repeat the above analysis we performed for SB for BB per game. Do BB have larger effect on runs than SB?

## put your code here

Problem 1E

Association is not causation. It turns out that HR hitters also obtain many BB. We know for a fact that HRs cause runs because, by definition, they produce at least one. We can see this by simply plotting these two statistics for all players with more than 500 plate appearances (BB+AB):

Batting %>%
  filter(yearID >= 1961 & BB+AB > 500 & !is.na(HR) & !is.na(BB)) %>% 
  mutate(HR = factor(pmin(HR, 40))) %>%
  ggplot(aes(HR, BB)) +
  geom_boxplot()

So is the relationship we saw above for BB and Runs due to teams having more HRs also having more BBs? One way we can explore this is by keeping HR fixed and examining the relationship within the strata. For example, if we only look only at teams with 150 home runs, do more BB produce more runs?

We can’t perform this analysis on a single year, because there are not enough teams to obtain strata with more than one or two teams. Instead we will combine all data years since 1961.

Group data by the number of HRs and perform a regression analysis in each stratum to determine the effect of BB per game on runs per game. Use 10th, 20th, … quantiles to split the data into 10 groups. Hint: use the function cut and quantile to create the strata.

## put your code here

Problem 1F

In problem 1E we saw that the effect of BB on runs appears to be about the same in each strata. The relationship between HR and R is also, not surprisingly, linear:

Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, HR = HR / G) %>%
  ggplot(aes(HR, R)) +
  geom_point()

These two combined implies that a sensible linear model says:

\[ \mbox{Runs} = \beta_0 + \beta_{BB} \mbox{BB} + \beta_{HR}{HR} + \varepsilon \]

In this model, we adjust for HRs by including it as linear term. Note that we have already showed data that support this model. In general, simply fitting such a model does not necessarily adjust for a possible confounded. The model must be approximately correct.

We can fit this model like this:

fit <- Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, BB = BB / G, HR = HR / G) %>%
  lm(R ~ BB + HR, data = .)
summary(fit)
## 
## Call:
## lm(formula = R ~ BB + HR, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94455 -0.23722 -0.01219  0.22386  1.25223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.74279    0.06963   25.03   <2e-16 ***
## BB           0.39985    0.02236   17.89   <2e-16 ***
## HR           1.50620    0.04006   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3423 on 1413 degrees of freedom
## Multiple R-squared:  0.6421, Adjusted R-squared:  0.6416 
## F-statistic:  1268 on 2 and 1413 DF,  p-value: < 2.2e-16

Note that the summary shows a very strong HR effect but also a decent BB effect. Now what happens if we include Singles (H-X2B-X3B-HR), Extra bases (doubles X2B and triples X3B), and HR per game. What does the model say about which of these characteristics should receive more weight. Fit the model to each year independently to check for consistency from year to year.

## put your code here

Problem 2 (20% of grade)

In Problem 1 we learned how much BB, singles, extra base hits and home runs help predict runs. Now we want to see how much these costs. Note that batting average, Hits (H) divided by at bats (AB) receive much attention while bases on balls (BB) does not. However, we saw how BB have almost the same effect on runs as singles. Therefore, it is possible that players that receive many BB and do not hit many singles may be undervalued. Before studying this specific question, we will examine if teams can use money to improve.

In general, does spending money help a teams win? Here we will compute the payroll for each team each year. This information is not directly provided. But we have the salary for each player and we also what team they played each year.

Before we get started there is some data wrangling to be done.

## We can use ifelse if you have not seen the revalue function
my_salaries <- Salaries %>%
  mutate(teamID = as.character(plyr::revalue(teamID, c(SFG = "SFN", NYM = "NYN"))))

Problem 2A

Use the mySalaries data to compute each team’s payroll, in millions of dollars, for each team during each year. Save this into an object called payroll. Hints: Group by team and year and then sum all salaries. As a sanity check make a plot of payroll against year with color representing teams. Payroll should be increasing with the New York Yankees (code NYA) having the largest payroll. Consider plotting salaries in the log-scale.

## your code here

Problem 2B

Now add the team’s winning percentage, wins / (wins + losses) for each year to the payroll table. Hints: The Teams table has wins and losses for each team for each year. The dplyr join functions permit you to join by two columns.

## your code here

Problem 2C

Explore the relationship between payroll and winning percentage. Use data visualization to describe what you think is the relationship. Hint: Make scatter plots for, say, 9 different years and add a regression line.

## your code here

Problem 2D

Use the lm function to fit a linear model to the 1999 data. Use the log-transform for the payroll data. According to this fitted model, on average, how many more wins does a team improve every time their budget doubles? Provide a 95% confidence interval.

## put your code here

Problem 2E

Did the Oakland A’s outperform during the Money Ball era? Notice that after Oakland’s original success, other teams started implementing their approach. If in fact they were over-performing then they were winning more than predicted by the regression model.

Fit a linear model to the wins versus standardized data for each year. Then plot the residuals for Oakland. Make the same plot for the Boston Red Sox. (Hint: use the augment function from broom on each linear fit object to extract the residuals- look at the documentation for augment.lm). What do year do you think Oakland started using data science and when did other teams catch up?

## your code here

Problem 2F

Since 2000 which team has performed the best over what was expected given their payroll?

## your code here

Problem 2G

For each of the BB, singles, extra bases, home runs and stolen bases per game how did Oakland rank in 2002?

## your code here

Problem 3 (Bonus)

Now we are going to build a baseball team for the 2002 season. We get to pick one of each of the 9 batting positions DH, C, 1B, 2B, 3B, SS, CF and two outfielders (OF, LF, or RF). We will pick players active in 2002 but you will have to pay each player whatever their maximum salary was during their entire career. You have a total of $50 million. Show us your team and how many runs you think they will produce. Note the number of runs per games of the best teams:

Teams %>%
  filter(yearID == 2002, lgID == "AL") %>%
  mutate(runs_per_game = R / G) %>%
  select(teamID, W, L, runs_per_game) %>%
  arrange(desc(W))
##    teamID   W   L runs_per_game
## 1     NYA 103  58      5.571429
## 2     OAK 103  59      4.938272
## 3     ANA  99  63      5.253086
## 4     MIN  94  67      4.770186
## 5     BOS  93  69      5.302469
## 6     SEA  93  69      5.024691
## 7     CHA  81  81      5.283951
## 8     TOR  78  84      5.018519
## 9     CLE  74  88      4.561728
## 10    TEX  72  90      5.203704
## 11    BAL  67  95      4.117284
## 12    KCA  62 100      4.549383
## 13    DET  55 106      3.571429
## 14    TBA  55 106      4.180124

Some hints:

  1. You can create a table with per-plate-appearance statistics. For example here is such a table for players that started their career after 1961 and retired by after 2002. It also contains the median year in their career.
res <- Batting %>%
  group_by(playerID) %>%
  filter(max(yearID) <= 2002 & min(yearID) > 1961) %>%
  mutate(PA = AB + BB) %>%
  filter(sum(PA) > 1000) %>%
  summarize(BB = sum(BB) / sum(PA),
            Singles = sum(H - X2B - X3B - HR) / sum(PA),
            XB = sum(X2B + X3B) / sum(PA),
            HR = sum(HR) / sum(PA),
            year = floor(median(yearID)))
  1. You can add max salary and other player statistics using the dplyr join functions.

  2. You can use aggregate statistics to build a predictor of runs produced for a team based exclusively on BB, singles, extra base hits, and HR. We did this above:

fit <- Teams %>%
  filter(yearID >= 1961) %>%  
  mutate(R = R / G, BB = BB / G,
         Singles = (H - X2B - X3B - HR) / G,
         XB = (X2B + X3B) / G, HR = HR / G) %>%
  lm(R ~ BB + Singles + XB + HR, data = .)
  1. Create a predicted runs for each player that answers the question: what happens if this player was the only hitter on the team? Focus on players with more than 400 plate appearances that played in 2002. Assume that each team has 6000 plate appearances in each 162 game. From here and from hints 1 and 3 you can create a predicted runs per game statistics for each player. Then look for outliers in terms of producing more runs given their salary.

  2. The lp function in the lpSolve package may be useful for optimizing a team. But you can also do it in an ad-hoc fashion. Once you are done with your team, use the regression fit above to predict the number of runs per game your team will produce.