This homework is due Sunday March 27, 2016 at 11:59PM EST. When complete, submit your code in an R Markdown file and the knitted HTML via GitHub.

Introduction

Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis about the Oakland Athletics baseball team and its general manager, the person tasked with building the team, Billy Beane. During Billy Bean’s tenure as general manager, ownership cut the budget drastically leaving Billy Bean with one of the lowest payrolls in baseball. Money Ball tells the story of how Billy Bean used analysts to find inefficiencies in the market. Specifically, his team used data science to find low cost players that the data predicted would help the team win. In this lab we will go back to 2002 and try to build a baseball team with a limited budget of $50,000,000. Note that in contrast to that Oakland A’s, the Yankees had a budget of more than double: $125,000,000

We will use the Lahman library as well as the usual dplyr and ggplot2. We also introduce the package broom.

library(Lahman)
library(dplyr)
library(ggplot2)
library(broom)

You can see tables that are available when you load this package by typing

?Lahman

Problem 1 (80% of grade)

Statistics have been used in baseball since its beginnings. Note that Lahman goes back to the 19th century. Batting average, for example, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in (RBI) and stolen bases have been reported and players rewarded for high numbers. However, until Bill James introduced sabermetrics, careful analyses had not been done to determine if these statistics actually help a team win. To simplify the exercise we will focus on scoring runs and ignore pitching and fielding.

Problem 1A

Use the data in the Team table to explore the relationship between stolen bases and runs per game in 1999. Make a plot, fit a regression line, and report the coefficients. If you take the coefficient at face value, how many more runs per game does a team score for every extra SB per game?

theme_set(theme_bw())
fit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R = R / G, SB = SB / G) %>%
  lm(R ~ SB, data = .) 
Teams %>%
  filter(yearID == 1999) %>%
  mutate(R = R / G, SB = SB / G) %>% 
  ggplot(aes(SB, R)) + 
  geom_point() +
  geom_abline(intercept = fit$coef[1],
              slope = fit$coef[2])

print(fit$coefficients[2])

##        SB 
## 0.4294013

Problem 1B

In Problem 1A we observed a positive relationship between scoring runs and stealing bases. However, the estimated coefficient is a random variable. Their is chance involved in scoring run. So how do we know if this observed relationship was not just chance variability?

To examine the variability of this random variable we will consider each year to be a new independent outcome. Use the lm and do functions to fit a linear model to each year since 1961 (when they started playing 162 games per year). Hint: use the function tidy in broom to process the regression in each group so that it can be recombined (see here for examples).

Using this approach what is your estimate of the random variable’s standard error? Is the distribution of the random variable well approximated by a normal distribution? If so, use this to provide a 95% confidence interval for our effect of stolen bases on runs per game. Do you think stolen bases help score runs?

res <- Teams %>% filter(yearID >= 1961) %>%
  mutate(R = R / G, SB = SB / G) %>%
  group_by(yearID) %>%
  do(tidy(lm(R ~ SB, data = .))) %>%
  filter(term == "SB")

### ThE SE is:
sd(res$estimate)

## [1] 0.4125252

##The CI is
mean(res$estimate) + c(-1,1)*qnorm(0.975)*sd(res$estimate)

## [1] -0.7744817  0.8425875

qqnorm(res$estimate)
qqline(res$estimate)

Problem 1C

Even if we didn’t have several years to examine the distribution of our estimate, there is a version of CLT that applies to regression. It turns out that with a large enough sample size, in this case the number of teams, we can construct a confidence interval. Use the function tidy to report a confidence interval for the effect of SB on runs based exclusively on the 1999 data. What are your thoughts now on the effectiveness of recruiting players that can steal bases?

fit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R = R / G, SB = SB / G) %>%
  lm(R ~ SB, data = .) 

# either
res <- tidy(fit, conf.int = TRUE)
res %>%
  filter(term == "SB")

##   term  estimate std.error statistic   p.value   conf.low conf.high
## 1   SB 0.4294013 0.4308495 0.9966388 0.3274754 -0.4531538  1.311956

# or
res <- summary(fit)
res$coef[2,1] + c(-1,1)*qnorm(0.975)*res$coef[2,2]

## [1] -0.4150481  1.2738507

Problem 1D

Back in 2002, bases on balls (BB) did not receive as much attention as other statistics. Repeat the above analysis we performed for SB for BB per game. Do BB have larger effect on runs than SB?

fit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R = R / G, BB = BB / G) %>%
  lm(R ~ BB, data = .)

summary(fit)

## 
## Call:
## lm(formula = R ~ BB, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81699 -0.23204 -0.02234  0.15707  0.80796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0703     0.5095   6.026 1.71e-06 ***
## BB            0.5467     0.1369   3.992 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3913 on 28 degrees of freedom
## Multiple R-squared:  0.3627, Adjusted R-squared:  0.3399 
## F-statistic: 15.94 on 1 and 28 DF,  p-value: 0.0004294

Problem 1E

Association is not causation. It turns out that HR hitters also obtain many BB. We know for a fact that HRs cause runs because, by definition, they produce at least one. We can see this by simply plotting these two statistics for all players with more than 500 plate appearances (BB+AB):

Batting %>%
  filter(yearID >= 1961 & BB+AB > 500 & !is.na(HR) & !is.na(BB)) %>% 
  mutate(HR = factor(pmin(HR, 40))) %>%
  ggplot(aes(HR, BB)) +
  geom_boxplot()

So is the relationship we saw above for BB and Runs due to teams having more HRs also having more BBs? One way we can explore this is by keeping HR fixed and examining the relationship within the strata. For example, if we only look only at teams with 150 home runs, do more BB produce more runs?

We can’t perform this analysis on a single year, because there are not enough teams to obtain strata with more than one or two teams. Instead we will combine all data years since 1961.

Group data by the number of HRs and perform a regression analysis in each stratum to determine the effect of BB per game on runs per game. Use 10th, 20th, … quantiles to split the data into 10 groups. Hint: use the function cut and quantile to create the strata.

my_data <- Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, BB = BB / G, HR = HR / G) %>% 
  mutate(group = cut(HR, quantile(HR, prob = seq(0, 1, .1)), include.lowest = TRUE))

res <- my_data %>%
  group_by(group) %>%
  do(tidy(lm(R ~ BB, data = .))) %>%
  filter(term == "BB")

res

## Source: local data frame [10 x 6]
## Groups: group [10]
## 
##            group  term  estimate  std.error statistic      p.value
##           (fctr) (chr)     (dbl)      (dbl)     (dbl)        (dbl)
## 1  [0.291,0.592]    BB 0.5779436 0.07607545  7.596979 3.932573e-12
## 2  (0.592,0.679]    BB 0.2717674 0.08386533  3.240522 1.486360e-03
## 3  (0.679,0.759]    BB 0.3249864 0.08355098  3.889678 1.536842e-04
## 4  (0.759,0.821]    BB 0.3501527 0.07087733  4.940263 2.192452e-06
## 5  (0.821,0.886]    BB 0.2952293 0.06880996  4.290502 3.390204e-05
## 6  (0.886,0.957]    BB 0.3162944 0.07365079  4.294515 3.130540e-05
## 7   (0.957,1.02]    BB 0.4855139 0.07418955  6.544236 1.281302e-09
## 8     (1.02,1.1]    BB 0.4441734 0.06320091  7.027958 8.419585e-11
## 9     (1.1,1.23]    BB 0.4278463 0.06086063  7.029935 7.817139e-11
## 10   (1.23,1.63]    BB 0.5361635 0.06419123  8.352597 6.832191e-14

## Visual inspection seems to show the relationship is linear in each strate
my_data %>%
  ggplot(aes(BB, R)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  facet_wrap(~group)

Problem 1F

In problem 1E we saw that the effect of BB on runs appears to be about the same in each strata. The relationship between HR and R is also, not surprisingly, linear:

Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, HR = HR / G) %>%
  ggplot(aes(HR, R)) +
  geom_point()

These two combined implies that a sensible linear model says:

\[ \mbox{Runs} = \beta_0 + \beta_{BB} \mbox{BB} + \beta_{HR}{HR} + \varepsilon \]

In this model, we adjust for HRs by including it as linear term. Note that we have already showed data that support this model. In general, simply fitting such a model does not necessarily adjust for a possible confounded. The model must be approximately correct.

We can fit this model like this:

fit <- Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, BB = BB / G, HR = HR / G) %>%
  lm(R ~ BB + HR, data = .)
summary(fit)

## 
## Call:
## lm(formula = R ~ BB + HR, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94455 -0.23722 -0.01219  0.22386  1.25223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.74279    0.06963   25.03   <2e-16 ***
## BB           0.39985    0.02236   17.89   <2e-16 ***
## HR           1.50620    0.04006   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3423 on 1413 degrees of freedom
## Multiple R-squared:  0.6421, Adjusted R-squared:  0.6416 
## F-statistic:  1268 on 2 and 1413 DF,  p-value: < 2.2e-16

Note that the summary shows a very strong HR effect but also a decent BB effect. Now what happens if we include Singles (H-X2B-X3B-HR), Extra bases (doubles X2B and triples X3B), and HR per game. What does the model say about which of these characteristics should receive more weight. Fit the model to each year independently to check for consistency from year to year.

fit <- Teams %>% filter(yearID>=1961) %>%  
  mutate( R = R / G, BB = BB / G, 
          Singles = (H - X2B - X3B - HR) / G,
          XB = (X2B + X3B) / G, HR = HR / G) %>%
  lm(R ~ BB + Singles + XB + HR, data = .)
summary(fit)

## 
## Call:
## lm(formula = R ~ BB + Singles + XB + HR, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50778 -0.09685 -0.00255  0.09667  0.54763 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.83225    0.07326  -38.66   <2e-16 ***
## BB           0.36356    0.01001   36.31   <2e-16 ***
## Singles      0.55455    0.01040   53.30   <2e-16 ***
## XB           0.76998    0.01774   43.40   <2e-16 ***
## HR           1.39934    0.02055   68.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.153 on 1411 degrees of freedom
## Multiple R-squared:  0.9286, Adjusted R-squared:  0.9284 
## F-statistic:  4589 on 4 and 1411 DF,  p-value: < 2.2e-16

### You may want to check for consistency by year
library(broom)
Teams %>%
  filter(yearID >= 1961) %>%
  group_by(yearID) %>%
  mutate(R = R / G, BB = BB / G, Singles = (H - X2B - X3B - HR) / G,
         XB = (X2B + X3B) / G, HR = HR / G) %>%
  do(tidy(lm(R ~ BB + Singles + XB + HR, data = .))) %>%
  filter(!grepl("Intercept", term)) %>%
  ggplot(aes(yearID, estimate, group = term, col = term)) +
  geom_line() + 
  geom_point()

Problem 2 (20% of grade)

In Problem 1 we learned how much BB, singles, extra base hits and home runs help predict runs. Now we want to see how much these costs. Note that batting average, Hits (H) divided by at bats (AB) receive much attention while bases on balls (BB) does not. However, we saw how BB have almost the same effect on runs as singles. Therefore, it is possible that players that receive many BB and do not hit many singles may be undervalued. Before studying this specific question, we will examine if teams can use money to improve.

In general, does spending money help a teams win? Here we will compute the payroll for each team each year. This information is not directly provided. But we have the salary for each player and we also what team they played each year.

Before we get started there is some data wrangling to be done.

## We can use ifelse if you have not seen the revalue function
my_salaries <- Salaries %>%
  mutate(teamID = as.character(plyr::revalue(teamID, c(SFG = "SFN", NYM = "NYN"))))

Problem 2A

Use the mySalaries data to compute each team’s payroll, in millions of dollars, for each team during each year. Save this into an object called payroll. Hints: Group by team and year and then sum all salaries. As a sanity check make a plot of payroll against year with color representing teams. Payroll should be increasing with the New York Yankees (code NYA) having the largest payroll. Consider plotting salaries in the log-scale.

payroll <- my_salaries %>%
  group_by(yearID, teamID) %>%
  summarize(payroll = sum(salary) / 10^6) %>%
  ungroup()

###Now look at the plot
library(ggplot2)
theme_set(theme_bw(base_size = 16))

payroll %>%
  filter(yearID > 1987)%>%
  ggplot(aes(x = yearID, y = payroll, color = teamID)) + 
  scale_y_log10() +
  geom_line()

Problem 2B

Now add the team’s winning percentage, wins / (wins + losses) for each year to the payroll table. Hints: The Teams table has wins and losses for each team for each year. The dplyr join functions permit you to join by two columns.

payroll  <- Teams %>% 
  mutate(teamID = as.character(teamID), pct = W / (W + L)) %>% 
  select(yearID, teamID, pct) %>% 
  right_join(payroll, by = c("yearID", "teamID"))

Problem 2C

Explore the relationship between payroll and winning percentage. Use data visualization to describe what you think is the relationship. Hint: Make scatter plots for, say, 9 different years and add a regression line.

years <- round(c(seq(1985, 2014, len = 9)))
payroll %>%
  filter(yearID %in% years) %>% 
  ggplot(aes(x = log2(payroll), y = pct)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~yearID, scales = "free")

##In general: Winning pct grows with payroll

Problem 2D

Use the lm function to fit a linear model to the 1999 data. Use the log-transform for the payroll data. According to this fitted model, on average, how many more wins does a team improve every time their budget doubles? Provide a 95% confidence interval.

payroll <- mutate(payroll, wins = pct*162)
fit <- payroll %>% 
  filter(yearID == 1999) %>% 
  lm(wins ~ log2(payroll), data= . )

summary(fit)

## 
## Call:
## lm(formula = wins ~ log2(payroll), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.725  -8.374   1.051   7.845  18.698 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      24.40      16.07   1.519   0.1400   
## log2(payroll)    10.29       2.90   3.546   0.0014 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.46 on 28 degrees of freedom
## Multiple R-squared:   0.31,  Adjusted R-squared:  0.2853 
## F-statistic: 12.58 on 1 and 28 DF,  p-value: 0.001397

tidy(fit, conf.int = TRUE)

##            term estimate std.error statistic     p.value  conf.low
## 1   (Intercept) 24.40343 16.065999  1.518949 0.139987413 -8.506278
## 2 log2(payroll) 10.28521  2.900152  3.546437 0.001396836  4.344515
##   conf.high
## 1  57.31314
## 2  16.22590

## 10.3 +/- 5.7 more wins each time we double payroll

Problem 2E

Did the Oakland A’s outperform during the Money Ball era? Notice that after Oakland’s original success, other teams started implementing their approach. If in fact they were over-performing then they were winning more than predicted by the regression model.

Fit a linear model to the wins versus standardized data for each year. Then plot the residuals for Oakland. Make the same plot for the Boston Red Sox. (Hint: use the augment function from broom on each linear fit object to extract the residuals- look at the documentation for augment.lm). What do year do you think Oakland started using data science and when did other teams catch up?

library(broom)
payroll_with_resid <- payroll %>%
  group_by(yearID) %>%
  do(augment(lm(wins ~ log2(payroll), data = . ), data = . )) %>%
  ungroup()

payroll_with_resid %>%
  filter(teamID %in% c("BOS", "OAK")) %>% 
  ggplot(aes(x = yearID, y = .resid)) +
  facet_grid(~teamID) +
  geom_point() +
  geom_smooth(span = 0.66) +
  geom_hline(aes(yintercept = 0))

Problem 2F

Since 2000 which team has performed the best over what was expected given their payroll?

payroll_with_resid %>%
  filter(yearID >= 2000) %>%
  mutate(teamID = reorder(teamID, .resid, median)) %>%
  ggplot(aes(teamID, .resid, color = teamID)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Problem 2G

For each of the BB, singles, extra bases, home runs and stolen bases per game how did Oakland rank in 2002?

res <- Teams %>%
  filter(yearID == 1999) %>%  
  mutate(BB = BB / G, Singles = (H - X2B - X3B - HR) / G,
         XB = (X2B + X3B) / G, HR = HR / G, SB = SB / G) %>%
  select(teamID, BB, Singles, XB, HR, SB)

# two possible solutions:
library(tidyr)

# or
res2 <- apply(-res[,-1], 2, rank)
rownames(res2) <- as.character(res[,1])
res2

##       BB Singles   XB HR SB
## ANA 24.0    20.0 30.0 25 25
## ARI 17.0    16.0 10.5  5  8
## ATL 14.0    24.0 13.0 11  6
## BAL 11.0     8.0 18.0 10 20
## BOS 15.0    17.0  1.0 18 29
## CHA 27.0     6.0 10.5 23 17
## CHN 20.0    26.0 26.0 14 30
## CIN 21.0    21.5  5.0  9  4
## CLE  2.0     4.0  8.0  8  7
## COL 25.0     5.0  7.0  4 27
## DET 29.0    29.0 15.0  6 19
## FLO 28.0    13.0 21.0 29 22
## HOU  3.0    19.0 20.0 20  3
## KCA 23.0     1.0  4.0 27 11
## LAN 16.0    14.5 29.0 16  2
## MIL  7.0    10.5 14.0 21 23
## MIN 26.0    10.5 19.0 30 14
## MON 30.0    25.0  2.0 22 27
## NYA  4.0     9.0  9.0 13 21
## NYN  5.0     7.0 22.0 17  5
## OAK  1.0    30.0 23.0  2 27
## PHI  8.5    12.0  6.0 24 12
## PIT 19.0    27.0 17.0 19 15
## SDN  8.5    28.0 28.0 26  1
## SEA 13.0    23.0 27.0  1 10
## SFN  6.0    18.0 16.0 15 18
## SLN 10.0    21.5 24.0 12  9
## TBA 22.0     3.0 25.0 28 24
## TEX 12.0     2.0 12.0  3 16
## TOR 18.0    14.5  3.0  7 13

##Note OAK was 1st in walks, 2nd in homeruns. Almost last in singles and SB.

res3 <- res %>%
  gather(metric, value, -teamID) %>%
  group_by(metric) %>%
  mutate(ranking = rank(-value))

res3 %>%
  filter(teamID == "OAK")

## Source: local data frame [5 x 4]
## Groups: metric [5]
## 
##   teamID  metric     value ranking
##   (fctr)   (chr)     (dbl)   (dbl)
## 1    OAK      BB 4.7530864       1
## 2    OAK Singles 5.4814815      30
## 3    OAK      XB 1.8950617      23
## 4    OAK      HR 1.4506173       2
## 5    OAK      SB 0.4320988      27

Problem 3 (Bonus)

Now we are going to build a baseball team for the 2002 season. We get to pick one of each of the 9 batting positions DH, C, 1B, 2B, 3B, SS, CF and two outfielders (OF, LF, or RF). We will pick players active in 2002 but you will have to pay each player whatever their maximum salary was during their entire career. You have a total of $50 million. Show us your team and how many runs you think they will produce. Note the number of runs per games of the best teams:

Teams %>%
  filter(yearID == 2012, lgID == "AL") %>%
  mutate(runs_per_game = R / G) %>%
  select(teamID, W, L, runs_per_game) %>%
  arrange(desc(W))

##    teamID  W  L runs_per_game
## 1     NYA 95 67      4.962963
## 2     OAK 94 68      4.401235
## 3     BAL 93 69      4.395062
## 4     TEX 93 69      4.987654
## 5     TBA 90 72      4.302469
## 6     LAA 89 73      4.734568
## 7     DET 88 74      4.481481
## 8     CHA 85 77      4.617284
## 9     SEA 75 87      3.820988
## 10    TOR 73 89      4.419753
## 11    KCA 72 90      4.172840
## 12    BOS 69 93      4.530864
## 13    CLE 68 94      4.117284
## 14    MIN 66 96      4.327160

Here is a possible answer:

Start by taking players who’s careers have ended by 2002 and started playing on 1961 or later. Then create new statistics of BB, Singles, extra bases, HR per plate appearance. A plate appearance is AB + BB.

Like this:

res <- Batting %>%
  group_by(playerID) %>%
  filter(max(yearID) <= 2002 & min(yearID) > 1961) %>%
  mutate(PA = AB + BB) %>%
  filter(sum(PA) > 1000) %>%
  summarize(BB = sum(BB) / sum(PA),
            Singles = sum(H - X2B - X3B - HR) / sum(PA),
            XB = sum(X2B + X3B) / sum(PA),
            HR = sum(HR) / sum(PA),
            year = floor(median(yearID)))

Now add the max salary for each player and notice how much each of the above statistics roughly costs. Note that BB and extra base hits seem to be under valued. In the linear model below we include a year effect.

res <- my_salaries %>%
  group_by(playerID) %>% 
  summarize(max_salary = max(salary) / 10^6) %>% 
  inner_join(res, by = "playerID")

lm(log2(max_salary)~ as.factor(year) + BB + Singles + XB + HR, data = res)

## 
## Call:
## lm(formula = log2(max_salary) ~ as.factor(year) + BB + Singles + 
##     XB + HR, data = res)
## 
## Coefficients:
##         (Intercept)  as.factor(year)1975  as.factor(year)1976  
##            -13.1575               0.6328               0.5737  
## as.factor(year)1977  as.factor(year)1978  as.factor(year)1979  
##              1.0731               1.1846               1.6237  
## as.factor(year)1980  as.factor(year)1981  as.factor(year)1982  
##              1.6642               1.8159               1.8459  
## as.factor(year)1983  as.factor(year)1984  as.factor(year)1985  
##              2.0267               2.0670               2.0390  
## as.factor(year)1986  as.factor(year)1987  as.factor(year)1988  
##              2.6887               2.2730               3.1995  
## as.factor(year)1989  as.factor(year)1990  as.factor(year)1991  
##              2.9324               3.1231               3.3678  
## as.factor(year)1992  as.factor(year)1993  as.factor(year)1994  
##              3.0127               3.3433               3.3465  
## as.factor(year)1995  as.factor(year)1996  as.factor(year)1997  
##              2.9344               3.3973               3.2598  
## as.factor(year)1998  as.factor(year)1999  as.factor(year)2000  
##              2.8748               2.2517               0.5553  
##                  BB              Singles                   XB  
##             13.8030              37.0837              33.1097  
##                  HR  
##             69.5072

Now we create a max_salary table.

max_salary <- my_salaries %>%
  group_by(playerID) %>%
  summarize(salary = max(salary))

And use aggregate statistics to build a predictor of runs produced for a team based exclusively on BB, singles, extra base hits, and HR. We did this above:

fit <- Teams %>%
  filter(yearID >= 1961) %>%  
  mutate(R = R / G, BB = BB / G,
         Singles = (H - X2B - X3B - HR) / G,
         XB = (X2B + X3B) / G, HR = HR / G) %>%
  lm(R ~ BB + Singles + XB + HR, data = .)

Now we can create a predicted runs for each player that answers the question: what happens if this player was the only hitter on the team? We focus on players with more than 400 plate appearances that played in 2002. We also assume that each team has 6000 plate appearances in each 162 game:

pr <- Batting %>%
  tbl_df %>%
  filter(yearID == 2002 & AB + BB >= 400) %>% 
  mutate(PA = BB + AB) %>%
  mutate(BB = BB / PA * 6000 / 162,
         Singles = (H - X2B - X3B - HR) / PA * 6000 / 162,
         XB = (X2B + X3B) / PA * 6000 / 162, 
         HR = HR / PA * 6000 / 162) %>%
  mutate(runs_per_game = predict(fit, newdata = .)) %>%
  select(playerID, BB, Singles, XB, HR, runs_per_game) %>%
  left_join(max_salary, by = "playerID")

Now add other information:

pr <- Master %>% select(playerID, nameFirst, nameLast) %>%
  right_join(pr, by = "playerID")

pr <- Fielding %>%
  filter(yearID == 2002) %>% 
  filter(POS != "P") %>% 
  group_by(playerID) %>% 
  summarize(POS = POS[which.max(G)]) %>% 
  inner_join(pr, by = "playerID")

pr %>%
  mutate(POS = ifelse(POS %in% c("OF", "LF", "RF"), "OF", POS)) %>%
  ggplot(aes(salary / 10 ^ 6, runs_per_game)) + 
  geom_text(aes(label = nameLast)) + 
  facet_wrap(~POS, scales = "free")

Now we use linear programming to maximize the number of runs, given that we must pick one of each position:

library(reshape2)

# position by player matrix for constraint: pick one for each pos
constraint_matrix <- acast(pr, POS ~ playerID, fun.aggregate = length)

## Using salary as value column: use value.var to override.

npos <- nrow(constraint_matrix)
# finally, constrain on salary.
constraint_matrix <- rbind(constraint_matrix, salary = pr$salary)

# positions must be exactly 1 of each, salary <= 50M
constraint_dir <- c(rep("==", npos), "<=")
constraint_limit <- c(rep(1, npos), 50e6)

# solve 
library(lpSolve)
lp_solution <- lp("max", pr$runs_per_game,
                  constraint_matrix, constraint_dir, constraint_limit,
                  all.int = TRUE) # Can't have half a player

# chosen players
pr %>%
  filter(lp_solution$solution == 1)

## Source: local data frame [8 x 10]
## 
##    playerID   POS nameFirst  nameLast        BB  Singles       XB       HR
##       (chr) (chr)     (chr)     (chr)     (dbl)    (dbl)    (dbl)    (dbl)
## 1 bellhma01    2B      Mark  Bellhorn  5.402716 4.265302 1.990474 1.919386
## 2 bondsba01    LF     Barry     Bonds 12.201886 4.313798 2.033648 2.834782
## 3 finlest01    CF     Steve    Finley  4.223522 5.977908 1.819363 1.624431
## 4 fullmbr01    1B      Brad   Fullmer  2.570901 5.141801 3.293966 1.526472
## 5 hernajo01    SS      Jose Hernandez  3.337827 6.483086 1.668913 1.540535
## 6 millake01    OF     Kevin    Millar  3.099334 5.966217 3.176817 1.239733
## 7 perryhe01    3B   Herbert     Perry  2.601775 5.892256 1.913070 1.683502
## 8 santibe01     C    Benito  Santiago  1.980198 6.453979 2.126879 1.173451
## Variables not shown: runs_per_game (dbl), salary (int)

pr %>%
  filter(lp_solution$solution == 1) %>%
  summarize(sum(salary), mean(runs_per_game))

## Source: local data frame [1 x 2]
## 
##   sum(salary) mean(runs_per_game)
##         (int)               (dbl)
## 1    48583333            5.965083

(Adding in a designated hitter is more work but can be done).

Alternative ad-hoc solution

To evaluate how much extra value we get from a player we compute residuals:

## residuals
pr <- pr %>%
  do(augment(lm(runs_per_game~salary, data=.), data=.)) %>%
  select(-.fitted,-.se.fit, -(.hat:.std.resid)) %>%
  rename(resid = .resid)

Now we implement an ad-hoc search algorithm:

tmp <- pr %>% mutate(POS=ifelse(POS%in%c("OF","LF","RF"), "OF", POS))
Max <- 50000000
pos <- c("C", "1B","2B", "3B", "SS","CF","OF","OF")
team <- c()
payroll <- 0
N <- 0 
while(length(pos)>0){
  is <- order(tmp$resid, decreasing = TRUE)
  FLAG <- TRUE
  count <- 0
  must_leave <- tmp %>% group_by(POS) %>% 
    summarize(min=min(salary)) %>% arrange(min) %>% slice(-1) %>%
    summarize(sum(min)) %>% 
    unlist
  if(sum(pos=="OF")>1) {
    must_leave <- must_leave + tmp %>% filter(POS=="OF") %>% 
    arrange(salary) %>% slice(2) %>% .$salary
  }
  must_leave <- must_leave + min(pr$salary)
  print(c(N, Max-payroll, must_leave,  min(tmp$salary)))
  while(FLAG){
    count <- count +1
    i <- is[count]
    if(payroll + tmp$salary[i] + must_leave <  Max ){
      FLAG <- FALSE
      N<-N+1
      team <- c(team,paste(tmp$playerID[i]))
      payroll <- payroll + tmp$salary[i] 
      pos <- pos[-match(tmp$POS[i],pos)]
    }
  }
  if(!tmp$POS[i]%in%pos) 
    tmp <- filter(tmp, POS!=tmp$POS[i])
  tmp <- filter(tmp, !playerID%in%team)
}

##                   sum(min)          
##        0 50000000  5957500   280000 
##                   sum(min)          
##        1 28000000  5007500   280000 
##                   sum(min)          
##        2 18333333  4622500   280000 
##                   sum(min)          
##        3  7333333  4322500   300000 
##                   sum(min)          
##        4  4583333  3980000   342500 
##                   sum(min)          
##        5  4240833  3180000   800000 
##                   sum(min)          
##        6  3440833  1780000  1400000 
##                   sum(min)          
##        7  1940833   280000  1400000

### Now add a DH
tmp <- pr %>% filter(!playerID%in%team)
is <- order(tmp$runs_per_game, decreasing = TRUE)
while(FLAG){
  count <- count +1
  i <- is[count]
  if(payroll + tmp$salary[i] <=  Max ){
    FLAG <- FALSE
    team <- c(team,paste(tmp$playerID[i]))
    payroll <- payroll + tmp$salary[i] 
  }
}

And here is our team:

filter(pr, playerID%in%team) %>% summarize(mean(runs_per_game))

##   mean(runs_per_game)
## 1            5.712664

##compared to:
Teams %>% filter(yearID==2012 & lgID=="AL") %>% 
mutate(runs_per_game=R/G) %>% select(teamID, W,L,runs_per_game) %>% arrange(desc(W))

##    teamID  W  L runs_per_game
## 1     NYA 95 67      4.962963
## 2     OAK 94 68      4.401235
## 3     BAL 93 69      4.395062
## 4     TEX 93 69      4.987654
## 5     TBA 90 72      4.302469
## 6     LAA 89 73      4.734568
## 7     DET 88 74      4.481481
## 8     CHA 85 77      4.617284
## 9     SEA 75 87      3.820988
## 10    TOR 73 89      4.419753
## 11    KCA 72 90      4.172840
## 12    BOS 69 93      4.530864
## 13    CLE 68 94      4.117284
## 14    MIN 66 96      4.327160