This homework is due Sunday March 27, 2016 at 11:59PM EST. When complete, submit your code in an R Markdown file and the knitted HTML via GitHub.
Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis about the Oakland Athletics baseball team and its general manager, the person tasked with building the team, Billy Beane. During Billy Bean’s tenure as general manager, ownership cut the budget drastically leaving Billy Bean with one of the lowest payrolls in baseball. Money Ball tells the story of how Billy Bean used analysts to find inefficiencies in the market. Specifically, his team used data science to find low cost players that the data predicted would help the team win. In this lab we will go back to 2002 and try to build a baseball team with a limited budget of $50,000,000. Note that in contrast to that Oakland A’s, the Yankees had a budget of more than double: $125,000,000
We will use the Lahman
library as well as the usual dplyr
and ggplot2
. We also introduce the package broom
.
library(Lahman)
library(dplyr)
library(ggplot2)
library(broom)
You can see tables that are available when you load this package by typing
?Lahman
Statistics have been used in baseball since its beginnings. Note that Lahman
goes back to the 19th century. Batting average, for example, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in (RBI) and stolen bases have been reported and players rewarded for high numbers. However, until Bill James introduced sabermetrics, careful analyses had not been done to determine if these statistics actually help a team win. To simplify the exercise we will focus on scoring runs and ignore pitching and fielding.
Use the data in the Team
table to explore the relationship between stolen bases and runs per game in 1999. Make a plot, fit a regression line, and report the coefficients. If you take the coefficient at face value, how many more runs per game does a team score for every extra SB per game?
theme_set(theme_bw())
fit <- Teams %>%
filter(yearID == 1999) %>%
mutate(R = R / G, SB = SB / G) %>%
lm(R ~ SB, data = .)
Teams %>%
filter(yearID == 1999) %>%
mutate(R = R / G, SB = SB / G) %>%
ggplot(aes(SB, R)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])
print(fit$coefficients[2])
## SB
## 0.4294013
In Problem 1A we observed a positive relationship between scoring runs and stealing bases. However, the estimated coefficient is a random variable. Their is chance involved in scoring run. So how do we know if this observed relationship was not just chance variability?
To examine the variability of this random variable we will consider each year to be a new independent outcome. Use the lm
and do
functions to fit a linear model to each year since 1961 (when they started playing 162 games per year). Hint: use the function tidy
in broom
to process the regression in each group so that it can be recombined (see here for examples).
Using this approach what is your estimate of the random variable’s standard error? Is the distribution of the random variable well approximated by a normal distribution? If so, use this to provide a 95% confidence interval for our effect of stolen bases on runs per game. Do you think stolen bases help score runs?
res <- Teams %>% filter(yearID >= 1961) %>%
mutate(R = R / G, SB = SB / G) %>%
group_by(yearID) %>%
do(tidy(lm(R ~ SB, data = .))) %>%
filter(term == "SB")
### ThE SE is:
sd(res$estimate)
## [1] 0.4125252
##The CI is
mean(res$estimate) + c(-1,1)*qnorm(0.975)*sd(res$estimate)
## [1] -0.7744817 0.8425875
qqnorm(res$estimate)
qqline(res$estimate)
Even if we didn’t have several years to examine the distribution of our estimate, there is a version of CLT that applies to regression. It turns out that with a large enough sample size, in this case the number of teams, we can construct a confidence interval. Use the function tidy
to report a confidence interval for the effect of SB on runs based exclusively on the 1999 data. What are your thoughts now on the effectiveness of recruiting players that can steal bases?
fit <- Teams %>%
filter(yearID == 1999) %>%
mutate(R = R / G, SB = SB / G) %>%
lm(R ~ SB, data = .)
# either
res <- tidy(fit, conf.int = TRUE)
res %>%
filter(term == "SB")
## term estimate std.error statistic p.value conf.low conf.high
## 1 SB 0.4294013 0.4308495 0.9966388 0.3274754 -0.4531538 1.311956
# or
res <- summary(fit)
res$coef[2,1] + c(-1,1)*qnorm(0.975)*res$coef[2,2]
## [1] -0.4150481 1.2738507
Back in 2002, bases on balls (BB) did not receive as much attention as other statistics. Repeat the above analysis we performed for SB for BB per game. Do BB have larger effect on runs than SB?
fit <- Teams %>%
filter(yearID == 1999) %>%
mutate(R = R / G, BB = BB / G) %>%
lm(R ~ BB, data = .)
summary(fit)
##
## Call:
## lm(formula = R ~ BB, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81699 -0.23204 -0.02234 0.15707 0.80796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0703 0.5095 6.026 1.71e-06 ***
## BB 0.5467 0.1369 3.992 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3913 on 28 degrees of freedom
## Multiple R-squared: 0.3627, Adjusted R-squared: 0.3399
## F-statistic: 15.94 on 1 and 28 DF, p-value: 0.0004294
Association is not causation. It turns out that HR hitters also obtain many BB. We know for a fact that HRs cause runs because, by definition, they produce at least one. We can see this by simply plotting these two statistics for all players with more than 500 plate appearances (BB+AB
):
Batting %>%
filter(yearID >= 1961 & BB+AB > 500 & !is.na(HR) & !is.na(BB)) %>%
mutate(HR = factor(pmin(HR, 40))) %>%
ggplot(aes(HR, BB)) +
geom_boxplot()
So is the relationship we saw above for BB and Runs due to teams having more HRs also having more BBs? One way we can explore this is by keeping HR fixed and examining the relationship within the strata. For example, if we only look only at teams with 150 home runs, do more BB produce more runs?
We can’t perform this analysis on a single year, because there are not enough teams to obtain strata with more than one or two teams. Instead we will combine all data years since 1961.
Group data by the number of HRs and perform a regression analysis in each stratum to determine the effect of BB per game on runs per game. Use 10th, 20th, … quantiles to split the data into 10 groups. Hint: use the function cut
and quantile
to create the strata.
my_data <- Teams %>%
filter(yearID >= 1961) %>%
mutate(R = R / G, BB = BB / G, HR = HR / G) %>%
mutate(group = cut(HR, quantile(HR, prob = seq(0, 1, .1)), include.lowest = TRUE))
res <- my_data %>%
group_by(group) %>%
do(tidy(lm(R ~ BB, data = .))) %>%
filter(term == "BB")
res
## Source: local data frame [10 x 6]
## Groups: group [10]
##
## group term estimate std.error statistic p.value
## (fctr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 [0.291,0.592] BB 0.5779436 0.07607545 7.596979 3.932573e-12
## 2 (0.592,0.679] BB 0.2717674 0.08386533 3.240522 1.486360e-03
## 3 (0.679,0.759] BB 0.3249864 0.08355098 3.889678 1.536842e-04
## 4 (0.759,0.821] BB 0.3501527 0.07087733 4.940263 2.192452e-06
## 5 (0.821,0.886] BB 0.2952293 0.06880996 4.290502 3.390204e-05
## 6 (0.886,0.957] BB 0.3162944 0.07365079 4.294515 3.130540e-05
## 7 (0.957,1.02] BB 0.4855139 0.07418955 6.544236 1.281302e-09
## 8 (1.02,1.1] BB 0.4441734 0.06320091 7.027958 8.419585e-11
## 9 (1.1,1.23] BB 0.4278463 0.06086063 7.029935 7.817139e-11
## 10 (1.23,1.63] BB 0.5361635 0.06419123 8.352597 6.832191e-14
## Visual inspection seems to show the relationship is linear in each strate
my_data %>%
ggplot(aes(BB, R)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~group)
In problem 1E we saw that the effect of BB on runs appears to be about the same in each strata. The relationship between HR and R is also, not surprisingly, linear:
Teams %>%
filter(yearID >= 1961) %>%
mutate(R = R / G, HR = HR / G) %>%
ggplot(aes(HR, R)) +
geom_point()
These two combined implies that a sensible linear model says:
\[ \mbox{Runs} = \beta_0 + \beta_{BB} \mbox{BB} + \beta_{HR}{HR} + \varepsilon \]
In this model, we adjust for HRs by including it as linear term. Note that we have already showed data that support this model. In general, simply fitting such a model does not necessarily adjust for a possible confounded. The model must be approximately correct.
We can fit this model like this:
fit <- Teams %>%
filter(yearID >= 1961) %>%
mutate(R = R / G, BB = BB / G, HR = HR / G) %>%
lm(R ~ BB + HR, data = .)
summary(fit)
##
## Call:
## lm(formula = R ~ BB + HR, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94455 -0.23722 -0.01219 0.22386 1.25223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.74279 0.06963 25.03 <2e-16 ***
## BB 0.39985 0.02236 17.89 <2e-16 ***
## HR 1.50620 0.04006 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3423 on 1413 degrees of freedom
## Multiple R-squared: 0.6421, Adjusted R-squared: 0.6416
## F-statistic: 1268 on 2 and 1413 DF, p-value: < 2.2e-16
Note that the summary shows a very strong HR effect but also a decent BB effect. Now what happens if we include Singles (H-X2B-X3B-HR
), Extra bases (doubles X2B
and triples X3B
), and HR per game. What does the model say about which of these characteristics should receive more weight. Fit the model to each year independently to check for consistency from year to year.
fit <- Teams %>% filter(yearID>=1961) %>%
mutate( R = R / G, BB = BB / G,
Singles = (H - X2B - X3B - HR) / G,
XB = (X2B + X3B) / G, HR = HR / G) %>%
lm(R ~ BB + Singles + XB + HR, data = .)
summary(fit)
##
## Call:
## lm(formula = R ~ BB + Singles + XB + HR, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50778 -0.09685 -0.00255 0.09667 0.54763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.83225 0.07326 -38.66 <2e-16 ***
## BB 0.36356 0.01001 36.31 <2e-16 ***
## Singles 0.55455 0.01040 53.30 <2e-16 ***
## XB 0.76998 0.01774 43.40 <2e-16 ***
## HR 1.39934 0.02055 68.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.153 on 1411 degrees of freedom
## Multiple R-squared: 0.9286, Adjusted R-squared: 0.9284
## F-statistic: 4589 on 4 and 1411 DF, p-value: < 2.2e-16
### You may want to check for consistency by year
library(broom)
Teams %>%
filter(yearID >= 1961) %>%
group_by(yearID) %>%
mutate(R = R / G, BB = BB / G, Singles = (H - X2B - X3B - HR) / G,
XB = (X2B + X3B) / G, HR = HR / G) %>%
do(tidy(lm(R ~ BB + Singles + XB + HR, data = .))) %>%
filter(!grepl("Intercept", term)) %>%
ggplot(aes(yearID, estimate, group = term, col = term)) +
geom_line() +
geom_point()
In Problem 1 we learned how much BB, singles, extra base hits and home runs help predict runs. Now we want to see how much these costs. Note that batting average, Hits (H) divided by at bats (AB) receive much attention while bases on balls (BB) does not. However, we saw how BB have almost the same effect on runs as singles. Therefore, it is possible that players that receive many BB and do not hit many singles may be undervalued. Before studying this specific question, we will examine if teams can use money to improve.
In general, does spending money help a teams win? Here we will compute the payroll for each team each year. This information is not directly provided. But we have the salary for each player and we also what team they played each year.
Before we get started there is some data wrangling to be done.
## We can use ifelse if you have not seen the revalue function
my_salaries <- Salaries %>%
mutate(teamID = as.character(plyr::revalue(teamID, c(SFG = "SFN", NYM = "NYN"))))
Use the mySalaries
data to compute each team’s payroll, in millions of dollars, for each team during each year. Save this into an object called payroll
. Hints: Group by team and year and then sum all salaries. As a sanity check make a plot of payroll against year with color representing teams. Payroll should be increasing with the New York Yankees (code NYA) having the largest payroll. Consider plotting salaries in the log-scale.
payroll <- my_salaries %>%
group_by(yearID, teamID) %>%
summarize(payroll = sum(salary) / 10^6) %>%
ungroup()
###Now look at the plot
library(ggplot2)
theme_set(theme_bw(base_size = 16))
payroll %>%
filter(yearID > 1987)%>%
ggplot(aes(x = yearID, y = payroll, color = teamID)) +
scale_y_log10() +
geom_line()
Now add the team’s winning percentage, wins / (wins + losses) for each year to the payroll
table. Hints: The Teams
table has wins and losses for each team for each year. The dplyr
join
functions permit you to join by two columns.
payroll <- Teams %>%
mutate(teamID = as.character(teamID), pct = W / (W + L)) %>%
select(yearID, teamID, pct) %>%
right_join(payroll, by = c("yearID", "teamID"))
Explore the relationship between payroll and winning percentage. Use data visualization to describe what you think is the relationship. Hint: Make scatter plots for, say, 9 different years and add a regression line.
years <- round(c(seq(1985, 2014, len = 9)))
payroll %>%
filter(yearID %in% years) %>%
ggplot(aes(x = log2(payroll), y = pct)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~yearID, scales = "free")
##In general: Winning pct grows with payroll
Use the lm
function to fit a linear model to the 1999 data. Use the log-transform for the payroll data. According to this fitted model, on average, how many more wins does a team improve every time their budget doubles? Provide a 95% confidence interval.
payroll <- mutate(payroll, wins = pct*162)
fit <- payroll %>%
filter(yearID == 1999) %>%
lm(wins ~ log2(payroll), data= . )
summary(fit)
##
## Call:
## lm(formula = wins ~ log2(payroll), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.725 -8.374 1.051 7.845 18.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.40 16.07 1.519 0.1400
## log2(payroll) 10.29 2.90 3.546 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.46 on 28 degrees of freedom
## Multiple R-squared: 0.31, Adjusted R-squared: 0.2853
## F-statistic: 12.58 on 1 and 28 DF, p-value: 0.001397
tidy(fit, conf.int = TRUE)
## term estimate std.error statistic p.value conf.low
## 1 (Intercept) 24.40343 16.065999 1.518949 0.139987413 -8.506278
## 2 log2(payroll) 10.28521 2.900152 3.546437 0.001396836 4.344515
## conf.high
## 1 57.31314
## 2 16.22590
## 10.3 +/- 5.7 more wins each time we double payroll
Did the Oakland A’s outperform during the Money Ball era? Notice that after Oakland’s original success, other teams started implementing their approach. If in fact they were over-performing then they were winning more than predicted by the regression model.
Fit a linear model to the wins versus standardized data for each year. Then plot the residuals for Oakland. Make the same plot for the Boston Red Sox. (Hint: use the augment
function from broom on each linear fit object to extract the residuals- look at the documentation for augment.lm
). What do year do you think Oakland started using data science and when did other teams catch up?
library(broom)
payroll_with_resid <- payroll %>%
group_by(yearID) %>%
do(augment(lm(wins ~ log2(payroll), data = . ), data = . )) %>%
ungroup()
payroll_with_resid %>%
filter(teamID %in% c("BOS", "OAK")) %>%
ggplot(aes(x = yearID, y = .resid)) +
facet_grid(~teamID) +
geom_point() +
geom_smooth(span = 0.66) +
geom_hline(aes(yintercept = 0))
Since 2000 which team has performed the best over what was expected given their payroll?
payroll_with_resid %>%
filter(yearID >= 2000) %>%
mutate(teamID = reorder(teamID, .resid, median)) %>%
ggplot(aes(teamID, .resid, color = teamID)) +
geom_boxplot() +
geom_hline(aes(yintercept = 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
For each of the BB, singles, extra bases, home runs and stolen bases per game how did Oakland rank in 2002?
res <- Teams %>%
filter(yearID == 1999) %>%
mutate(BB = BB / G, Singles = (H - X2B - X3B - HR) / G,
XB = (X2B + X3B) / G, HR = HR / G, SB = SB / G) %>%
select(teamID, BB, Singles, XB, HR, SB)
# two possible solutions:
library(tidyr)
# or
res2 <- apply(-res[,-1], 2, rank)
rownames(res2) <- as.character(res[,1])
res2
## BB Singles XB HR SB
## ANA 24.0 20.0 30.0 25 25
## ARI 17.0 16.0 10.5 5 8
## ATL 14.0 24.0 13.0 11 6
## BAL 11.0 8.0 18.0 10 20
## BOS 15.0 17.0 1.0 18 29
## CHA 27.0 6.0 10.5 23 17
## CHN 20.0 26.0 26.0 14 30
## CIN 21.0 21.5 5.0 9 4
## CLE 2.0 4.0 8.0 8 7
## COL 25.0 5.0 7.0 4 27
## DET 29.0 29.0 15.0 6 19
## FLO 28.0 13.0 21.0 29 22
## HOU 3.0 19.0 20.0 20 3
## KCA 23.0 1.0 4.0 27 11
## LAN 16.0 14.5 29.0 16 2
## MIL 7.0 10.5 14.0 21 23
## MIN 26.0 10.5 19.0 30 14
## MON 30.0 25.0 2.0 22 27
## NYA 4.0 9.0 9.0 13 21
## NYN 5.0 7.0 22.0 17 5
## OAK 1.0 30.0 23.0 2 27
## PHI 8.5 12.0 6.0 24 12
## PIT 19.0 27.0 17.0 19 15
## SDN 8.5 28.0 28.0 26 1
## SEA 13.0 23.0 27.0 1 10
## SFN 6.0 18.0 16.0 15 18
## SLN 10.0 21.5 24.0 12 9
## TBA 22.0 3.0 25.0 28 24
## TEX 12.0 2.0 12.0 3 16
## TOR 18.0 14.5 3.0 7 13
##Note OAK was 1st in walks, 2nd in homeruns. Almost last in singles and SB.
res3 <- res %>%
gather(metric, value, -teamID) %>%
group_by(metric) %>%
mutate(ranking = rank(-value))
res3 %>%
filter(teamID == "OAK")
## Source: local data frame [5 x 4]
## Groups: metric [5]
##
## teamID metric value ranking
## (fctr) (chr) (dbl) (dbl)
## 1 OAK BB 4.7530864 1
## 2 OAK Singles 5.4814815 30
## 3 OAK XB 1.8950617 23
## 4 OAK HR 1.4506173 2
## 5 OAK SB 0.4320988 27
Now we are going to build a baseball team for the 2002 season. We get to pick one of each of the 9 batting positions DH, C, 1B, 2B, 3B, SS, CF and two outfielders (OF, LF, or RF). We will pick players active in 2002 but you will have to pay each player whatever their maximum salary was during their entire career. You have a total of $50 million. Show us your team and how many runs you think they will produce. Note the number of runs per games of the best teams:
Teams %>%
filter(yearID == 2012, lgID == "AL") %>%
mutate(runs_per_game = R / G) %>%
select(teamID, W, L, runs_per_game) %>%
arrange(desc(W))
## teamID W L runs_per_game
## 1 NYA 95 67 4.962963
## 2 OAK 94 68 4.401235
## 3 BAL 93 69 4.395062
## 4 TEX 93 69 4.987654
## 5 TBA 90 72 4.302469
## 6 LAA 89 73 4.734568
## 7 DET 88 74 4.481481
## 8 CHA 85 77 4.617284
## 9 SEA 75 87 3.820988
## 10 TOR 73 89 4.419753
## 11 KCA 72 90 4.172840
## 12 BOS 69 93 4.530864
## 13 CLE 68 94 4.117284
## 14 MIN 66 96 4.327160
Here is a possible answer:
Start by taking players who’s careers have ended by 2002 and started playing on 1961 or later. Then create new statistics of BB, Singles, extra bases, HR per plate appearance. A plate appearance is AB + BB
.
Like this:
res <- Batting %>%
group_by(playerID) %>%
filter(max(yearID) <= 2002 & min(yearID) > 1961) %>%
mutate(PA = AB + BB) %>%
filter(sum(PA) > 1000) %>%
summarize(BB = sum(BB) / sum(PA),
Singles = sum(H - X2B - X3B - HR) / sum(PA),
XB = sum(X2B + X3B) / sum(PA),
HR = sum(HR) / sum(PA),
year = floor(median(yearID)))
Now add the max salary for each player and notice how much each of the above statistics roughly costs. Note that BB and extra base hits seem to be under valued. In the linear model below we include a year effect.
res <- my_salaries %>%
group_by(playerID) %>%
summarize(max_salary = max(salary) / 10^6) %>%
inner_join(res, by = "playerID")
lm(log2(max_salary)~ as.factor(year) + BB + Singles + XB + HR, data = res)
##
## Call:
## lm(formula = log2(max_salary) ~ as.factor(year) + BB + Singles +
## XB + HR, data = res)
##
## Coefficients:
## (Intercept) as.factor(year)1975 as.factor(year)1976
## -13.1575 0.6328 0.5737
## as.factor(year)1977 as.factor(year)1978 as.factor(year)1979
## 1.0731 1.1846 1.6237
## as.factor(year)1980 as.factor(year)1981 as.factor(year)1982
## 1.6642 1.8159 1.8459
## as.factor(year)1983 as.factor(year)1984 as.factor(year)1985
## 2.0267 2.0670 2.0390
## as.factor(year)1986 as.factor(year)1987 as.factor(year)1988
## 2.6887 2.2730 3.1995
## as.factor(year)1989 as.factor(year)1990 as.factor(year)1991
## 2.9324 3.1231 3.3678
## as.factor(year)1992 as.factor(year)1993 as.factor(year)1994
## 3.0127 3.3433 3.3465
## as.factor(year)1995 as.factor(year)1996 as.factor(year)1997
## 2.9344 3.3973 3.2598
## as.factor(year)1998 as.factor(year)1999 as.factor(year)2000
## 2.8748 2.2517 0.5553
## BB Singles XB
## 13.8030 37.0837 33.1097
## HR
## 69.5072
Now we create a max_salary
table.
max_salary <- my_salaries %>%
group_by(playerID) %>%
summarize(salary = max(salary))
And use aggregate statistics to build a predictor of runs produced for a team based exclusively on BB, singles, extra base hits, and HR. We did this above:
fit <- Teams %>%
filter(yearID >= 1961) %>%
mutate(R = R / G, BB = BB / G,
Singles = (H - X2B - X3B - HR) / G,
XB = (X2B + X3B) / G, HR = HR / G) %>%
lm(R ~ BB + Singles + XB + HR, data = .)
Now we can create a predicted runs for each player that answers the question: what happens if this player was the only hitter on the team? We focus on players with more than 400 plate appearances that played in 2002. We also assume that each team has 6000 plate appearances in each 162 game:
pr <- Batting %>%
tbl_df %>%
filter(yearID == 2002 & AB + BB >= 400) %>%
mutate(PA = BB + AB) %>%
mutate(BB = BB / PA * 6000 / 162,
Singles = (H - X2B - X3B - HR) / PA * 6000 / 162,
XB = (X2B + X3B) / PA * 6000 / 162,
HR = HR / PA * 6000 / 162) %>%
mutate(runs_per_game = predict(fit, newdata = .)) %>%
select(playerID, BB, Singles, XB, HR, runs_per_game) %>%
left_join(max_salary, by = "playerID")
Now add other information:
pr <- Master %>% select(playerID, nameFirst, nameLast) %>%
right_join(pr, by = "playerID")
pr <- Fielding %>%
filter(yearID == 2002) %>%
filter(POS != "P") %>%
group_by(playerID) %>%
summarize(POS = POS[which.max(G)]) %>%
inner_join(pr, by = "playerID")
pr %>%
mutate(POS = ifelse(POS %in% c("OF", "LF", "RF"), "OF", POS)) %>%
ggplot(aes(salary / 10 ^ 6, runs_per_game)) +
geom_text(aes(label = nameLast)) +
facet_wrap(~POS, scales = "free")
Now we use linear programming to maximize the number of runs, given that we must pick one of each position:
library(reshape2)
# position by player matrix for constraint: pick one for each pos
constraint_matrix <- acast(pr, POS ~ playerID, fun.aggregate = length)
## Using salary as value column: use value.var to override.
npos <- nrow(constraint_matrix)
# finally, constrain on salary.
constraint_matrix <- rbind(constraint_matrix, salary = pr$salary)
# positions must be exactly 1 of each, salary <= 50M
constraint_dir <- c(rep("==", npos), "<=")
constraint_limit <- c(rep(1, npos), 50e6)
# solve
library(lpSolve)
lp_solution <- lp("max", pr$runs_per_game,
constraint_matrix, constraint_dir, constraint_limit,
all.int = TRUE) # Can't have half a player
# chosen players
pr %>%
filter(lp_solution$solution == 1)
## Source: local data frame [8 x 10]
##
## playerID POS nameFirst nameLast BB Singles XB HR
## (chr) (chr) (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 bellhma01 2B Mark Bellhorn 5.402716 4.265302 1.990474 1.919386
## 2 bondsba01 LF Barry Bonds 12.201886 4.313798 2.033648 2.834782
## 3 finlest01 CF Steve Finley 4.223522 5.977908 1.819363 1.624431
## 4 fullmbr01 1B Brad Fullmer 2.570901 5.141801 3.293966 1.526472
## 5 hernajo01 SS Jose Hernandez 3.337827 6.483086 1.668913 1.540535
## 6 millake01 OF Kevin Millar 3.099334 5.966217 3.176817 1.239733
## 7 perryhe01 3B Herbert Perry 2.601775 5.892256 1.913070 1.683502
## 8 santibe01 C Benito Santiago 1.980198 6.453979 2.126879 1.173451
## Variables not shown: runs_per_game (dbl), salary (int)
pr %>%
filter(lp_solution$solution == 1) %>%
summarize(sum(salary), mean(runs_per_game))
## Source: local data frame [1 x 2]
##
## sum(salary) mean(runs_per_game)
## (int) (dbl)
## 1 48583333 5.965083
(Adding in a designated hitter is more work but can be done).
To evaluate how much extra value we get from a player we compute residuals:
## residuals
pr <- pr %>%
do(augment(lm(runs_per_game~salary, data=.), data=.)) %>%
select(-.fitted,-.se.fit, -(.hat:.std.resid)) %>%
rename(resid = .resid)
Now we implement an ad-hoc search algorithm:
tmp <- pr %>% mutate(POS=ifelse(POS%in%c("OF","LF","RF"), "OF", POS))
Max <- 50000000
pos <- c("C", "1B","2B", "3B", "SS","CF","OF","OF")
team <- c()
payroll <- 0
N <- 0
while(length(pos)>0){
is <- order(tmp$resid, decreasing = TRUE)
FLAG <- TRUE
count <- 0
must_leave <- tmp %>% group_by(POS) %>%
summarize(min=min(salary)) %>% arrange(min) %>% slice(-1) %>%
summarize(sum(min)) %>%
unlist
if(sum(pos=="OF")>1) {
must_leave <- must_leave + tmp %>% filter(POS=="OF") %>%
arrange(salary) %>% slice(2) %>% .$salary
}
must_leave <- must_leave + min(pr$salary)
print(c(N, Max-payroll, must_leave, min(tmp$salary)))
while(FLAG){
count <- count +1
i <- is[count]
if(payroll + tmp$salary[i] + must_leave < Max ){
FLAG <- FALSE
N<-N+1
team <- c(team,paste(tmp$playerID[i]))
payroll <- payroll + tmp$salary[i]
pos <- pos[-match(tmp$POS[i],pos)]
}
}
if(!tmp$POS[i]%in%pos)
tmp <- filter(tmp, POS!=tmp$POS[i])
tmp <- filter(tmp, !playerID%in%team)
}
## sum(min)
## 0 50000000 5957500 280000
## sum(min)
## 1 28000000 5007500 280000
## sum(min)
## 2 18333333 4622500 280000
## sum(min)
## 3 7333333 4322500 300000
## sum(min)
## 4 4583333 3980000 342500
## sum(min)
## 5 4240833 3180000 800000
## sum(min)
## 6 3440833 1780000 1400000
## sum(min)
## 7 1940833 280000 1400000
### Now add a DH
tmp <- pr %>% filter(!playerID%in%team)
is <- order(tmp$runs_per_game, decreasing = TRUE)
while(FLAG){
count <- count +1
i <- is[count]
if(payroll + tmp$salary[i] <= Max ){
FLAG <- FALSE
team <- c(team,paste(tmp$playerID[i]))
payroll <- payroll + tmp$salary[i]
}
}
And here is our team:
filter(pr, playerID%in%team) %>% summarize(mean(runs_per_game))
## mean(runs_per_game)
## 1 5.712664
##compared to:
Teams %>% filter(yearID==2012 & lgID=="AL") %>%
mutate(runs_per_game=R/G) %>% select(teamID, W,L,runs_per_game) %>% arrange(desc(W))
## teamID W L runs_per_game
## 1 NYA 95 67 4.962963
## 2 OAK 94 68 4.401235
## 3 BAL 93 69 4.395062
## 4 TEX 93 69 4.987654
## 5 TBA 90 72 4.302469
## 6 LAA 89 73 4.734568
## 7 DET 88 74 4.481481
## 8 CHA 85 77 4.617284
## 9 SEA 75 87 3.820988
## 10 TOR 73 89 4.419753
## 11 KCA 72 90 4.172840
## 12 BOS 69 93 4.530864
## 13 CLE 68 94 4.117284
## 14 MIN 66 96 4.327160