Overview

Purpose

MLB Salaries vs. Offensive Statistics 2022 Season

The purpose of this study is to analyze Major League Baseball salaries in relation to offensive player statistics from the 2022 season. The study also includes team and positional summaries. The data set contains $418$ players measured on $35$ variables.

Studying this relationship gives insight into salary and performance forecasting, helps explain why a player or team receives the salary they do, and provides a more well-rounded understanding of the game itself.

$\mathbf{Note:}$ This data set contains roughly half of the MLB player pool. Individual player summaries were not studied.

Research Questions

The questions this analysis aimed to answer are:

• Which position in baseball is making the most money?

• Which team in baseball has the highest average salary among its players?

• Which offensive statistics best contribute to a player’s salary?

Abstract

From this analysis, I found that HR, Age, and BB (Walks) have the most explanatory power for a player’s individual salary. The corresponding model has $R^2 = 0.3689$. The coefficient estimates for HR, Age, and BB are $\$236,228$, $\$895,247$, and $\$59,629$, respectively.

Further, third basemen have the highest average salary by position, at roughly $\$9,047,746$. Lastly, the LA Dodgers have the largest average team salary at $\$11,479,902$.

Methods

The methods used in this analysis are:

• Visual exploratory analysis to find single-variable relationships with salary.

• AIC variable selection to validate our graphical findings.

• OLS regression to quantify the impact each variable has on salary.

Inferences will be made from these methods, but it is understood this data set is incomplete.

Data Introduction

Data Table

Variable Classification

The data table lists every observation and variable in this data set. The variables are classified as follows:

Age: The age of the player.

Team: The team the player is on.

Lg: The league the team/player is in (American or National).

G: Games played.

PA: Plate appearances. Every completed turn in the batter’s box counts as a plate appearance.

AB: At bats. Different from PA. An at-bat is a plate appearance that does not end in a walk, hit by pitch, sacrifice, or catcher’s interference. It is recorded whether the batter makes an out or reaches base on a hit, error, or fielder’s choice.

R: Runs scored. A run is credited when a player crosses home plate.

H: Hits. A hit is credited when the batter reaches base safely on a batted ball without the aid of an error or a fielder’s choice.

2B, 3B, HR: Total number of doubles, triples, and home runs, respectively.

SB: Stolen bases.

BB: Base on balls or walks.

SO: Strike outs.

BA: Batting average, measured by the number of hits divided by the number of at bats (AB).

OBP: On base percentage. How often a batter reaches base via a hit, walk, or hit by pitch.

SLG: Slugging. A measure of batting productivity or efficiency, calculated as total bases (TB) divided by at bats (AB).

OPS: Sum of OBP and SLG. Gives measure for overall hitting ability.

OPS+: Normalized OPS. Takes into consideration outside factors of the game such as the stadium being played in or the league you are in. OPS+ of 100 is the average.

TB: Total bases gained by a batter. Sum of all bases reached from singles, doubles, triples, etc.

GDP: Ground into double play.

HBP: Hit by pitch.

SH: Sacrifice hit (sacrifice bunt). An SH is credited when a batter advances one or more base runners with a bunt that results in an out.

SF: Sacrifice fly. An SF is credited when a batter hits a fly ball that is caught for an out and a runner scores on the play. Typically high fly balls to the outfield.

IBB: Intentional Base on Balls or Intentional Walk.

Position: Position this player plays.

Salary: Salary earned in the 2022 season.

The final 4 variables (Age_Gp, RBI_Gp, BB_Gp, and HRH) were created for this analysis. Each player’s Age, RBIs, Walks, and Home runs were binned into groups, and the resulting classifications are stored as factors.

$\mathbf{Note:}$ MLB salaries span a very wide range, which makes patterns hard to see on the raw scale. To remedy this, log(Salary) will be studied in appropriate cases. The log transform preserves the ordering of salaries and turns ratios between salaries into differences, which allows us to visualize the data more efficiently.

Exploratory Analysis

Boxplots: Salary vs Age Group ~ Walks

Scatter Plot: log(Salary) vs. HR

Analysis

By inspection of the boxplot matrix, we can see clear positive relationships between salary and both Walks and Age. At almost every classification of Age, we can see an upward trending median, interquartile range, and overall spread of data. Further, we can see a similar upward trend at each classification of number of Walks.

Similarly, from the scatter plot there is a clear positive trend between log(Salary) and Home runs. This plot also gives insight into a player’s position. Using underlying theory of the game, we can infer that perhaps most of the observations along the line $x = 0$ (no home runs) are pitchers. Another explanation is a lack of playing time. Less playing time directly translates to fewer ABs and, hence, fewer home runs.

Takeaways

The relationship among these variables is clear. As walks, home runs, and age increase, salary tends to increase as well.

But which variables have the biggest impact on salary? Specifically, does any one of these variables do a markedly better job of forecasting salary than the others?

Variable Selection

AIC Algorithm/Linear Model

Salary vs Age, HR, and BB

log(Salary) vs. Age, HR, and BB

Analysis

AIC (Akaike Information Criterion): AIC is a criterion for comparing models, and stepwise selection uses it to choose variables. The search operates by forward selection, backward selection, or both. The charm of this approach is that it provides a “best bang for your buck” in terms of variable selection. Overfitting a model to your specific data set is dangerous in regression, and AIC helps guard against it by penalizing each added variable.

How it works: Given a scope of candidate variables, the algorithm adds or removes variables (forward, backward, or both) one step at a time, keeping a change only if it lowers the AIC. The overall goal is to minimize the AIC metric.

$\mathbf{Note:}$ Stepwise AIC selection is sensitive to the model scope. Adding new or different variables to your scope can change your outcome.

So, after running this algorithm on a subset of offensive statistics, the suggested model is shown. How does this model do?

The OLS regression output is shown below.

From this data set, OLS found that all of our variables are statistically significant at the $0.001$ level. Interpretation is as follows: for each one-unit increase in HR, Age, or BB, on average and all else equal, you can expect a salary increase of $\$236,228$, $\$895,247$, or $\$59,629$, respectively. Also shown is the $\mathbf{R^2}$ value. This is a measure of how much of the variation in salary is explained by our independent variables HR, Age, and BB. The corresponding $\mathbf{R^2}$ was measured to be $0.3689$. Given the wide range of salary and overall low model complexity, this is pretty good.

To get further insight into the relationships at play, scatter plot matrices are shown. Make note of the linear relationship between Home runs and Walks. On the surface, this may seem random, but using underlying theory of the game, this is expected.

Example: Aaron Judge was the home run leader of the 2022 season. Pitchers are more reluctant to throw “hittable” pitches to a batter with home run power, so they opt for “junk” or simply walk the batter intentionally to avoid the situation altogether.

Position Summary

Position and Avg. Salary

Avg. Salary vs. Position

Avg. Homeruns vs. Position

Avg. Walks vs. Position

Analysis

Graphically shown are salary averages, home run averages, and BB averages based on position.

Taking these statistics and displaying them in increasing order, we are able to more clearly find connections between offensive output and position. Shuffling through the plots, we can see familiar faces at the top of our distribution.

From this data set, it appears that first basemen (1B) and third basemen (3B) have the largest averages of home runs, walks, and salary. As these offensive statistics increase, we would expect salary to increase as well. This further supports the claim made from our linear model.

On the other hand, we can see that pitchers and catchers are often near the bottom of this distribution. From personal understanding of the game, this seems incorrect.

This may be partly explained by the lack of observations in this data set. Another explanation is that pitchers and catchers are the two positions most associated with defensive efficiency, while this data set captures only the offensive aspects. So, perhaps with the presence of defensive statistics, a more accurate understanding of salary can be shown.

Team Summary

Avg Salary vs. Team

Avg. Walks vs. Team

Avg. Homeruns vs. Team

Analysis

Before beginning this analysis, keep in mind that the final four teams in the 2022 postseason were the New York Yankees (NYY), San Diego Padres (SDP), Houston Astros (HOU), and Philadelphia Phillies (PHI). These were the “four best teams” of that season.

By inspection of the output, there is a lot of variation in the team summaries. This is to be expected given how fluid and widely ranging salaries are and how unpredictable sports can be. But upon further inspection, the most offensively sound teams are not always the ones winning: sports in a nutshell.

It is interesting to see a different team at the top of each distribution. The Dodgers have the biggest checkbook, but don’t even crack the top 5 in any other category. On the other hand, the Brewers are offensively productive ($\#2$ in Avg. Walks & HRs), but couldn’t find a route to the postseason.

Author’s Hometown: Cincinnati, OH

Moving on.

Interactive Map

MLB Teams with Summary Statistics

Conclusion

Results

From this analysis, it was found that Walks, Home runs, and Age have the most explanatory power for a player’s salary. Further, with respect to position, third base has the largest average salary. Lastly, the LA Dodgers have the largest average salary among the players in their organization.

Limitations

Limitations of this study:

Any speculation about the use of PEDs (performance-enhancing drugs) is not considered.

One could include a full data set of all active MLB players to get a more complete understanding of salaries. Similarly, a complete data set with offensive and defensive statistics could better account for players who are vital to a team’s success on a defensive level.

The collinearity in the model (HR vs. BB) still needs to be remedied.

Lastly, this analysis assumed that salary is linear in each predictor. Theory suggests that the relationship between Age and salary is non-linear (roughly parabolic). That is, productivity has a “cap”: past a certain Age, additional years bring diminishing returns. Modeling this could better capture the variability in salary.
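
One way to write the curved Age effect described above, as a sketch rather than a model fitted in this study (the coefficient symbols are notation introduced here for illustration):

$$\text{Salary} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Age}^2 + \beta_3\,\text{HR} + \beta_4\,\text{BB} + \varepsilon$$

With $\beta_2 < 0$, expected salary rises with Age, peaks at $\text{Age} = -\beta_1 / (2\beta_2)$, and declines afterward.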

References

https://databases.usatoday.com/mlb-salaries-2022/

https://www.kaggle.com/search?q=mlb+statistics

---
title: "EDA: MLB Salary"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "#3b5998"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title 
  {  
    /* chart_title  */
    font-size: 18px;
    font-family: Arial;
  }
body
  {
      /* Normal  */
      font-size: 16px;
  }
</style>

```{css color tabs}
/* Font color of inactive tabs */
.nav-tabs-custom .nav-tabs > li > a 
  {
    color: blue;
  } 

/* Font color of the active tab */
.nav-tabs-custom .nav-tabs > li.active > a 
  {
    color: purple;
  } 

/* Font color of the active tab on hover */
.nav-tabs-custom .nav-tabs > li.active > a:hover 
  {
    color: black;
  }

/* Let the sidebar scroll when its content overflows */
.sidebar 
  { 
    overflow: auto; 
  } 
```

```{r setup, include=FALSE}
library(flexdashboard)
```

```{r, data/packages}
library(pacman)
pacman::p_load(tidyverse, rvest, stringr, 
               ggplot2, forcats, writexl, 
               maps, viridis, scales, 
               plotly, GGally)

mlb_salary <- read_csv("C:\\Users\\altos\\Documents\\Datasets\\mlb_salary.csv")
mlb_batting <- read_delim("C:\\Users\\altos\\Documents\\Datasets\\mlbb.txt", delim = ";")
mlb_map <- read_csv("C:\\Users\\altos\\Documents\\Datasets\\map_mlb.csv")

colnames(mlb_map)[1] <- "Tm"

# Keep the first row per player, then drop combined "TOT" rows. If a traded
# player's first row is the "TOT" row, that player is removed entirely.
mlb_batting <- mlb_batting[!duplicated(mlb_batting$Name),]
mlb_batting <- mlb_batting[mlb_batting$Tm != "TOT",]

# Cleaning name column
mlb_batting$Name <- gsub("\xa0", " ", mlb_batting$Name)
mlb_batting$Name <- gsub("<e9>", "e", mlb_batting$Name)
mlb_batting$Name <- gsub("#", "", mlb_batting$Name)
mlb_batting$Name <- gsub("<c1>", "A", mlb_batting$Name)
mlb_batting$Name <- gsub("<f3>", "o", mlb_batting$Name)
mlb_batting$Name <- gsub("<e1>", "a", mlb_batting$Name)
mlb_batting$Name <- gsub("<f1>", "n", mlb_batting$Name)
mlb_batting$Name <- gsub("<ed>", "i", mlb_batting$Name)
mlb_batting$Name <- stringr::str_replace(mlb_batting$Name, '\\*', '')

mlb_batting <- mlb_batting %>%
  arrange(Name)

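# Player is split on ", ": even elements become First, odd elements become Last
allnames <- unlist(str_split(mlb_salary$Player, ", "))
mlb_salary$First <- allnames[seq_along(allnames) %% 2 == 0]
mlb_salary$Last <- allnames[seq_along(allnames) %% 2 == 1]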
mlb_salary$Name <- paste(mlb_salary$First, mlb_salary$Last)
mlb_salary$Name <- str_remove_all(mlb_salary$Name, "\\.")
mlb_batting$Name <- str_remove_all(mlb_batting$Name, "\\.")

mlb_salary <- mlb_salary %>%
  arrange(Name)

mlb_salary$Name[which(mlb_salary$Name=="A Minter.J.")] <- "Alex Minter"
mlb_salary$Name[which(mlb_salary$Name=="A Puk.J.")] <- "Andrew Puk"
mlb_salary$Name[which(mlb_salary$Name=="D'Travis Arnaud")] <- "Travis d'Arnaud"
mlb_salary$Name[which(mlb_salary$Name=="Abraham Toro-Hernandez")] <- "Abraham Toro"
mlb_salary$Name[which(mlb_salary$Name=="AJ Pollock IV")] <- "AJ Pollock"
mlb_salary$Name[which(mlb_salary$Name=="Alexander Colome")] <- "Alex Colome"
mlb_salary$Name[which(mlb_salary$Name=="Mullins Cedric II")] <- "Cedric Mullins"
mlb_salary$Name[which(mlb_salary$Name=="Christopher Martin")] <- "Chris Martin"
mlb_salary$Name[which(mlb_salary$Name=="Jazz Chisholm")] <- "Jazz Chisholm Jr"
mlb_salary$Name[which(mlb_salary$Name=="LaMonte Wade")] <- "LaMonte Wade Jr"
mlb_salary$Name[which(mlb_salary$Name=="Lourdes Gurriel")] <- "Lourdes Gurriel Jr"
mlb_salary$Name[which(mlb_salary$Name=="Vladimir Guerrero")] <- "Vladimir Guerrero Jr"
mlb_batting$Name[which(mlb_batting$Name=="Michael A Taylor")] <- "Michael Taylor"

mlb_salary <- mlb_salary %>%
  select(Name, Position, Salary)

mlb_data <- mlb_batting %>%
  left_join(mlb_salary, by = "Name")

mlb_data <- mlb_data %>%
  filter(!is.na(Salary))

mlb_data[,"Age_Gp"] <- NA

mlb_data <- mlb_data %>%
  mutate(Age_Gp = case_when(
  Age >= 20 & Age < 25 ~ "20-25",
  Age >= 25 & Age < 30 ~ "25-30", Age >= 30 & Age < 35 ~ "30-35",
  Age >= 35 & Age < 40 ~ "35-40", Age >= 40 & Age < 45 ~ "40-45",
  Age >= 45 & Age < 50 ~ "45-50"))

mlb_data$Age_Gp <- as.factor(mlb_data$Age_Gp)

mlb_data[,"RBI_Gp"] <- NA

mlb_data <- mlb_data %>%
  mutate(RBI_Gp = case_when(
  RBI >= 0 & RBI < 30 ~ "0-30",
  RBI >= 30 & RBI < 60 ~ "30-60", RBI >= 60 & RBI < 90 ~ "60-90",
  RBI >= 90 & RBI < 120 ~ "90-120", RBI >= 120 ~ "120+"))

mlb_data$RBI_Gp <- as.factor(mlb_data$RBI_Gp)

mlb_data[,"BB_Gp"] <- NA

mlb_data <- mlb_data %>%
  mutate(BB_Gp = case_when(
  BB >= 0 & BB < 25 ~ "0-25",
  BB >= 25 & BB < 50 ~ "25-50", BB >= 50 & BB < 75 ~ "50-75", 
  BB >= 75 ~ "75+"))

mlb_data$BB_Gp <- as.factor(mlb_data$BB_Gp)

mlb_data[,"HRH"] <- NA

mlb_data <- mlb_data %>%
  mutate(HRH = case_when(
    HR < 10 ~ "<10", HR >= 10 ~ "10+"))

mlb_data$HRH <- as.factor(mlb_data$HRH)

avg_p <- mlb_data %>%
  group_by(Position) %>%
  summarise(Avg_RBI = mean(RBI), Avg_Sal = mean(Salary), Avg_BB = mean(BB), Avg_HR = mean(HR))

avg_tm <- mlb_data %>%
  group_by(Tm) %>%
  summarise(Avg_Sal = mean(Salary), Avg_RBI = mean(RBI), Avg_BB = mean(BB), Avg_HR = mean(HR))

avg_p$Position <- as.factor(avg_p$Position)
avg_tm$Tm <- as.factor(avg_tm$Tm)

avg_tm <- avg_tm %>%
  left_join(mlb_map, by = "Tm")

avg_p <- avg_p[-c(5,6,10),]
fit.mlb <- lm(Salary ~ BB + Age + HR, data = mlb_data)
```

Overview
===

Column {data-width=650}
-----------------------------------------------------------------------

### Purpose

**MLB Salaries vs. Offensive Statistics 2022 Season**

The purpose of this study is to analyze Major League Baseball salaries in relation to offensive player statistics from the 2022 season. The study also includes team and positional summaries. The data set contains $418$ players measured on $35$ variables.

Studying this relationship gives insight into salary and performance forecasting, helps explain why a player or team receives the salary they do, and provides a more well-rounded understanding of the game itself.

$\mathbf{Note:}$ This data set contains roughly half of the MLB player pool. Individual player summaries were not studied.

### Research Questions

The questions this analysis aimed to answer are:

• Which position in baseball is making the most money?

• Which team in baseball has the highest average salary among its players?

• Which offensive statistics best contribute to a player's salary?

Column {data-height=650}
-----------------------------------------------------------------------

### Abstract

From this analysis, I found that HR, Age, and BB (Walks) have the most explanatory power for a player's individual salary. The corresponding model has $R^2 = 0.3689$. The coefficient estimates for HR, Age, and BB are $\$236,228$, $\$895,247$, and $\$59,629$, respectively.

Further, third basemen have the highest average salary by position, at roughly $\$9,047,746$. Lastly, the LA Dodgers have the largest average team salary at $\$11,479,902$.

### Methods

The methods used in this analysis are:

• Visual exploratory analysis to find single-variable relationships with salary.

• AIC variable selection to validate our graphical findings.

• OLS regression to quantify the impact each variable has on salary.

Inferences will be made from these methods, but it is understood this data set is incomplete.

Data Introduction
===

Row
-----------------------------------------------------------------------

### Data Table

```{r data table}
DT::datatable(mlb_data[,2:35], rownames = FALSE, 
              options = list(columnDefs = list(list(className = 'dt-center', targets = 1:33))))
```

Row
-----------------------------------------------------------------------

### Variable Classification

The data table lists every observation and variable in this data set. The variables are classified as follows:

Age: The age of the player.

Team: The team the player is on.

Lg: The league the team/player is in (American or National).

G: Games played.

PA: Plate appearances. Every completed turn in the batter's box counts as a plate appearance.

AB: At bats. Different from PA. An at-bat is a plate appearance that does not end in a walk, hit by pitch, sacrifice, or catcher's interference. It is recorded whether the batter makes an out or reaches base on a hit, error, or fielder's choice.

R: Runs scored. A run is credited when a player crosses home plate.

H: Hits. A hit is credited when the batter reaches base safely on a batted ball without the aid of an error or a fielder's choice.

2B, 3B, HR: Total number of doubles, triples, and home runs, respectively.

SB: Stolen bases.

BB: Base on balls or walks.

SO: Strike outs.

BA: Batting average, measured by the number of hits divided by the number of at bats (AB).

OBP: On base percentage. How often a batter reaches base via a hit, walk, or hit by pitch.

SLG: Slugging. A measure of batting productivity or efficiency, calculated as total bases (TB) divided by at bats (AB).

OPS: Sum of OBP and SLG. Gives measure for overall hitting ability.

OPS+: Normalized OPS. Takes into consideration outside factors of the game such as the stadium being played in or the league you are in. OPS+ of 100 is the average.

TB: Total bases gained by a batter. Sum of all bases reached from singles, doubles, triples, etc.

GDP: Ground into double play.

HBP: Hit by pitch.

SH: Sacrifice hit (sacrifice bunt). An SH is credited when a batter advances one or more base runners with a bunt that results in an out.

SF: Sacrifice fly. An SF is credited when a batter hits a fly ball that is caught for an out and a runner scores on the play. Typically high fly balls to the outfield.

IBB: Intentional Base on Balls or Intentional Walk.

Position: Position this player plays.

Salary: Salary earned in the 2022 season.
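
As a worked illustration of the rate statistics defined above, the sketch below computes PA, TB, BA, OBP, SLG, and OPS from counting stats. The season line is hypothetical (it is not a player in this data set), and the object names are chosen here for illustration only.

```r
# Hypothetical season line (illustrative numbers, not from the data set)
AB <- 500; H <- 150; X2B <- 30; X3B <- 5; HR <- 20
BB <- 60; HBP <- 5; SF <- 5; SH <- 0

singles <- H - X2B - X3B - HR
TB  <- singles + 2 * X2B + 3 * X3B + 4 * HR   # total bases
PA  <- AB + BB + HBP + SF + SH                # plate appearances (ignoring rare interference calls)
BA  <- H / AB                                 # batting average
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)  # on base percentage
SLG <- TB / AB                                # slugging
OPS <- OBP + SLG

round(c(BA = BA, OBP = OBP, SLG = SLG, OPS = OPS), 3)
```

For this made-up line, BA works out to .300 and SLG to .500, and the gap between PA (570) and AB (500) comes entirely from walks, hit by pitches, and sacrifice flies.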

The final 4 variables (Age_Gp, RBI_Gp, BB_Gp, and HRH) were created for this analysis. Each player's Age, RBIs, Walks, and Home runs were binned into groups, and the resulting classifications are stored as factors.

$\mathbf{Note:}$ MLB salaries span a very wide range, which makes patterns hard to see on the raw scale. To remedy this, log(Salary) will be studied in appropriate cases. The log transform preserves the ordering of salaries and turns ratios between salaries into differences, which allows us to visualize the data more efficiently.
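
To make the effect of the log transform concrete, here is a minimal sketch using three hypothetical salaries (illustrative values, not taken from the data set). It shows that gaps on the log scale depend only on the ratio between salaries, and that the ordering is unchanged.

```r
# Hypothetical salaries spanning a wide range (illustrative only)
salary <- c(700000, 7000000, 35000000)

diff(salary)        # raw gaps grow with the size of the salaries
diff(log(salary))   # log gaps depend only on the ratios: log(10) and log(5)

# The transform does not change who earns more than whom
identical(order(salary), order(log(salary)))
```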

Exploratory Analysis
===

Column {.tabset data-width=650}
-----------------------------------------------------------------------

### Boxplots: Salary vs Age Group ~ Walks

```{r graphical displays}
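# Note: median(Salary) is evaluated once over all rows, so the tooltip shows the overall median, not a per-box value
box <- ggplot(mlb_data, aes(x = Age_Gp, y = Salary,
                            text = paste0("Median Salary: ", median(Salary)))) + 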
          geom_boxplot(fill = "#d9adfa", color = "#3b5998") +
          labs(x = "Age Group") +
          scale_y_continuous(labels = label_comma()) +
          theme_classic() +
          facet_wrap(~ BB_Gp, scales = "free_y")

font_box <- list(family = "Mono", size = 20, color = "black")

ggplotly(box, tooltip = "text") %>%
  layout(font = font_box)
```

### Scatter Plot: log(Salary) vs. HR

```{r graphical displays B}
scatter <- ggplot(mlb_data, aes(x = HR, y = log(Salary),
                     text = paste0("Player: ", Name, "\n", "Homeruns: ", HR))) + 
  geom_point(color = "#d9adfa") +
  labs(x = "Homeruns") +
  theme_classic()

font_scat <- list(family = "Mono", size = 15, color = "black")
label_scat <- list(bgcolor = "#dfe3ee", font = font_scat)

ggplotly(scatter, tooltip = "text") %>%
  style(hoverlabel = label_scat) %>%
  layout(font = font_scat)
```

Column {data-height=650}
-----------------------------------------------------------------------

### Analysis

By inspection of the boxplot matrix, we can see clear positive relationships between salary and both Walks and Age. At almost every classification of Age, we can see an upward trending median, interquartile range, and overall spread of data. Further, we can see a similar upward trend at each classification of number of Walks.

Similarly, from the scatter plot there is a clear positive trend between log(Salary) and Home runs. This plot also gives insight into a player's position. Using underlying theory of the game, we can infer that perhaps most of the observations along the line $x = 0$ (no home runs) are pitchers. Another explanation is a lack of playing time. Less playing time directly translates to fewer ABs and, hence, fewer home runs.

### Takeaways

The relationship among these variables is clear. As walks, home runs, and age increase, salary tends to increase as well.

But which variables have the biggest impact on salary? Specifically, does any one of these variables do a markedly better job of forecasting salary than the others?

Variable Selection
===

Row {.tabset}
-----------------------------------------------------------------------

### AIC Algorithm/Linear Model

```{r aic selection, include=FALSE}
fit.null <- lm(Salary ~ 1, data = mlb_data)
fit.AIC <- step(fit.null, scope = Salary ~ Age + HR + RBI + SB + BB + SO + BA + OPS, direction = "forward", k = 2)
```

```{r aic output}
knitr::include_graphics("C:\\R File\\MTH 209\\Final Project\\theone.png")
```

### Salary vs Age, HR, and BB

```{r matrix1}
pairs(~Salary + Age + HR + BB, data = mlb_data)
```

### log(Salary) vs. Age, HR, and BB

```{r matrix2}
pairs(~log(Salary) + Age + HR + BB, data = mlb_data)
```

Row
-----------------------------------------------------------------------

### Analysis

AIC (Akaike Information Criterion): AIC is a criterion for comparing models, and stepwise selection uses it to choose variables. The search operates by forward selection, backward selection, or both. The charm of this approach is that it provides a "best bang for your buck" in terms of variable selection. Overfitting a model to your specific data set is dangerous in regression, and AIC helps guard against it by penalizing each added variable.

How it works: Given a scope of candidate variables, the algorithm adds or removes variables (forward, backward, or both) one step at a time, keeping a change only if it lowers the AIC. The overall goal is to minimize the AIC metric.

$\mathbf{Note:}$ Stepwise AIC selection is sensitive to the model scope. Adding new or different variables to your scope can change your outcome.
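
For reference, the criterion is $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood, so each extra variable must improve the fit enough to pay for its penalty. The sketch below shows forward selection on R's built-in `mtcars` data as a stand-in (it is not the MLB data, and the candidate scope is chosen here only for illustration).

```r
# Stand-in data: mtcars ships with R; it is not the MLB data
fit_null <- lm(mpg ~ 1, data = mtcars)

# Forward selection over a small candidate scope; k = 2 is the AIC penalty
fit_fwd <- step(fit_null,
                scope = mpg ~ wt + hp + qsec + drat,
                direction = "forward", k = 2, trace = 0)

formula(fit_fwd)                                  # terms the search kept
c(null = AIC(fit_null), selected = AIC(fit_fwd))  # lower is better
```

Changing the `scope` formula changes which variables the search is allowed to consider, which is why the selected model depends on it.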

So, after running this algorithm on a subset of offensive statistics, the suggested model is shown. How does this model do?

The OLS regression output is shown below.

From this data set, OLS found that all of our variables are statistically significant at the $0.001$ level. Interpretation is as follows: for each one-unit increase in HR, Age, or BB, on average and all else equal, you can expect a salary increase of $\$236,228$, $\$895,247$, or $\$59,629$, respectively. Also shown is the $\mathbf{R^2}$ value. This is a measure of how much of the variation in salary is explained by our independent variables HR, Age, and BB. The corresponding $\mathbf{R^2}$ was measured to be $0.3689$. Given the wide range of salary and overall low model complexity, this is pretty good.
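
The "all else equal" reading of a coefficient can be checked directly: two observations that differ by one unit in a single predictor have predictions that differ by exactly that predictor's coefficient. The sketch below illustrates this on R's built-in `mtcars` data as a stand-in (not the MLB data); the two rows are hypothetical.

```r
# Stand-in data: mtcars ships with R; it is not the MLB data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ by one unit of hp, all else equal
base   <- data.frame(wt = 3, hp = 100)
bumped <- data.frame(wt = 3, hp = 101)

# The change in the prediction equals the hp coefficient
predict(fit, bumped) - predict(fit, base)
coef(fit)["hp"]

# Share of variation in the response explained by the predictors
summary(fit)$r.squared
```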

To get further insight into the relationships at play, scatter plot matrices are shown. Make note of the linear relationship between Home runs and Walks. On the surface, this may seem random, but using underlying theory of the game, this is expected.

Example: Aaron Judge was the home run leader of the 2022 season. Pitchers are more reluctant to throw "hittable" pitches to a batter with home run power, so they opt for "junk" or simply walk the batter intentionally to avoid the situation altogether.
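
A linear relationship between two predictors, like the one noted between Home runs and Walks, can be quantified with a correlation and a variance inflation factor (VIF). This is a minimal sketch on R's built-in `mtcars` data, where `disp` and `hp` stand in for two correlated predictors; it was not run on the MLB data.

```r
# Stand-in data: mtcars ships with R; disp and hp play the role of two
# correlated predictors
r <- cor(mtcars$disp, mtcars$hp)
r

# VIF for hp: regress it on the other predictor and use 1 / (1 - R^2)
r2_hp  <- summary(lm(hp ~ disp, data = mtcars))$r.squared
vif_hp <- 1 / (1 - r2_hp)
vif_hp
```

A VIF of 1 means a predictor is uncorrelated with the others. Larger values mean its coefficient is estimated less precisely.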

Position Summary
===

Column {.tabset data-width=650}
-----------------------------------------------------------------------

### Position and Avg. Salary

```{r graph (stadium)}
# Install the stadium geoms only if they are missing, rather than on every knit
if (!requireNamespace("GeomMLBStadiums", quietly = TRUE)) {
  devtools::install_github("bdilday/GeomMLBStadiums")
}
library(GeomMLBStadiums)

# Average salary by position, with plotting coordinates for each spot on the
# field (the "Latitude"/"Longitude" columns are x/y positions on the chart)
mlb_position <- data.frame(
  Position   = c("1st Base", "2nd Base", "3rd Base",
                 "Shortstop (SS)", "Pitcher", "Catcher",
                 "Outfield (OF)"),
  Latitude   = c(72, 40, -72, -40, 1, 1, 1),
  Longitude  = c(85, 115, 85, 115, 70, -20, 270),
  Avg_Salary = c(7575904, 4294942, 9047747,
                 6954996, 3414633, 3330355, 5636252)
)
                      
p <- ggplot(mlb_position, aes(x = Latitude, y = Longitude, 
                              text = paste0(Position, ":\n", "Mean Salary: $", round(Avg_Salary, digits = 2)))) +
  geom_spraychart(stadium_transform_coords = TRUE, stadium_segments = "all", stadium_ids = "reds") +
  coord_fixed() +
  theme_void() + 
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(),
        axis.ticks.x=element_blank(), axis.title.y=element_blank(),
        axis.text.y =element_blank(), axis.ticks.y=element_blank(), 
        panel.grid.major = element_blank(), 
        panel.background = element_blank()) +
  geom_point(aes(size = Avg_Salary), color = "#8b9dc3")

font <- list(family = "Sherwood", size = 15, color = "black")
label_p <- list(bgcolor = "#dfe3ee", font = font )

ggplotly(p, tooltip = "text") %>%
  style(hoverlabel = label_p) %>%
  layout(font = font)
```

### Avg. Salary vs. Position

```{r bar}
p_sal <- ggplot(avg_p, aes(x = reorder(Position, Avg_Sal), y = Avg_Sal, 
                        text = paste0("Position: ", Position, "\n", "Avg. Salary: $", round(Avg_Sal, digits = 2)))) + 
  geom_bar(stat = "identity", color = "#dfe3ee", fill = "#8b9dc3") +
  scale_y_continuous(labels = label_comma()) +
  labs(y = "Avg. Salary", x = "Position") +
  theme_classic()

font_p2 <- list(family = "Sherwood", size = 15, color = "black")
label_p2 <- list(bgcolor = "#dfe3ee", font = font_p2)

ggplotly(p_sal, tooltip = "text") %>%
  style(hoverlabel = label_p2) %>%
  layout(font = font_p2)
```

### Avg. Homeruns vs. Position

```{r}
p_hr <- ggplot(avg_p, aes(x = reorder(Position, Avg_HR), y = Avg_HR, 
             text = paste0("Position: ", Position, "\n", "Average Homeruns: ", round(Avg_HR, digits = 2)))) +
  geom_col(color = "#dfe3ee", fill = "#8b9dc3") +
  labs(y = "Avg. HR", x = "Position") +
  theme_classic()

font_p3 <- list(family = "Mono", size = 15, color = "black")
label_p3 <- list(bgcolor = "#dfe3ee", font = font_p3)

ggplotly(p_hr, tooltip = "text") %>%
  style(hoverlabel = label_p3) %>%
  layout(font = font_p3)
```

### Avg. Walks vs. Position

```{r}
p_bb <- ggplot(avg_p, aes(x = reorder(Position, Avg_BB), y = Avg_BB, 
             text = paste0("Position: ", Position, "\n", "Average Walks: ", round(Avg_BB, digits = 2)))) +
  geom_col(color = "#dfe3ee", fill = "#8b9dc3") +
  labs(y = "Avg. BB", x = "Position") +
  theme_classic()

font_p4 <- list(family = "Mono", size = 15, color = "black")
label_p4 <- list(bgcolor = "#dfe3ee", font = font_p4)

ggplotly(p_bb, tooltip = "text") %>%
  style(hoverlabel = label_p4) %>%
  layout(font = font_p4)
```

Column {data-width=550}
-----------------------------------------------------------------------

### Analysis

Graphically shown are salary averages, home run averages, and BB averages based on position.

Taking these statistics and displaying them in increasing order, we are able to more clearly find connections between offensive output and position. Shuffling through the plots, we can see familiar faces at the top of our distribution.

From this data set, it appears that first basemen (1B) and third basemen (3B) have the largest averages of home runs, walks, and salary. As these offensive statistics increase, we would expect salary to increase as well. This further supports the claim made from our linear model.

On the other hand, we can see that pitchers and catchers are often near the bottom of this distribution. From personal understanding of the game, this seems incorrect.

This may be partly explained by the lack of observations in this data set. Another explanation is that pitchers and catchers are the two positions most associated with defensive efficiency, while this data set captures only the offensive aspects. So, perhaps with the presence of defensive statistics, a more accurate understanding of salary can be shown.

Team Summary
===

Column {.tabset data-width=650}
-----------------------------------------------------------------------

### Avg Salary vs. Team

```{r map}
tm_salary <- ggplot(avg_tm, aes(y = reorder(Tm, Avg_Sal), x = Avg_Sal, 
                  text = paste0("Team: ", Tm, "\n", "Average Salary: $", round(Avg_Sal, digits = 2)))) +
  geom_bar(stat = "identity", fill = "#8b9dc3",  color = "#dfe3ee") +
  labs(y = "Team", x = "Avg. Salary") +
  theme_classic() +
  scale_x_continuous(labels = label_comma(), limits = c(0, 12000000))

font_t1 <- list(family = "Mono", size = 15, color = "black")
label_t1 <- list(bgcolor = "#dfe3ee", font = font_t1)

ggplotly(tm_salary, tooltip = "text") %>%
  style(hoverlabel = label_t1) %>%
  layout(font = font_t1)
```

### Avg. Walks vs. Team

```{r}
tm_bb <- ggplot(avg_tm, aes(y = reorder(Tm, Avg_BB), x = Avg_BB, 
                            text = paste0("Team: ", Tm, "\n", "Average Walks: ", round(Avg_BB, digits = 2)))) +
  # geom_col() draws bars from precomputed values; a histogram geom is not needed here
  geom_col(fill = "#8b9dc3",  color = "#dfe3ee") +
  labs(y = "Team", x = "Avg. # of Walks") +
  theme_classic() +
  scale_x_continuous(labels = comma, limits = c(0, 48))

font_t2 <- list(family = "Mono", size = 15, color = "black")
label_t2 <- list(bgcolor = "#dfe3ee", font = font_t2)

ggplotly(tm_bb, tooltip = "text") %>%
  style(hoverlabel = label_t2) %>%
  layout(font = font_t2)
```

### Avg. Homeruns vs. Team

```{r}
tm_hr <- ggplot(avg_tm, aes(y = reorder(Tm, Avg_HR), x = Avg_HR, 
                            text = paste0("Team: ", Tm, "\n", "Average Homeruns: ", round(Avg_HR, digits = 2)))) +
  # geom_col() draws bars from precomputed values; a histogram geom is not needed here
  geom_col(fill = "#8b9dc3",  color = "#dfe3ee") +
  labs(y = "Team", x = "Avg. # of Homeruns") +
  theme_classic() +
  scale_x_continuous(labels = comma, limits = c(0, 20))

font_t3 <- list(family = "Mono", size = 15, color = "black")
label_t3 <- list(bgcolor = "#dfe3ee", font = font_t3)

ggplotly(tm_hr, tooltip = "text") %>%
  style(hoverlabel = label_t3) %>%
  layout(font = font_t3)
```

Column {data-length=650}
-----------------------------------------------------------------------

### Analysis

Before beginning this analysis, keep in mind that the final four teams of the 2022 postseason were the New York Yankees (NYY), San Diego Padres (SDP), Houston Astros (HOU), and Philadelphia Phillies (PHI). These were the "four best teams" of that season.

On inspection of the output, there is a lot of variation in the team summaries. This is to be expected given how fluid and wide-ranging salaries are and how unpredictable sports can be. On closer inspection, though, the most offensively sound teams are not always the ones winning: sports in a nutshell.

It is interesting to see a different team at the top of each distribution. The Dodgers have the biggest checkbook but do not crack the top 5 in any other category. On the other hand, the Brewers are offensively productive ($\#2$ in Avg. Walks and HRs) but could not find a route to the postseason.

Author's Hometown: Cincinnati, OH

Moving on.

Interactive Map
===

### MLB Teams with Summary Statistics

```{r interactive map, out.height=5}
us_map <- map_data("state") %>%
  filter(region != "district of columbia") %>%
  select(-subregion)

us_map$region <- str_to_title(us_map$region)

ont <- map_data("world", region = "canada") %>%
  filter(lat <= 45, lat > 40, long >= -85, long < -75) %>%
  select(-subregion)

us_map2 <- rbind(us_map, ont)

p <- ggplot(us_map2, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), fill = "grey", colour = "white") +
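  # NOTE: the tooltip reads avg_tm columns directly, so this assumes
  # mlb_map and avg_tm list the teams in the same row order.
  geom_point(data = mlb_map, aes(x = longitude, y = latitude,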
                                   text = paste0("Team: ", Tm, "\n", "Average Salary: $", round(avg_tm$Avg_Sal, digits = 2),
                                                 "\n", "Average Homeruns: ", round(avg_tm$Avg_HR, digits = 2), "\n",
                                                 "Average Walks: ", round(avg_tm$Avg_BB, digits = 2), "\n",
                                                 "Average RBIs: ", round(avg_tm$Avg_RBI, digits = 2))), col="#8b9dc3") +
  theme_minimal() +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(), axis.title.y = element_blank(),
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.background = element_blank())

ggplotly(p, tooltip="text")
```

Conclusion
===

Column {data-length=650}
-----------------------------------------------------------------------

### Results

From this analysis, it was found that Walks, Home Runs, and Age have the most explanatory power on a player's salary. Further, with respect to position, third base has the largest average salary. Lastly, the LA Dodgers have the largest average salary among the players in their organization.

### Limitations

Limitations of this study:

Any and all speculation about the use of PEDs (performance-enhancing drugs/substances) is not considered.

A full data set of all active MLB players would give a more complete understanding of salaries. Similarly, a complete data set with both offensive and defensive statistics would better account for players who are vital to a team's success on the defensive side.

The model shows collinearity between HR and BB; future work should remedy this.
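One common way to quantify this kind of collinearity is the variance inflation factor (VIF). The following is only a sketch on simulated data: the data frame `sim`, its simulated columns, and the helper `vif_one()` are hypothetical stand-ins, not the data or code used in this study.

```r
# Hypothetical data: simulated stand-ins for the HR, BB, and Age columns.
set.seed(1)
n   <- 200
HR  <- rpois(n, lambda = 12)
BB  <- pmax(0, round(2.5 * HR + rnorm(n, sd = 8)))  # walks built to track home runs
Age <- sample(22:38, n, replace = TRUE)
sim <- data.frame(HR, BB, Age)

# VIF for one predictor: regress it on the other predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression.
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  aux <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(sim), vif_one, data = sim)
print(round(vifs, 2))
```

A VIF near 1 suggests a predictor is largely independent of the others, while larger values indicate that its coefficient estimate is inflated by overlap with them.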

Lastly, this analysis assumed that salary is linear in its predictors. Theory suggests that the relationship between Age and salary is non-linear (roughly parabolic): a player's productivity peaks at some age, after which additional years bring diminishing returns. Modeling this could better capture the variability in salary.
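A parabolic Age effect could be explored by adding a squared term to the OLS model. The sketch below uses simulated data only: the salary curve, its peak near age 31, and the object names (`sim`, `fit_lin`, `fit_quad`) are assumptions made for illustration, not results from this study.

```r
# Hypothetical data: salary (in $ millions) built to peak near age 31.
set.seed(2)
n      <- 300
Age    <- sample(21:40, n, replace = TRUE)
Salary <- 10 - 0.08 * (Age - 31)^2 + rnorm(n, sd = 1.5)
sim    <- data.frame(Age, Salary)

# Linear fit versus a fit with a squared Age term.
fit_lin  <- lm(Salary ~ Age, data = sim)
fit_quad <- lm(Salary ~ Age + I(Age^2), data = sim)

# Compare the two fits; the lower AIC is preferred.
print(AIC(fit_lin, fit_quad))

# Age at which the fitted parabola peaks: -b1 / (2 * b2).
b <- coef(fit_quad)
peak_age <- unname(-b["Age"] / (2 * b["I(Age^2)"]))
print(peak_age)
```

A negative coefficient on `I(Age^2)` corresponds to a curve that rises, peaks, and then declines, which matches the "cap" on productivity described above.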

### References

https://databases.usatoday.com/mlb-salaries-2022/

https://www.kaggle.com/search?q=mlb+statistics