Web Scraping Stats & Game Scores from MLB.com Using Requests Library

Chinelo Osuji
11 min read · May 12, 2024


Looking for access to sports data but can’t find a free API?
Thinking about grabbing your credit card and signing up for one?

Why pay for an API subscription when you can just scrape the data yourself?

In this article, we will:

  • “Inspect” the structure of MLB.com’s web page to determine the best way to retrieve the data.
  • Configure URLs to request team/player stats & game scores from MLB.com API endpoints.
  • Automate the process to iterate and collect these stats for all game types over multiple seasons.
  • Execute the API requests.
  • Process the JSON responses.
  • Convert the datasets into pandas DataFrames.

I completed this work using a Google Colab notebook.

Before we start, let’s import the necessary modules.

# Import necessary modules
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests
import time
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings("ignore")

Team Stats

Now, let’s retrieve all Team stats for 2024 so far. We’ll start with Hitting stats for the Regular Season.

We need to find the API endpoint to access the data. Right-click anywhere on the page and click Inspect. This may look different depending on your browser and OS. But the concept will be the same.

The developer tools panel should open on the side of your browser.

First, click Network at the top. Select Fetch/XHR and refresh the page so that all of the requests populate.

Now we have to go through each request under Name and click Preview to find the one that contains the data we are looking for.

Once we get to the ‘team?stitch_env=…’ request, we can see the structured view of the JSON data containing the stats we are looking for.

We can double-click the ‘team?stitch_env=…’ request to get an idea of what the payload looks like.

Next, click Headers and copy the Request URL.

Paste the Request URL in a cell and define it.

# Define URL for fetching MLB team stats for 2024
url = 'https://bdfed.stitch.mlbinfra.com/bdfed/stats/team?stitch_env=prod&sportId=1&gameType=R&group=hitting&order=desc&sortStat=homeRuns&stats=season&season=2024&limit=30&offset=0'
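As an aside, the same request can be written with a params dictionary that requests encodes for you. This is purely a readability alternative; every parameter below is taken straight from the copied URL above.

# Same request expressed with a params dict (readability sketch;
# parameter names are taken verbatim from the copied URL above)
base_url = 'https://bdfed.stitch.mlbinfra.com/bdfed/stats/team'
params = {
    'stitch_env': 'prod',
    'sportId': 1,
    'gameType': 'R',
    'group': 'hitting',
    'order': 'desc',
    'sortStat': 'homeRuns',
    'stats': 'season',
    'season': 2024,
    'limit': 30,
    'offset': 0
}
# requests.get(base_url, params=params) builds the identical query string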

Go back to the browser and scroll down to Request Headers. Copy these headers and define them as a dictionary.

Setting headers helps simulate a request coming from a browser, which can help access websites that have security or browser-check mechanisms.

# Set up request headers to mimic a browser request and manage CORS
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
    'Origin': 'https://www.mlb.com',
    'Pragma': 'no-cache',
    'Priority': 'u=1, i',
    'Referer': 'https://www.mlb.com/',
    'Sec-Ch-Ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
}
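Since we’ll be sending many requests with these same headers, one option is to attach them to a requests.Session once and reuse it. This is optional; the code below sticks with plain requests.get for clarity.

# Optional: attach the headers to a Session so every call reuses them
session = requests.Session()
session.headers.update(headers)
# session.get(url) now sends the same headers as requests.get(url, headers=headers)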

Now let’s make the GET request and parse the JSON response.

# Make GET request
r = requests.get(url=url, headers=headers).json()

And extract the nested ‘stats’ data from the JSON response.

# Extract stats from the response
r['stats']

Here is how the output looks.
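Each entry in r['stats'] is one team’s stat line as a flat dictionary, so before building a DataFrame you can peek at the available fields (the exact names depend on the response, so treat this as a sanity check):

# Peek at the response before building a DataFrame
print(len(r['stats']))             # number of teams returned
print(list(r['stats'][0].keys()))  # available stat fields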

Now let’s convert this structured data into a DataFrame and display it.

# Convert into a pandas DataFrame
mlb_team_hitting_stats = pd.DataFrame(r['stats'])

# Display dataframe
mlb_team_hitting_stats

And here’s a sample of the DataFrame below.

That seems simple, right?

Now we can go back to the MLB team stats page and repeat the same steps for the team Pitching stats.

However, so far we’ve only retrieved stats for one year and one group (Hitting or Pitching) at a time. And we’re only getting stats for the Regular Season, excluding the Wild Card, Division Series, League Championship Series, World Series, etc.

Here we’ll adjust our approach slightly to retrieve stats for both groups, all game types, and multiple seasons.

# DataFrame to store team stats
team_stats = pd.DataFrame()

I chose 1998 as the starting point because it’s the last time MLB expanded. Right before the 1998 season, the Tampa Bay Devil Rays (as they were then known) and the Arizona Diamondbacks joined the league, each paying a $130 million expansion fee.

# Set the range of years from 1998 to the current year
current_year = datetime.now().year
years = range(1998, current_year + 1)
# Define groups
group = ['hitting', 'pitching']
# Defining game types: R for regular season, A for All-Star Game
# P for post season, F for Wild Card, D for Division Series
# L for League Championship Series, W for World Series, S for Spring Training
game_types = ['R', 'A', 'P', 'F', 'D', 'L', 'W', 'S']
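Before kicking off the loop, it helps to estimate how many requests it will make, since we’ll also sleep 1–3 seconds between each one. A rough back-of-the-envelope only:

# Rough estimate of how many requests the loop below will make
n_requests = len(years) * len(group) * len(game_types)
avg_lag = 2  # midpoint of the 1-3 second random sleep used below
print(f'{n_requests} requests, roughly {n_requests * avg_lag / 60:.0f}+ minutes including sleeps')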

Here we loop through and request data for each year, group and game type and then store it in a DataFrame.

# Record start time for performance measurement
begin_loop = time.time()

# Process each year
for y in years:
    for g in group:
        for gt in game_types:
            api_url = f'https://bdfed.stitch.mlbinfra.com/bdfed/stats/team?stitch_env=prod&sportId=1&gameType={gt}&group={g}&order=desc&sortStat=homeRuns&stats=season&season={y}&limit=30&offset=0'
            r = requests.get(url=api_url, headers=headers).json()
            team_stats_df1 = pd.DataFrame(r['stats'])
            # Tag each row with its group and game type
            team_stats_df2 = pd.DataFrame({'group': [g] * len(team_stats_df1), 'gameType': [gt] * len(team_stats_df1)})
            team_stats_df3 = pd.concat([team_stats_df2, team_stats_df1], axis=1)

            team_stats = pd.concat([team_stats, team_stats_df3], ignore_index=True)

            print(f'Finished scraping data for {y} season, team {g} stats, game type {gt}.')
            lag = np.random.uniform(1, 3)  # Random sleep to prevent hitting API rate limits
            print(f'...waiting {round(lag,1)} seconds')
            time.sleep(lag)

# Sort the combined DataFrame
team_stats = team_stats.sort_values(by=['year', 'group', 'gameType', 'homeRuns', 'era'], ascending=[True, True, True, False, True])

# Print the total time taken to process data
print(f'Process completed in {time.time() - begin_loop} seconds')

Here is how the output looks. The entire process took around 18 minutes.

Let’s see the last 25 rows of the DataFrame.

# Fill NaN values with 0
team_stats.fillna(0, inplace=True)

# Reset the index
team_stats = team_stats.reset_index(drop=True)

# Show the last 25 rows
team_stats.tail(25)

Here’s a sample of the Team stats.

The shape attribute shows the number of rows and columns.

# Show DataFrame size
team_stats.shape

Output: (2576, 75)

And here’s a simple way to see the names of all columns in the DataFrame instead of scrolling forever to the right.

# Show all column names
for column in team_stats.columns:
    print(column)

Here’s a sample of the column names.

Player Stats

Now let’s go back to the MLB website and get the data for Player Hitting and Pitching stats using the same steps we used for the Team stats.

This time around, we have to adjust our method of retrieving the stats because there are multiple pages to sort through.

Across all years, the Hitting stats span at most 7 pages and the Pitching stats at most 4, with each page displaying up to 25 players. Knowing this is important for collecting all stats from every page.

Now set an empty DataFrame for the player stats.

# DataFrame to store player stats
player_stats = pd.DataFrame()

Set the time-frame.

# Set the range of years from 1998 to the current year
current_year = datetime.now().year
years = range(1998, current_year + 1)

Define the groups along with their corresponding offsets. Since each page displays stats for up to 25 players, the API returns at most 25 records per request. By using an offset, we can request consecutive blocks of data, each starting where the last one left off. Seven pages of hitting stats means offsets from 0 up to 150 in steps of 25.

# Define groups and their page offsets (25 records per page)
groups = {
    'hitting': ['0', '25', '50', '75', '100', '125', '150'],
    'pitching': ['0', '25', '50', '75']
}
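Hard-coding offsets works here because we know the page counts up front. A more general pattern is to keep paging until a request comes back short. Here’s a minimal sketch; fetch_all_pages is a hypothetical helper, not part of the original workflow, and it assumes the endpoint simply returns fewer than limit records once the last page is reached:

# Sketch: page through results until a short or empty page comes back.
# fetch_all_pages is a hypothetical helper; it assumes the endpoint
# returns fewer than `limit` records once the last page is reached.
def fetch_all_pages(base_url, headers, limit=25):
    all_stats = []
    offset = 0
    while True:
        page_url = f'{base_url}&limit={limit}&offset={offset}'
        batch = requests.get(page_url, headers=headers).json().get('stats', [])
        all_stats.extend(batch)
        if len(batch) < limit:  # short page means we've reached the end
            break
        offset += limit
    return all_stats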

Define all game types.

# Defining game types: R for regular season, A for All-Star Game
# P for post season, F for Wild Card, D for Division Series
# L for League Championship Series, W for World Series, S for Spring Training
game_types = ['R', 'A', 'P', 'F', 'D', 'L', 'W', 'S']

Here we loop through and request data for each year, group, offset, and game type, and then store it in a DataFrame.

# Record start time for performance measurement
begin_loop = time.time()

# Process each year
for y in years:
    for group, offsets in groups.items():
        for offset in offsets:
            for game_type in game_types:
                api_url = f'https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season={y}&sportId=1&stats=season&group={group}&gameType={game_type}&limit=25&offset={offset}&sortStat=earnedRunAverage&order=asc'
                response = requests.get(url=api_url, headers=headers)
                try:
                    response.raise_for_status()
                    r = response.json()
                    player_stats_df1 = pd.DataFrame(r['stats'])
                    player_stats_df2 = pd.DataFrame({'group': [group] * len(player_stats_df1), 'gameType': [game_type] * len(player_stats_df1)})
                    player_stats_df3 = pd.concat([player_stats_df2, player_stats_df1], axis=1)
                    player_stats = pd.concat([player_stats, player_stats_df3], ignore_index=True)
                    print(f'Finished scraping data for {y} season, group {group}, game type {game_type}, offset {offset}.')
                except requests.exceptions.HTTPError as e:
                    print(f'HTTP Error for {y}, group {group}, game type {game_type}, offset {offset}: {str(e)}')
                except Exception as e:
                    print(f'Error for {y}, group {group}, game type {game_type}, offset {offset}: {str(e)}')

                # Random delay to mimic human interaction and avoid being blocked
                lag = np.random.uniform(1, 3)
                print(f'...waiting {round(lag, 1)} seconds')
                time.sleep(lag)

# Sort the combined DataFrame
player_stats = player_stats.sort_values(by=['year', 'group', 'gameType', 'homeRuns', 'era'], ascending=[True, True, True, False, True])

# Print elapsed time for the operation
print(f'Process completed in {time.time() - begin_loop} seconds')

This process took roughly an hour and a half to complete.
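Given that runtime, it’s worth writing the result to disk right away so a crash or a closed Colab session doesn’t force a full re-scrape. The filename here is just an example, and the same idea applies to team_stats above:

# Checkpoint the scraped stats so the long loop doesn't need re-running
player_stats.to_csv('player_stats.csv', index=False)
# Reload later with: player_stats = pd.read_csv('player_stats.csv')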

Again, let’s see the last 25 rows of the DataFrame.

# Fill NaN values with 0
player_stats.fillna(0, inplace=True)

# Reset the index
player_stats = player_stats.reset_index(drop=True)

# Show the last 25 rows
player_stats.tail(25)

And here it is.

Number of rows and columns.

# Show DataFrame size
player_stats.shape

Output: (24477, 116)

And to see all column names, instead of scrolling forever to the right.

# Show all column names
for column in player_stats.columns:
    print(column)

Here’s a sample of the column names.

Game Scores

And lastly, let’s go back to the MLB website and get the data for Game Scores over the seasons.

Using the same Inspect steps as before, we can find the request containing the data we’re looking for and see how it is stored.

# Define URL
url = 'https://statsapi.mlb.com/api/v1/schedule?sportId=1&sportId=51&sportId=21&startDate=2024-05-02&endDate=2024-05-02&timeZone=America/New_York&gameType=E&&gameType=S&&gameType=R&&gameType=F&&gameType=D&&gameType=L&&gameType=W&&gameType=A&&gameType=C&language=en&leagueId=104&&leagueId=103&&leagueId=160&&leagueId=590&hydrate=team,linescore(matchup,runners),xrefId,story,flags,statusFlags,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,game(content(media(epg),summary),tickets),seriesStatus(useOverride=true)&sortBy=gameDate,gameStatus,gameType'

# Set up request headers to mimic a browser request and manage CORS
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
    'Origin': 'https://www.mlb.com',
    'Pragma': 'no-cache',
    'Priority': 'u=1, i',
    'Referer': 'https://www.mlb.com/',
    'Sec-Ch-Ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
}

# Make GET request
r = requests.get(url=url, headers=headers).json()

# Extract away team data from the first game on the first date
r['dates'][0]['games'][0]['teams']['away']

Here is how the output looks.

# Create DataFrame from teams data in the first game on the first date
pd.DataFrame(r['dates'][0]['games'][0]['teams'])

From looking at the DataFrame, we see that we can directly access the score and isWinner rows. We can further dissect this to access the data in the nested leagueRecord and probablePitcher structures.
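Since fields like leagueRecord and probablePitcher are nested dictionaries, pd.json_normalize is a handy way to flatten one side of a game into a single wide row. A quick look at the away team, for example (column names will vary with the hydrations in the response):

# Flatten the away team's nested dictionaries into one wide row;
# nested keys become columns like 'leagueRecord_wins'
away = r['dates'][0]['games'][0]['teams']['away']
pd.json_normalize(away, sep='_')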

The steps are similar to before, with a few adjustments.

# Set the range of years from 1998 to the current year
start_year = 1998
end_year = datetime.now().year
# MLB season start and end months and days
season_start_month = 2
season_start_day = 19
season_end_month = 11
season_end_day = 22
# Defining game types: R for regular season, A for All-Star Game
# P for post season, F for Wild Card, D for Division Series
# L for League Championship Series, W for World Series, S for Spring Training
game_types = ['R', 'A', 'P', 'F', 'D', 'L', 'W', 'S']
# DataFrame to store game scores
game_scores = pd.DataFrame()

Here we loop through and request data for each date range and game type, and then store it in a DataFrame. The game’s ID, season, and date are stored, along with the scores and winner/loser flags for both the away and home teams.

# Process each year
for year in range(start_year, end_year + 1):
    # Define the start and end dates for the season window
    start_date = datetime(year, season_start_month, season_start_day)
    end_date = datetime(year, season_end_month, season_end_day)

    current_date = start_date
    while current_date <= end_date:
        # Define the batch end date (30-day windows)
        batch_end_date = min(current_date + timedelta(days=29), end_date)
        start_formatted_date = current_date.strftime('%Y-%m-%d')
        end_formatted_date = batch_end_date.strftime('%Y-%m-%d')

        # Iterate over each game type for separate API calls
        for game_type in game_types:
            # API URL for each game type
            api_url = f'https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate={start_formatted_date}&endDate={end_formatted_date}&timeZone=America/New_York&gameType={game_type}&language=en&leagueId=104&leagueId=103&leagueId=160&leagueId=590&hydrate=team,linescore(matchup,runners),xrefId,story,flags,statusFlags,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,game(content(media(epg),summary),tickets),seriesStatus(useOverride=true)&sortBy=gameDate,gameStatus,gameType'
            attempts = 0

            while attempts < 5:
                response = requests.get(url=api_url)
                if response.status_code == 200:
                    data = response.json()
                    if 'dates' in data and len(data['dates']) > 0:
                        for date_info in data['dates']:
                            for game in date_info['games']:
                                df_game_scores = pd.DataFrame()
                                for side in ['away', 'home']:
                                    team_data = game['teams'][side]
                                    team_info = pd.json_normalize(team_data['team'])
                                    team_info[f'{side}_score'] = team_data.get('score', np.nan)
                                    team_info[f'{side}_isWinner'] = team_data.get('isWinner', np.nan)
                                    league_record = team_data.get('leagueRecord', {})
                                    team_info[f'{side}_leagueRecord_wins'] = league_record.get('wins', np.nan)
                                    team_info[f'{side}_leagueRecord_losses'] = league_record.get('losses', np.nan)
                                    team_info[f'{side}_leagueRecord_pct'] = league_record.get('pct', np.nan)

                                    team_info.columns = [f'{side}_team_{col}' for col in team_info.columns]
                                    pitcher_info = pd.json_normalize(team_data['probablePitcher']) if 'probablePitcher' in team_data else pd.DataFrame()
                                    if not pitcher_info.empty:
                                        pitcher_info.columns = [f'{side}_pitcher_{col}' for col in pitcher_info.columns]
                                    df_side = pd.concat([team_info.reset_index(drop=True), pitcher_info.reset_index(drop=True)], axis=1)
                                    df_game_scores = pd.concat([df_game_scores, df_side], axis=1)
                                df_game_scores['gamePk'] = game['gamePk']
                                df_game_scores['season'] = game['season']
                                df_game_scores['officialDate'] = game['officialDate']
                                game_scores = pd.concat([game_scores, df_game_scores], ignore_index=True)
                        print(f'Data from {start_formatted_date} to {end_formatted_date} for game type {game_type} stored successfully.')
                        break
                    else:
                        print(f'No games found from {start_formatted_date} to {end_formatted_date} for game type {game_type}. Skipping to next dates.')
                        break
                else:
                    print(f"Failed to retrieve data for game type {game_type}: HTTP {response.status_code} - {response.reason}")
                    attempts += 1
                    time.sleep(2)  # Short delay before retrying

        # Advance to the next date range; the loop keeps going even when no data is available
        current_date = batch_end_date + timedelta(days=1)
        lag = np.random.uniform(1, 3)
        print(f'Waiting {round(lag, 1)} seconds before next request.')
        time.sleep(lag)

print('All data has been collected and stored.')

Below we can see that only data for the current game type at each step is stored, and the loop skips to the next date range if no games are found.
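Before cleaning up, a quick sanity check on the columns we set ourselves in the loop (gamePk, season, officialDate) confirms the coverage:

# Sanity check: date coverage and games per season
print(game_scores['officialDate'].min(), game_scores['officialDate'].max())
print(game_scores['season'].value_counts().sort_index().head())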

Let’s see the last 25 rows of the DataFrame.

# Replace all NaN values with 0
game_scores.fillna(0, inplace=True)
# Sort by officialDate and season in ascending order
game_scores = game_scores.sort_values(by=['officialDate', 'season'], ascending=[True, True])
# Reset the index, dropping the old index
game_scores = game_scores.reset_index(drop=True)
# Display the last 25 rows
game_scores.tail(25)

And here it is.

Number of rows and columns.

# Show DataFrame size
game_scores.shape

Output: (43016, 183)

To see all column names rather than scrolling to the right forever.

# Show all column names
for column in game_scores.columns:
    print(column)

Here’s a sample of the column names.

Thanks for reading! More to come! Stay tuned!


Written by Chinelo Osuji

DevOps | Cloud | Data Engineer | AWS | Broward College Student
