By Papa-Yaw Afari
As part of this VR basketball visualization project, we need shot-level data including shot locations, outcomes, and player context. While APIs like nba_api offer this data, they're often rate-limited or blocked. This tutorial documents a reliable alternative using web scraping to pull shot data directly from Basketball-Reference.com.
There are generally two approaches to scraping data from an online source such as Basketball Reference. The first is a short script that uses existing Python libraries, such as requests, to fetch individual pages and extract the data from them. The second is a crawling framework that walks many pages automatically, which is covered at the end of this tutorial. This section focuses on script-based web scraping.
First, check that you have all the necessary libraries installed.
"pip install requests beautifulsoup4 pandas"
Then import them at the top of your script:
"import pandas as pd"
"import requests"
"from bs4 import BeautifulSoup"
For my project, we are looking at Jayson Tatum's shot data from a specific playoff game that carries a lot of cultural significance because of how well he played. I will be comparing it with similar games from other players who also reached the 50-point benchmark. So, I scraped all of the shot data from his game on April 15, 2023.
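Choose a game URL from Basketball-Reference.com.
Follow this format: https://www.basketball-reference.com/boxscores/shot-chart/YYYYMMDD0TEAM.html
Here YYYYMMDD is the game date, followed by a 0 and the home team's three-letter abbreviation (for example, BOS).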
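The URL format above can be sketched as a small helper. This is only an illustration: the function name build_shot_chart_url is hypothetical, and it assumes the game ID is the date, a literal 0, and the home team's abbreviation. BOS is used as the assumed home team for the April 15, 2023 game.

```python
from datetime import date

BASE_URL = "https://www.basketball-reference.com/boxscores/shot-chart/"


def build_shot_chart_url(game_date: date, home_team: str) -> str:
    """Return the shot-chart URL for a game (hypothetical helper)."""
    # Assumption: game ID = YYYYMMDD + "0" + home team abbreviation.
    game_id = f"{game_date:%Y%m%d}0{home_team.upper()}"
    return f"{BASE_URL}{game_id}.html"


# Example: a game played in Boston on April 15, 2023.
print(build_shot_chart_url(date(2023, 4, 15), "BOS"))
```

It is worth opening the generated URL in a browser first to confirm the page exists before pointing the scraper at it.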
Create a new Python file (File -> Create New Python File in your editor) and then follow the steps below.
Script Steps:
# URL of the game you want to scrape
url = "https://www.basketball-reference.com/....."
response = requests.get(url)
response.raise_for_status()  # stop early if the page was blocked or not found
soup = BeautifulSoup(response.text, 'html.parser')
# Find all shots (divs with tooltips)
shots = soup.find_all("div", class_="tooltip")
# Extract shots for a specific player
player_name = "Jayson Tatum"
results = []
for shot in shots:
    # Default to empty strings so a div without these attributes is skipped
    tip = shot.get("tip", "")
    style = shot.get("style", "")
    if player_name in tip and "left:" in style and "top:" in style:
        # Shot position in pixels, read from the inline style attribute
        x = int(style.split('left:')[1].split('px')[0].strip())
        y = int(style.split('top:')[1].split('px')[0].strip())
        # The tooltip wording may be "made" or "makes" depending on the page
        made = "made" in tip or "makes" in tip
        description = tip.replace("<br>", " ")
        results.append({
            "player": player_name,
            "x": x,
            "y": y,
            "result": "made" if made else "missed",
            "description": description
        })
# Save to CSV
df = pd.DataFrame(results)
df.to_csv(f"{player_name.replace(' ', '_')}_shots.csv", index=False)
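The script reads each shot's position by splitting the style string on left: and top:. A slightly more defensive way to do the same thing is a regular expression, sketched below. The helper name parse_position and the sample style strings are hypothetical, used only for illustration, and the real attribute on the page may be formatted differently.

```python
import re

# Matches "left:259px" or "top: 331px", but not "margin-left:5px".
_POSITION = re.compile(r"(?<![\w-])(left|top)\s*:\s*(-?\d+)\s*px")


def parse_position(style):
    """Return (x, y) in pixels, or None if left or top is missing."""
    found = {name: int(value) for name, value in _POSITION.findall(style or "")}
    if "left" not in found or "top" not in found:
        return None
    return found["left"], found["top"]


# Hypothetical style strings for illustration.
print(parse_position("top:331px;left:259px;"))
print(parse_position("color:red;"))
```

Returning None instead of raising an error lets the loop skip a malformed element and keep going.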
For a full-season or full-playoff dataset, use the open-source shotChart project, which is built with Scrapy. I have linked it below:
Follow these steps:
cd shotChart
pip install -r requirements.txt
To define which dates to crawl, edit the file calendar.json, which tells the scraper the start and end dates of the range you're interested in (for example, the dates of the 2023 playoffs). Then run the crawl, filling in the bracketed placeholders:
"scrapy crawl basketball-reference -o shots-[XXX]-playoffs.csv -a season=[XXXX]"
The output CSV will include shot data for all players and games during the date range you specify.
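Because the crawl output covers every player, you will usually want to filter it down to one player. Here is a minimal sketch using only the standard library. The function name filter_player_rows is hypothetical, and the column name "player" is an assumption, so check the header row of your own CSV and pass the right name as player_column.

```python
import csv
import io


def filter_player_rows(csv_file, player, player_column="player"):
    """Return the rows whose player column equals the given name."""
    # Assumption: the CSV has a header row naming the player column.
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get(player_column) == player]


# Made-up sample data standing in for a real crawl output file.
sample = io.StringIO(
    "player,x,y,result\n"
    "Jayson Tatum,259,331,missed\n"
    "Al Horford,120,80,made\n"
)
rows = filter_player_rows(sample, "Jayson Tatum")
print(len(rows), rows[0]["result"])
```

With a real file, open it with open(path, newline="") and pass the file object in place of the sample.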