Scraping NBA Data in Python

Scraping NBA data from Basketball Reference in Python.

Scraping Basketball Reference

This is the third post in a short series introducing basketball data analysis in Python. In the first post we created a Python program to compute Pythagorean wins from hard-coded team data. The second post demonstrated how to load the team data from a CSV file stored on GitHub.

In this third post we’ll automate the data collection from Basketball Reference via web scraping. Rather than relying on manual data entry, the Python script will scrape the data from the website directly.

Begin by exploring the HTML structure of the NBA 2022 page on Basketball Reference to identify the elements to scrape. Chrome DevTools or any browser's developer tools can be used to inspect the HTML document.

For instance, at the time of this writing the total team stats are contained within a table element with the id totals-team. The opponent stats (id totals-opponent) and advanced stats containing team wins/losses (id advanced-team) are similarly each contained within their own table element. The screenshot below shows a snippet of the HTML to be scraped for total stats, including the table element as well as the first row of cells containing the data points.

[Screenshot: HTML snippet of the totals-team table on Basketball Reference]
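
The same structure can also be confirmed programmatically. The short inspection script below is a throwaway sketch rather than part of the final program (it assumes requests, beautifulsoup4, and lxml are installed); it fetches the page and prints the ids of the tables it finds, along with the data-stat attribute of each cell in the first data row of the totals-team table.

import requests
from bs4 import BeautifulSoup

# Throwaway inspection script: list table ids and the data-stat names of one row
page = requests.get('https://www.basketball-reference.com/leagues/NBA_2022.html')
soup = BeautifulSoup(page.content, features='lxml')

# Print the id attribute of every table parsed from the document
print([table.get('id') for table in soup.find_all('table')])

# Print the data-stat attribute of each cell in the first data row of totals-team
totals_table = soup.find('table', attrs={'id': 'totals-team'})
if totals_table is not None:
    first_data_row = totals_table.find_all('tr')[1]
    print([cell.get('data-stat') for cell in first_data_row.find_all(['th', 'td'])])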

After exploring the HTML elements to better understand the structure for scraping, the first step in the Python program is to fetch the NBA 2022 web page HTML from Basketball Reference.

import requests

html_url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
nba_html = requests.get(html_url).content
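
As a small safeguard not shown above, the HTTP response can be checked before parsing so that an error page (for example, from rate limiting) is not silently processed; a minimal sketch:

# Optional: fail fast on a bad response instead of parsing an error page
response = requests.get(html_url)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx status codes
nba_html = response.content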

Next, create a BeautifulSoup object with the lxml parser to represent the HTML document as a nested data structure.

from bs4 import BeautifulSoup

soup_instance = BeautifulSoup(nba_html, features='lxml')

Next, the BeautifulSoup object is used to scrape the team name, team wins/losses, points scored, and points allowed by looping over the rows of each table, slicing off the header rows at the beginning and the summary row at the end. If a row unexpectedly does not include an expected cell, the resulting AttributeError exception is caught and ignored.

advanced_stats_soup = soup_instance.find(name='table', attrs={'id': 'advanced-team'})

# Team name, wins, and losses from the advanced stats table
team_records = []
for row in advanced_stats_soup.find_all('tr')[2:-1]:
    try:
        team_records.append({
            'team_name': row.find('td', {'data-stat': 'team'}).text,
            'team_wins': row.find('td', {'data-stat': 'wins'}).text,
            'team_losses': row.find('td', {'data-stat': 'losses'}).text
        })
    except AttributeError:
        pass

team_totals_soup = soup_instance.find(name='table', attrs={'id': 'totals-team'})

# Points scored per team from the team totals table
team_totals = []
for row in team_totals_soup.find_all('tr')[1:-1]:
    try:
        team_totals.append({
            'team_name': row.find('td', {'data-stat': 'team'}).text,
            'points_scored': row.find('td', {'data-stat': 'pts'}).text
        })
    except AttributeError:
        pass

opponent_totals_soup = soup_instance.find(name='table', attrs={'id': 'totals-opponent'})

# Points allowed per team from the opponent totals table
opponent_totals = []
for row in opponent_totals_soup.find_all('tr')[1:-1]:
    try:
        opponent_totals.append({
            'team_name': row.find('td', {'data-stat': 'team'}).text,
            'points_allowed': row.find('td', {'data-stat': 'opp_pts'}).text
        })
    except AttributeError:
        pass

Lastly, combine all the scraped data into a single dictionary keyed by team name, merging each team's record, points scored, and points allowed for the Pythagorean wins computation and summary output.

from itertools import chain

team_data = {}
for team_item in chain(team_records, team_totals, opponent_totals):
    team_name = team_item.get('team_name')
    if team_name in team_data:
        team_data[team_name].update(team_item)
    else:
        team_data[team_name] = team_item

# Sort the merged team data alphabetically by team name
result = dict(sorted(team_data.items()))
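
With the merged data in place, the Pythagorean wins estimate from the earlier posts can be computed. The sketch below is an illustration rather than the exact code from the first post: it assumes an exponent of 13.91 (a commonly cited value for the NBA, which may differ from the value used earlier in the series), converts the scraped string values to numbers, and assumes every team appears in all three tables.

# Illustrative Pythagorean wins calculation; the exponent value is an assumption
EXPONENT = 13.91

for team_name, stats in result.items():
    points_scored = float(stats['points_scored'])
    points_allowed = float(stats['points_allowed'])
    games_played = int(stats['team_wins']) + int(stats['team_losses'])

    # Expected winning percentage: pts^x / (pts^x + opp_pts^x)
    win_pct = points_scored ** EXPONENT / (points_scored ** EXPONENT + points_allowed ** EXPONENT)
    pythagorean_wins = round(win_pct * games_played, 1)

    print(f'{team_name}: {pythagorean_wins} expected wins ({stats["team_wins"]} actual)')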

Note: The Basketball Reference HTML elements could change, albeit infrequently, requiring an update to the Python scraping code. Also, a reminder to be ethical when scraping and using scraped data.
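
One hedge against such markup changes, not included in the program above, is to fail loudly when an expected table is missing rather than letting a later step raise a confusing error; a minimal sketch:

# Guard against markup changes: soup.find returns None when a table id disappears
required_table_ids = ['advanced-team', 'totals-team', 'totals-opponent']
for table_id in required_table_ids:
    if soup_instance.find(name='table', attrs={'id': table_id}) is None:
        raise RuntimeError(f'Expected table "{table_id}" was not found; '
                           'the page structure may have changed.')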

Live Example

The application source code can be found at github.com/kyleaclark/nba-pythaogrean-wins. A few minor changes are needed to run the live example within the browser. Namely, the HTML is fetched from GitHub using Pyodide to work around a few technical limitations of running in the browser. These limitations do not exist when running the Python application locally or from a server, where the data can be scraped dynamically from Basketball Reference.
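
The browser adaptation roughly amounts to reading a saved copy of the HTML instead of requesting Basketball Reference directly. The sketch below shows one way this could look under Pyodide; the raw GitHub URL is a placeholder, so refer to the repository for the actual implementation.

from pyodide.http import open_url

# Placeholder URL for a saved copy of the Basketball Reference page on GitHub
saved_html_url = 'https://raw.githubusercontent.com/<user>/<repo>/main/NBA_2022.html'

# open_url performs a synchronous fetch and returns a StringIO-like object
nba_html = open_url(saved_html_url).read()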

Run the code and wait a few seconds for the HTML to be loaded and processed before the output is displayed. The example demonstrates how NBA data can be fetched, processed, and analyzed with full automation.
