This study is part of a series of statistical analyses of the composition and salaries of main and key players in the NBA.
I am using Beautiful Soup for this Python app. Beautiful Soup is a Python library for pulling data out of HTML and XML files (such as web pages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
The data I used came from the Atlanta Hawks roster page on ESPN. Reference: https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks
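To make "navigating, searching, and modifying the parse tree" concrete, here is a minimal sketch on a made-up HTML fragment. The fragment, the player name, and the variable names are assumptions for illustration only; they are not taken from the ESPN page.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration.
html = """
<html><body>
  <h1>Roster</h1>
  <p class="inline">Player A</p>
  <p class="inline">G</p>
  <a href="/team">Team page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: step through the tree by tag name.
title = soup.body.h1.get_text()

# Searching: collect every tag that carries a given class.
cells = [tag.get_text() for tag in soup.find_all(class_="inline")]

# Modifying: change the text of a tag in place.
soup.h1.string = "Team Roster"
new_title = soup.h1.get_text()

print(title, cells, new_title)
```

The same three operations apply to a real page once its source has been downloaded.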
- Import Modules
- Get the URL link
- Navigate the URL Data Structure
- Testing out data requests
- Write data to a file, in pseudo-code:
  - Open a file to write in and append data.
  - Write the headers.
  - Run a for loop that strips the HTML tags and adds the remaining values to an array called results.
  - Run a for loop that writes the elements of the array to the file.
  - When complete, close the file.
- The output is a .csv file with tab-separated fields.
- To access all of the content from the source code of the webpage with Python
- Parse and extract data.
- Save the info in CSV file for further analysis.
Name, POS, Age, HT, WT, College, and Salary for each player on the team roster.
If you don't have Beautiful Soup, install it with 'conda install beautifulsoup4' (or 'pip install beautifulsoup4') in a terminal.
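The file-writing steps listed above can be tried out on made-up data before touching the real page. The sketch below is only an illustration under stated assumptions: the HTML fragments, the two-column header, and the temporary file name roster_sketch.tsv are hypothetical, and an empty fragment is assumed to mark the start of a new row.

```python
import os
import re
import tempfile

TAG_RE = re.compile(r"<[^>]+>")  # matches any HTML tag

# Made-up fragments standing in for scraped tags (assumption, not ESPN data).
fragments = [
    "<div class='inline'></div>",
    "<div class='inline'>Player A</div>",
    "<div class='inline'>G</div>",
    "<div class='inline'></div>",
    "<div class='inline'>Player B</div>",
    "<div class='inline'>F</div>",
]

path = os.path.join(tempfile.gettempdir(), "roster_sketch.tsv")

f = open(path, "w")        # open a file to write in
f.write("Name\tPos\n")     # write the headers

# First loop: strip the HTML tags and collect the remaining text.
results = []
for fragment in fragments:
    results.append(TAG_RE.sub("", fragment))

# Second loop: write the collected values, one row per group.
row = []
for item in results:
    if not item and row:   # empty item after some values: the row is done
        f.write("\t".join(row) + "\n")
        row = []
    elif item:
        row.append(item)
if row:                    # write the last row
    f.write("\t".join(row) + "\n")

f.close()                  # when complete, close the file

with open(path) as check:
    written = check.read()
print(written)
```

Opening the file with a `with` block, as in the final read-back, is the more common idiom because the file is closed even if an error occurs; the explicit open and close are kept here to mirror the listed steps.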
Python requires us to explicitly load the libraries that we want to use:
import requests
import bs4
import re
import pandas as pd

Load a webpage into Python so that we can parse and manipulate it.
URL = 'https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks'

Control of connection. Fetch the page and turn its source code into a Python object.
response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, "html.parser")

Find all the tags with class 'inline'.
data = soup.find_all(attrs={'class': ['inline']})

Open a new file, making sure the path to your data file is correct, then write the headers.
f = open('hilca_nba_team_roster.csv','w')
f.write("Name\tPos\tAge\tHT\tWT\tCollege\tSalary" + "\n")

Clear the HTML tags and assign the remaining text to the array results.
TAG_RE = re.compile(r'<[^>]+>')  # compile the pattern once, outside the loop
results = []
for element in data:
    text = TAG_RE.sub('', str(element))
    results.append(text)

Write the elements of results to the file. An empty element marks the start of a new row.
i = 0  # column position within the current row
j = 0  # number of rows started so far
for item in results:
    if not item:
        i = 0
        j = j + 1
        if j > 1: f.write("\n")  # end the previous row
    else:
        i = i + 1
        if (i == 1): f.write(item + "\t")  # write name and add tabulator
        if (i == 2): f.write(item + "\t")  # write pos and add tabulator
        if (i == 3): f.write(item + "\t")  # write age and add tabulator
        if (i == 4): f.write(item + "\t")  # write ht and add tabulator
        if (i == 5): f.write(item + "\t")  # write wt and add tabulator
        if (i == 6): f.write(item + "\t")  # write college and add tabulator
        if (i == 7): f.write(item)  # write salary (last column, no tabulator)
f.close()  # close file

We used Beautiful Soup as the main tool. The key idea behind Beautiful Soup is that it lets you access elements of a page by following its HTML and CSS structure, such as grabbing all links, all headers, or specific classes. It is a powerful library.
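As a small illustration of grabbing all links, all headers, or a specific class, the sketch below uses Beautiful Soup's `select` method with CSS selectors. The HTML fragment and its contents are made up for the example and do not reflect the structure of the ESPN page.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration.
html = """
<div>
  <h2>Guards</h2>
  <a href="/player/a">Player A</a>
  <h2>Forwards</h2>
  <a href="/player/b">Player B</a>
  <span class="inline">F</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# All links: read the href attribute of every <a> tag.
links = [a["href"] for a in soup.select("a")]

# All headers of one level: the text of every <h2> tag.
headers = [h.get_text() for h in soup.select("h2")]

# A specific class: every tag whose class list contains "inline".
inline = [tag.get_text() for tag in soup.select(".inline")]

print(links, headers, inline)
```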
Once we grab elements, Python makes it easy to write the elements or relevant components of the elements into other files, such as a CSV, that can be stored in a database or opened in other software.
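Because the script separates fields with tabs, a reader of the saved file needs to be told about the delimiter. The sketch below shows one way to read such data back with the standard csv module. The sample row is made up, and the salary clean-up assumes values formatted like "$1,000,000"; the real file may need different handling.

```python
import csv
import io

# Made-up sample in the same tab-separated layout as the output file.
header = ["Name", "Pos", "Age", "HT", "WT", "College", "Salary"]
player = ["Player A", "G", "25", "6' 3\"", "200 lbs",
          "Example University", "$1,000,000"]
sample = "\t".join(header) + "\n" + "\t".join(player) + "\n"

# QUOTE_NONE keeps the inch mark in the height as a literal character.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
rows = list(reader)

# Turn the salary text into a number for later analysis.
salary = int(rows[0]["Salary"].replace("$", "").replace(",", ""))

print(rows[0]["Name"], rows[0]["HT"], salary)
```

To read the real file, replace the `io.StringIO(sample)` argument with a file object opened on the saved output.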



