Canada Basketball Analytics

How to Web Scrape Team Canada's FIBA Rankings

Throughout the initial development of this website, much of our web scraping was done manually, either by copying and pasting or by leveraging Google Chrome plugins. If you've ever developed anything before, you know that manual tasks like these get tedious and eat up time that could be spent on more fun and interesting challenges. Because of our love/hate relationship with these manual tasks, we built an automated process that scrapes FIBA rankings (more specifically Team Canada's) and loads them on a recurring basis into an AWS S3 bucket (as a CSV) and into our MySQL database, which this website pulls from to keep the ranking tables updated on a monthly basis. We thought it would be good to share how we accomplished this!

Check out the Python packages we leveraged for this code below.

Anaconda

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. To download and install it, please visit the Anaconda homepage.
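
If you would rather keep this project's dependencies isolated, a dedicated conda environment works well (the environment name and Python version below are only examples):

conda create -n fiba-scraper python=3.8
conda activate fiba-scraper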

Python Packages

Requests

pip install requests

lxml

pip install lxml

Datetime

Included in the Python standard library, so no separate pip install is required.

SQLAlchemy

pip install SQLAlchemy

PyMySQL

pip install PyMySQL

MySQL Connector

pip install mysql-connector-python

Boto3

pip install boto3
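
Note that the code below also imports pandas, which ships with Anaconda; if you are not using Anaconda, it can be installed with pip install pandas. If you would rather install everything in one go, a single command along these lines should cover it:

pip install requests lxml pandas SQLAlchemy PyMySQL mysql-connector-python boto3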

The Code

Also available via GitHub

We first want to import all of the packages we specified above and set some initial settings.

import requests 
import lxml.html as lh
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine
import pymysql
import mysql.connector as sql
import boto3
from io import StringIO


pd.set_option('display.max_columns', None)

By browsing the FIBA rankings website we can pick out the URLs we need to scrape. Since we want to loop through them one by one, we store them in a dictionary keyed by category.

#URLs of the FIBA rankings pages to scrape, keyed by category
urlList = {
    'Men':'https://www.fiba.basketball/rankingmen',
    'Women':'https://www.fiba.basketball/rankingwomen',
    'Boys':'https://www.fiba.basketball/rankingboys',
    'Girls':'https://www.fiba.basketball/rankinggirls'
    }

Next we specify the filename to drop into our AWS S3 bucket (for example, 2020-05-01-FIBA_Rankings.csv when run on May 1, 2020) and set up a blank data frame with column headings.

filename = str(datetime.date(datetime.now())) + '-FIBA_Rankings.csv'
df_rankings_all = pd.DataFrame(columns=['As_Of', 'Worldrank', 'Country', 'Area', 'Zonerank', 'IOC', 'Current points', '+/- Rank *'])

We run the loop below once for each of the four URLs, one per Team Canada category. The code searches each page for the country rankings table and pulls in its contents. Towards the end of the loop we keep only Team Canada (IOC = CAN) and append that row to a data frame, so that after looping through every URL all of the data is stored in a single place.

for key in urlList: 
    #Create a handle, page, to handle the contents of the website
    page = requests.get(urlList[key])
    #Store the contents of the website under doc
    doc = lh.fromstring(page.content)
    #Parse data that are stored between <tr>..</tr> of HTML
    tr_elements = doc.xpath('//tr')
    
    #Create an empty list of (header, values) pairs
    col=[]
    #For the header row, store each element's text alongside an empty list
    for t in tr_elements[0]:
        name=t.text_content()
        col.append((name,[]))
        
    #Since our first row is the header, data is stored from the second row onwards
    for j in range(1,len(tr_elements)):
        #T is our j'th row
        T=tr_elements[j]
        
        #If the row does not have 6 cells, the //tr data is not from our rankings table 
        if len(T)!=6:
            break
        
        #i is the index of our column
        i=0
        
        #Iterate through each element of the row
        for t in T.iterchildren():
            data=t.text_content() 
            #Leave the first column as-is; for the remaining columns,
            #try to convert any numerical value to an integer
            if i>0:
                try:
                    data=int(data)
                except:
                    pass
            #Append the data to the empty list of the i'th column
            col[i][1].append(data)
            #Increment i for the next column
            i+=1
    
    Dict={title:column for (title,column) in col}
    df=pd.DataFrame(Dict)
    
    df = df.replace(r'\r\n','',regex=True)
    
    df = df.loc[df['IOC'] == 'CAN']
    
    df.insert(2,'Area',key)
    df.insert(0, 'As_Of',datetime.date(datetime.now()))

    #df.to_csv('FIBA Rankings.csv', mode='a', index=False)
    #DataFrame.append has been removed in newer pandas, so concat the rows instead
    df_rankings_all = pd.concat([df_rankings_all, df], ignore_index=True)
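
At this point a quick sanity check can be worth doing before anything gets uploaded; the print below is purely optional (not part of the original script) and should show one Team Canada row per category.

#Optional check: expect one CAN row for each of the four categories
print(df_rankings_all)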

We want our newly pulled data available in two different places. The first is a CSV file uploaded to our AWS S3 bucket.

####Load onto S3 bucket 
ACCESS_KEY_ID = <S3 access key ID here> 
ACCESS_SECRET_KEY = <S3 secret key here> 
BUCKET_NAME = <S3 bucket name here> 
FILE_NAME = 'FIBA/Historial_Rankings/' + filename

# S3 Connect
s3 = boto3.resource(
    's3',
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=ACCESS_SECRET_KEY, 
    region_name = 'ca-central-1'
)

csv_buffer = StringIO()
df_rankings_all.to_csv(csv_buffer, index=False)

#Upload File
s3.Object(BUCKET_NAME, FILE_NAME).put(Body=csv_buffer.getvalue())
print ("Code worked to load file into S3 bucket!")

The next place we want to save our data is our AWS MySQL database, so that the tables on this site can refresh from there and we can easily track historical changes without having to combine a bunch of CSV files.

#####Load data into MySQL Database
# Credentials to database connection
host= <MySQL host name here> 
db= <MySQL database name here> 
usr= <MySQL User ID here>  
pwd= <MySQL User ID Password here> 

engine = create_engine("mysql+pymysql://" + usr + ":" + pwd + "@" + host + "/" + db)
print("Connected to MySQLDB")

df_rankings_all.to_sql(con=engine, name= <MySQL database table name here>, if_exists='append',index=False)
print("Code worked to load data into MySQL")

Check out our data page to see what the final CSV file looks like, and if you have any questions feel free to reach out!