Cannot scrape some tables using Pandas

Question:

I'm more than a noob in Python, and I'm trying to get some tables from this page:

https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html

Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and "Four Factors" tables. If I print all the tables (there are 19), these two are missing. Inspecting with Chrome, they appear to be tables, and I can also get them in Excel by importing from the web.
What am I missing here?
Any help appreciated, thanks!

Asked By: eestlane


Answers:

If you look at the page source (not the inspector view), you'll see those tables sit inside HTML comments, so pandas never sees them as tables. You can either 1) remove the <!-- and --> markers from the HTML string and then let pandas parse it, or 2) use bs4 to pull out the comments and parse the tables from those.
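To check whether a page hides tables this way before choosing an approach, a minimal sketch using only the standard library could look like the following. The class name CommentTableFinder and the sample HTML are made up for illustration; they are not taken from the page in question.

```python
from html.parser import HTMLParser


class CommentTableFinder(HTMLParser):
    """Collect the HTML comments that contain a <table> element."""

    def __init__(self):
        super().__init__()
        self.commented_tables = []

    def handle_comment(self, data):
        # `data` is the text between <!-- and -->
        if "<table" in data:
            self.commented_tables.append(data)


# Made-up HTML: one visible table, one table wrapped in a comment,
# and one ordinary comment that should be ignored.
sample_html = """
<table id="visible"><tr><td>1</td></tr></table>
<!-- <table id="hidden"><tr><td>2</td></tr></table> -->
<!-- just a note -->
"""

finder = CommentTableFinder()
finder.feed(sample_html)
print(len(finder.commented_tables))  # 1
```

Feeding the downloaded page text to the same class instead of sample_html should report how many comments carry a table.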

I’ll show you both options:

Option 1: Remove the comment tags from the page source

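from io import StringIO

import pandas as pd
import requests

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
# Strip the comment markers so the hidden tables become regular HTML
html = requests.get(url).text.replace("<!--", "").replace("-->", "")

# Wrap in StringIO: recent pandas versions deprecate passing raw HTML strings
dfs = pd.read_html(StringIO(html), header=1)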

Output:

You now have 21 tables, and the 4th and 5th (dfs[3] and dfs[4]) are the ones in question.

print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')

21


        Unnamed: 0   1   2   3   4   T
0  Minnesota Lynx  18  14  22  23  77
1   Seattle Storm  30  26  22  11  89 



   Unnamed: 0  Pace   eFG%  TOV%  ORB%  FT/FGA   ORtg
0        MIN  97.0  0.507  16.1  14.3   0.101   95.2
1        SEA  97.0  0.579  11.8   9.7   0.114  110.1 
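Option 1 strips every comment marker on the page, including those around comments that are not tables. If that turns out to be a problem, one possible variant is to unwrap only the comments that contain a table. The sketch below assumes a hypothetical helper name, uncomment_tables, and uses a made-up HTML string rather than the real page.

```python
import re

# Non-greedy match of one comment; DOTALL lets it span several lines
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def uncomment_tables(html):
    """Unwrap only the comments that contain a <table>; keep the rest."""
    def unwrap(match):
        body = match.group(1)
        # Return the bare body for table comments, the full comment otherwise
        return body if "<table" in body else match.group(0)

    return COMMENT_RE.sub(unwrap, html)


# Made-up input: an ordinary comment followed by a commented-out table
sample_html = (
    "<!-- note -->"
    "<!-- <table><tr><td>1</td></tr></table> -->"
)
cleaned = uncomment_tables(sample_html)
print(cleaned)
```

The returned string can then be handed to pd.read_html in place of the fully stripped page text.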

Option 2: Pull out comments with bs4

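from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

# Tables visible in the regular HTML (reuse the page that was already downloaded)
dfs = pd.read_html(StringIO(result), header=1)

# Collect every HTML comment node in the document
comments = data.find_all(string=lambda text: isinstance(text, Comment))

other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(StringIO(str(each)), header=1)[0])
        except Exception:
            # Skip comments that pandas cannot parse as a table
            continue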

Output:

for each in other_tables:
    print(each, '\n')


       Unnamed: 0   1   2   3   4   T
0  Minnesota Lynx  18  14  22  23  77
1   Seattle Storm  30  26  22  11  89 

  Unnamed: 0  Pace   eFG%  TOV%  ORB%  FT/FGA   ORtg
0        MIN  97.0  0.507  16.1  14.3   0.101   95.2
1        SEA  97.0  0.579  11.8   9.7   0.114  110.1 
Answered By: chitown88