Have read_html read *cell content* and *tool tip* (text bubble) separately, instead of concatenating them
Question:
This site page has tooltips (text bubbles) that appear when hovering over values in the "Score" and "XP LVL" columns.
It appears that read_html concatenates the cell content and the tooltip. Splitting them in post-processing is not always straightforward, so I am looking for a way to have read_html handle them separately, possibly returning them as two columns.
This is how the first row appears online:
(Rank)# | Name | Score | XP LVL | Victories / Total | Victory Ratio |
---|---|---|---|---|---|
1 | Rainin☆☆☆☆ | 6129 | 447 | 408 / 531 | 76% |
- where "Score"'s "6129" carries tooltip "Max 6129"
- where, more annoyingly, "XP LVL"'s "447" carries tooltip "21173534 pts"
This is how it appears after reading:
pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]
# Name Score XP LVL Victories / Total
0 1 Rainin☆☆☆☆ 6129Max 6129 44721173534 pts 408 / 531
Note that "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in post-processing would require some cleverness, and I would like to explore the "let read_html do the split" route first.
(The special flavor="html5lib" was added because the page is dynamically generated.)
I have not found any mention of tooltips in the docs.
Answers:
You can use BeautifulSoup to parse the page and then create the dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://stats.gladiabots.com/pantheon"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

all_data = []
for tr in soup.table.select("tr:has(td)"):
    all_data.append([])
    for td in tr.select("td"):
        all_data[-1].extend(td.get_text(strip=True, separator="###").split("###"))

df = pd.DataFrame(
    all_data, columns=["#", "Name", "Score", "Score2", "XP LVL", "PTS", "V/T", "Ratio"]
)
print(df.head())
Prints:
# Name Score Score2 XP LVL PTS V/T Ratio
0 1 Rainin☆☆☆☆ 6129 Max 6129 447 21173534 pts 408 / 531 76%
1 2 ZM_XL☆☆☆ 5888 Max 6025 344 15942978 pts 3685 / 6748 54%
2 3 UzuraGames☆ 5555 Max 5586 119 4688941 pts 610 / 1109 55%
3 4 Markolainen☆ 5521 Max 5612 113 4433827 pts 763 / 1255 60%
4 5 Defunct☆☆ 5337 Max 5452 225 9999855 pts 1535 / 3066 50%
It turns out that this is because pandas uses the .text attribute of the <td> bs4.element.Tag objects, which concatenates (without any separator) the text of all the tag's children.
In the first row of the table, the score cell has two text children, 6129 and Max 6129, hence the concatenation.
<td nowrap="" class="barContainer">
  <div class="scoreBar" style="width: 100%;"></div>
  <div class="maxScoreBar" style="width: 0%;"></div>
  <span class="barLabel tooltipable">
    "6129"
    <span class="tooltip">
      "Max 6129"
    </span>
  </span>
</td>
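To make the concatenation concrete, here is a minimal sketch that uses only Python's standard-library html.parser instead of BeautifulSoup or pandas. The sample cell is a hand-written simplification of the markup above (the quotation marks shown by the browser inspector are assumed not to be part of the real text), and TextNodeCollector is a hypothetical helper name. It shows that the cell holds two separate text nodes, and that joining them without a separator produces the fused string seen in the dataframe.

```python
from html.parser import HTMLParser

# Hand-written simplification of the "Score" cell (assumption: the real text
# nodes are 6129 and Max 6129, without the quotes shown by the inspector).
SAMPLE_CELL = """
<td nowrap="" class="barContainer">
  <div class="scoreBar" style="width: 100%;"></div>
  <div class="maxScoreBar" style="width: 0%;"></div>
  <span class="barLabel tooltipable">6129<span class="tooltip">Max 6129</span></span>
</td>
"""


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: keep each non-empty text node as its own item."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


collector = TextNodeCollector()
collector.feed(SAMPLE_CELL)
collector.close()

concatenated = "".join(collector.chunks)  # no separator: values fuse together
separated = "_".join(collector.chunks)    # explicit separator keeps them apart

print(collector.chunks)
print(concatenated)
print(separated)
```

The same idea applies with BeautifulSoup: passing a separator when collecting the text keeps the visible value and the tooltip distinguishable.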
A quick/hacky solution would be to override the _text_getter method of the parser used by pandas and replace .text with get_text, which has a separator parameter:
def _text_getter(self, obj):
    return obj.get_text(separator="_", strip=True)  # I chose "_"

pd.io.html._BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter
With this modification, read_html gives this df:
# Name Score XP LVL Victories / Total Victory_Ratio
0 1 Rainin☆☆☆☆ 6129_Max 6129 447_21173534 pts 408 / 531 76%
1 2 ZM_XL☆☆☆ 5888_Max 6025 344_15942978 pts 3685 / 6748 54%
2 3 UzuraGames☆ 5555_Max 5586 119_4688941 pts 610 / 1109 55%
.. ... ... ... ... ... ...
997 998 Tekuma 3183_Max 3460 27_370585 pts 151 / 304 49%
998 999 hemi 3183_Max 3227 10_49432 pts 29 / 62 46%
999 1000 wanna bet kid? 3183_Max 3304 13_85777 pts 51 / 95 53%
[1000 rows x 6 columns]
And this way, you can extract / separate the values of the two columns concerned:
scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")
out = pd.concat([df, scores, xplvls], axis=1)
Output :
print(out) # with only `scores` and `xplvls`
Score Max XPLVL PTS
0 6129 6129 447 21173534
1 5888 6025 344 15942978
2 5555 5586 119 4688941
.. ... ... ... ...
997 3183 3460 27 370585
998 3183 3227 10 49432
999 3183 3304 13 85777
[1000 rows x 4 columns]
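As a quick way to check the named-group patterns without fetching the page, the sketch below applies them to two sample strings copied from the table above, using only the standard-library re module. The helper name split_cell is hypothetical and not part of pandas; with pandas, str.extract turns each named group into a column in the same way.

```python
import re

# Named groups become the output keys (and, with pandas str.extract, the column names).
SCORE_RE = re.compile(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
XPLVL_RE = re.compile(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")


def split_cell(pattern, cell):
    """Hypothetical helper: return the named groups as a dict, or None if no match."""
    match = pattern.search(cell)
    return match.groupdict() if match else None


# Sample values taken from the second row of the table shown earlier.
score = split_cell(SCORE_RE, "5888_Max 6025")
xp = split_cell(XPLVL_RE, "344_15942978 pts")

print(score)
print(xp)
```

Because the separator is explicit, the variable number of digits in "XP LVL" no longer matters: each group simply matches up to the underscore.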
To have the text bubbles handled separately, you can use the BeautifulSoup package to parse the HTML and extract the data yourself.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://stats.gladiabots.com/pantheon'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

data = []
for row in soup.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 0:
        continue  # header rows have no <td> cells
    # Score: keep what precedes the "Max ..." tooltip
    score = cols[2].text.split('Max')[0].strip()
    # XP LVL: the level and the "... pts" tooltip are both digits, so take the
    # first text node rather than splitting the concatenated string on ' pts'
    xp_lvl = cols[3].get_text(separator='|', strip=True).split('|')[0]
    data.append([cols[0].text, cols[1].text, score, xp_lvl, cols[4].text, cols[5].text])

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Score', 'XP LVL', 'Victories / Total', 'Victory Ratio'])
Here I assume that the "Score" column always has the "Max" text bubble, and that in the "XP LVL" column the level always comes first, followed by the " pts" text bubble. If this is not the case, you may need to modify the code accordingly.