Have read_html read *cell content* and *tooltip* (text bubble) separately, instead of concatenating them

Question:

This web page has tooltips (text bubbles) that appear when hovering over values in the "Score" and "XP LVL" columns.

It appears that read_html will concatenate cell content and tooltip. Splitting those in post-processing is not always obvious and I seek a way to have read_html handle them separately, possibly return them as two columns.

This is how the first row appears online:

(Rank)# Name Score XP LVL Victories / Total Victory Ratio
1 Rainin☆☆☆☆ 6129 447 408 / 531 76%
  • where the "6129" in "Score" carries the tooltip "Max 6129"
  • where, more annoyingly, the "447" in "XP LVL" carries the tooltip "21173534 pts"

This is how it appears after reading:

pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]

        #            Name         Score           XP LVL Victories / Total  
0       1      Rainin☆☆☆☆  6129Max 6129  44721173534 pts         408 / 531   

Note that "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in post-processing would require being pretty smart about it, and I would like to explore the "let read_html do the split" option first.

(The special flavor="html5lib" was added because the page is dynamically-generated)

I have not found any mention of tooltips in the docs.

Asked By: OCa

||

Answers:

You can use beautifulsoup to parse the page and then create the dataframe:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://stats.gladiabots.com/pantheon"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

all_data = []
for tr in soup.table.select("tr:has(td)"):
    all_data.append([])
    for td in tr.select("td"):
        all_data[-1].extend(td.get_text(strip=True, separator="###").split("###"))

df = pd.DataFrame(
    all_data, columns=["#", "Name", "Score", "Score2", "XP LVL", "PTS", "V/T", "Ratio"]
)
print(df.head())

Prints:

   #          Name Score    Score2 XP LVL           PTS          V/T Ratio
0  1    Rainin☆☆☆☆  6129  Max 6129    447  21173534 pts    408 / 531   76%
1  2      ZM_XL☆☆☆  5888  Max 6025    344  15942978 pts  3685 / 6748   54%
2  3   UzuraGames☆  5555  Max 5586    119   4688941 pts   610 / 1109   55%
3  4  Markolainen☆  5521  Max 5612    113   4433827 pts   763 / 1255   60%
4  5     Defunct☆☆  5337  Max 5452    225   9999855 pts  1535 / 3066   50%

Answered By: Andrej Kesely

It turns out that this happens because pandas uses the .text attribute of the <td> bs4.element.Tag objects, and that attribute concatenates the texts of all the tag's children without any separator.

In the first row of the table, the score cell has two text children, 6129 and Max 6129, hence the concatenation.

<td nowrap="" class="barContainer">
  <div class="scoreBar" style="width: 100%;"></div>
  <div class="maxScoreBar" style="width: 0%;"></div>
  <span class="barLabel tooltipable">
    "6129"
    <span class="tooltip">
      "Max 6129"
    </span>
  </span>
</td>
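
The difference between .text and get_text with a separator can be checked on a small inline snippet. The following is a minimal sketch, assuming a simplified, hand-written version of the cell above (it is not fetched from the site) and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hand-written, simplified version of the "Score" cell (an assumption for illustration)
html = (
    '<table><tr><td class="barContainer">'
    '<span class="barLabel tooltipable">6129'
    '<span class="tooltip">Max 6129</span></span>'
    '</td></tr></table>'
)
cell = BeautifulSoup(html, "html.parser").td

# .text joins all descendant text nodes with no separator
joined = cell.text

# get_text() can insert a separator between text nodes, which makes them splittable
separated = cell.get_text(separator="|", strip=True)
parts = separated.split("|")

print(joined)     # 6129Max 6129
print(separated)  # 6129|Max 6129
print(parts)      # ['6129', 'Max 6129']
```

The separator "|" is arbitrary; any string that cannot occur in the cell content would do.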

A quick/hacky solution would be to override the _text_getter method of the parser used by pandas, replacing .text with get_text, which accepts a separator parameter:

def _text_getter(self, obj):
    return obj.get_text(separator="_", strip=True)  # "_" is an arbitrary choice of separator

pd.io.html._BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter

With this modification, read_html gives this df:

        #            Name          Score            XP LVL Victories / Total Victory_Ratio
0       1      Rainin☆☆☆☆  6129_Max 6129  447_21173534 pts         408 / 531           76%
1       2        ZM_XL☆☆☆  5888_Max 6025  344_15942978 pts       3685 / 6748           54%
2       3     UzuraGames☆  5555_Max 5586   119_4688941 pts        610 / 1109           55%
..    ...             ...            ...               ...               ...           ...
997   998          Tekuma  3183_Max 3460     27_370585 pts         151 / 304           49%
998   999            hemi  3183_Max 3227      10_49432 pts           29 / 62           46%
999  1000  wanna bet kid?  3183_Max 3304      13_85777 pts           51 / 95           53%

[1000 rows x 6 columns]
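
Because the assignment above changes the parser class for every later read_html call in the session, one option is to limit the override to a block of code. This is only a sketch, assuming the private pandas class named above exists in your pandas version; the helper name tooltip_separator is hypothetical:

```python
from contextlib import contextmanager

from pandas.io import html as pd_html


@contextmanager
def tooltip_separator(sep="_"):
    """Temporarily make the html5lib/bs4 parser join text nodes with `sep`."""
    parser = pd_html._BeautifulSoupHtml5LibFrameParser  # private pandas class
    original = parser._text_getter

    def _text_getter(self, obj):
        return obj.get_text(separator=sep, strip=True)

    parser._text_getter = _text_getter
    try:
        yield
    finally:
        # Restore the default behaviour even if parsing raises
        parser._text_getter = original


# Intended usage (not executed here, it needs network access, bs4 and html5lib):
# with tooltip_separator("_"):
#     df = pd.read_html(url, header=0, flavor="html5lib")[0]
```

Outside the with block, read_html keeps its default behaviour.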

You can then split the two affected columns into separate values:

scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")

out = pd.concat([df, scores, xplvls], axis=1)

Output :

print(out) # with only `scores` and `xplvls`

    Score   Max XPLVL       PTS
0    6129  6129   447  21173534
1    5888  6025   344  15942978
2    5555  5586   119   4688941
..    ...   ...   ...       ...
997  3183  3460    27    370585
998  3183  3227    10     49432
999  3183  3304    13     85777

[1000 rows x 4 columns]
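
The columns produced by str.extract are strings, so a numeric conversion is usually wanted afterwards. A minimal sketch on two hand-copied rows from the output above (the small DataFrame is built inline for illustration and is not read from the site):

```python
import pandas as pd

# Two sample rows in the "value_tooltip" shape shown above
df = pd.DataFrame(
    {
        "Name": ["Rainin", "hemi"],
        "Score": ["6129_Max 6129", "3183_Max 3227"],
        "XP LVL": ["447_21173534 pts", "10_49432 pts"],
    }
)

# Named groups become the column names of the extracted frames
scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+) pts")

# Convert the extracted strings to integers
out = pd.concat([df, scores, xplvls], axis=1).astype(
    {"Score": int, "Max": int, "XPLVL": int, "PTS": int}
)
print(out)
```

The astype call raises if a row did not match the pattern (the extracted value is then NaN), which can serve as a quick check that every row was parsed.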

Answered By: Timeless

To have the text bubbles handled separately from the cell content, you can parse the HTML with the BeautifulSoup package and extract the data yourself instead of relying on read_html.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://stats.gladiabots.com/pantheon'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

data = []
for row in soup.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 0:
        continue
    # Keep only the first text fragment of each cell (the visible value);
    # the tooltip text is the fragment that follows it
    score = cols[2].get_text(separator="|", strip=True).split("|")[0]
    xp_lvl = cols[3].get_text(separator="|", strip=True).split("|")[0]
    data.append([cols[0].text, cols[1].text, score, xp_lvl, cols[4].text, cols[5].text])

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Score', 'XP LVL', 'Victories / Total', 'Victory Ratio'])

Here I assume that, in the "Score" and "XP LVL" columns, the visible value is always the first text fragment of the cell and the text bubble ("Max ..." or "... pts") follows it. If this is not the case, you may need to modify the code accordingly.
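
If the tooltip values should be kept rather than discarded, the tooltip element can be targeted by its class instead of by position. This is a minimal sketch on a hand-written snippet, assuming both cells wrap their tooltip in a span with class "tooltip" (the markup quoted on this page only shows the "Score" cell, so the "XP LVL" structure is an assumption):

```python
from bs4 import BeautifulSoup

# Hand-written sample table (an assumption for illustration, not fetched from the site)
html = """<table>
<tr><th>#</th><th>Name</th><th>Score</th><th>XP LVL</th></tr>
<tr><td>1</td><td>Rainin</td>
<td><span class="tooltipable">6129<span class="tooltip">Max 6129</span></span></td>
<td><span class="tooltipable">447<span class="tooltip">21173534 pts</span></span></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("tr:has(td)"):  # skip the header row
    row = []
    for td in tr.select("td"):
        tip = td.select_one("span.tooltip")
        # Detach the tooltip from the cell so the remaining text is the visible value
        tip_text = tip.extract().get_text(strip=True) if tip is not None else None
        row.append(td.get_text(strip=True))
        if tip_text is not None:
            row.append(tip_text)
    rows.append(row)

print(rows)
# [['1', 'Rainin', '6129', 'Max 6129', '447', '21173534 pts']]
```

Each cell with a tooltip contributes two entries, so the list of column names passed to pd.DataFrame would need one extra name per tooltip column.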

Answered By: Tusher