How to scrape a table but 'not a table' from a page, using python?

Question

Humble greetings and welcome to anyone willing to spend time here. I shall introduce myself as a very green student of data science and also python. This thread is meant to gain insight from rather more fortunate minds capable of deeper understanding within the realm of python.

As we can see, the value for each row itself could be found easily on the page inspection. But it seems that they all are using the same class name. As for now, i’m afraid i couldnt even find the right keyword to search for any working method in google.

These are the codes that i’ve tried. They dont work and embaressing, but i have to show it anyway. Ive tried fiddling by adding .content, .text, find, find_all, but i understand that my failure lies at even deeper fundamental core.

from bs4 import BeautifulSoup
import requests
from csv import writer
import pandas as pd

url= 'https://m4.mobilelegends.com/stats'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
lists = soup.find('div', class_="m4-team-stats-scroll")

with open('m4stats_team.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Team', 'Win Rate', 'Average KDA', 'Average Kills', 'average Deaths', 'Average Assists', 'Average Game Time', 'Average Lord Kills', 'Average Tortoise Kills', 'Average Towers Destroy', 'First Blood Rate', 'Hero Pool']
    thewriter.writerow(header)

    for list in lists:
        team = list.find_all('p', class_="h3 pl-5 whitespace-nowrap hidden xl:block")
        awr = list.find_all('p', class_="h4")
        akda = list.find('p', class_="h4").text
        akill = list.find('p', class_="h4").text
        adeath = list.find('p', class_="h4").text
        aassist = list.find('p', class_="h4").text
        atime = list.find('p', class_="h4").text
        aalord = list.find('p', class_="h4").text
        atortoise = list.find('p', class_="h4").text
        atower = list.find('p', class_="h4").text
        firstblood = list.find('p', class_="h4").text
        hrpool = list.find('p', class_="h4").text


        info = [team, awr, akda, akill, adeath, aassist, atime, aalord, atortoise, atower, firstblood, hrpool]
        thewriter.writerow(info)

pd.read_csv('m4stats_team.csv').head()

What am i expecting:
Any kind of insight. Whether if it’s clue, keyword, code snippet, i do appreciate and mostfully thankful for any kind of guidance. I am not asking for somehow getting the complete scrapped CSV, as i couldve done it manually. At these point i want to be able to do basic webscraping myself.

Asked By: Hati

||

Source

Answer 1

You can iterate over rows in the table and its items.

from bs4 import BeautifulSoup
import requests

page = requests.get('https://m4.mobilelegends.com/stats')
page.raise_for_status()

page = BeautifulSoup(page.content)

table = page.find("div", class_="m4-team-stats-scroll")

with open("table.csv", "w") as file:
    for row in table.find_all("div", class_="m4-team-stats"):
        items = row.find_all("div", class_="col-span-1")
        # write into file in csv format, use map to extract text from items
        file.write(",".join(map(lambda item: item.text, items)) + "n")

Display output:

import pandas as pd

df = pd.read_csv("table.csv")

print(df)

# Outputs:
"""
      Team ↓Win Rate  ...  ↓First Blood Rate  ↓Hero pool
0     echo     72.0%  ...              48.0%          37
1      rrq     60.9%  ...              60.9%          37
2       tv     60.0%  ...              60.0%          29
3     fcon     55.0%  ...              85.0%          32
4      inc     53.3%  ...              26.7%          31
5     onic     52.9%  ...              47.1%          39
6     blck     52.2%  ...              47.8%          31
7   rrq-br     46.2%  ...              30.8%          32
8      thq     45.5%  ...              63.6%          27
9      s11     42.9%  ...              28.6%          26
10     tdk     37.5%  ...              62.5%          24
11      ot     28.6%  ...              28.6%          21
12     mvg     20.0%  ...              20.0%          15
13  rsg-sg     20.0%  ...              60.0%          17
14    burn      0.0%  ...              20.0%          21
15     mdh      0.0%  ...              40.0%          18

[16 rows x 12 columns]
"""

Answered By: Jurakin

Answer 2

There are several libraries in Python that can be used to scrape tables from a web page, such as BeautifulSoup and pandas. Here’s an example of how you can use BeautifulSoup to scrape a table from a webpage:

import requests
from bs4 import BeautifulSoup 
url = "https://example.com" 
page = requests.get(url) 
soup = BeautifulSoup(page.content, 'html.parser') 
table = soup.find_all('table')[0]

In this example, requests.get(url) retrieves the HTML content of the webpage at the specified URL, and BeautifulSoup(page.content, ‘html.parser’) parses the HTML content. The find_all() method is then used to find all table elements on the page, and the first one is assigned to the variable table.

To scrape a not a table element, you can use the same approach but instead of searching for the table element, you can search for any other element such as div, span, p, etc.

import requests 
from bs4 import BeautifulSoup 
url = "https://example.com" 
page = requests.get(url) 
soup = BeautifulSoup(page.content, 'html.parser') 
not_a_table = soup.find_all('div', {'class': 'not-a-table'})[0]

In this example, soup.find_all(‘div’, {‘class’: ‘not-a-table’}) finds all div elements with class "not-a-table" on the page, and the first one is assigned to the variable not_a_table.

Keep in mind that websites may have privacy policies, terms of service, and copyright laws that prohibit scraping their content without permission. If you are unable to find help from this code, you can follow this guide on data science and data mining process.

Answered By: Abdul Moiz

How to scrape a table but 'not a table' from a page, using python?

Question:

Answers: