How to scrape from a jQuery table using Python
Question:
I am trying to scrape the first ten items from this website. I am using Python with Selenium/BeautifulSoup. It seems the table is loaded by some jQuery script. I am honestly stumped about where to start, as the tutorials and guides don’t match up with this website.
For example, many of them say to check the Network tab in the browser’s developer tools to find the XHR data. This website, however, doesn’t load anything of value in the XHR tab, only in the JS tab. I found the request URL https://www.anime-planet.com/dist/3p/jquery.min.js?t=1657108207 but it doesn’t seem to help.
Am I overthinking things, and should I scrape from the HTML directly? Any advice would be much appreciated.
Answers:
This table is NOT loaded from jQuery. It is server-rendered and easily scrapable. You only need requests and beautifulsoup; Selenium is unnecessary.
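One rough way to confirm that a page is server-rendered is to check whether the text you want already appears in the raw HTML, outside any script block. The sketch below uses only the standard library; the helper name `is_server_rendered` and both sample strings are made up for illustration, standing in for a real response body such as `requests.get(url).text`.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs and anything inside script/style
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def is_server_rendered(html, expected_text):
    """Return True if expected_text appears in the markup's visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return expected_text in " ".join(parser.chunks)


# Hypothetical stand-ins for a response body (not taken from the real site)
static_page = "<table><tr><td class='tableTitle'><a class='tooltip'>One Piece</a></td></tr></table>"
dynamic_page = "<div id='app'></div><script>render(['One Piece'])</script>"

print(is_server_rendered(static_page, "One Piece"))
print(is_server_rendered(dynamic_page, "One Piece"))
```

If the text is found in the raw markup, requests plus an HTML parser is usually enough; if it only shows up inside a script, the page is probably built client-side.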
With some quick DOM inspection, this should be pretty simple. You can do something like this:
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the page and parse the response
page = requests.get('https://www.anime-planet.com/manga/top-manga/week')
soup = BeautifulSoup(page.text, 'html.parser')

top_ten = []
count = 0
for link in soup.select('td.tableTitle > a.tooltip'):
    # Stop once ten titles have been collected
    if count == 10:
        break
    top_ten.append(link.getText())
    count += 1

print(top_ten)
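The counter loop can also be written as a slice over the selected links. Here is a minimal offline sketch of that idea; the `sample_html` fragment and the `first_titles` helper are hypothetical, shaped only to match the selector above, and the real page may nest its cells differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selector above; not copied
# from the real site.
sample_html = """
<table>
  <tr><td class="tableTitle"><a class="tooltip" href="#">Title A</a></td></tr>
  <tr><td class="tableTitle"><a class="tooltip" href="#">Title B</a></td></tr>
  <tr><td class="tableTitle"><a class="tooltip" href="#">Title C</a></td></tr>
</table>
"""


def first_titles(html, limit=10):
    """Return the text of the first `limit` title links in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("td.tableTitle > a.tooltip")
    # select() returns a list-like result, so slicing caps the count
    return [link.get_text(strip=True) for link in links[:limit]]


print(first_titles(sample_html, limit=2))
```

Slicing never raises if the page has fewer than `limit` matches, so the function simply returns whatever is there.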
Here is a solution based on pandas & requests:
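from io import StringIO

import requests
import pandas as pd

# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.anime-planet.com/manga/top-manga/week'
r = requests.get(url, headers=headers)

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(r.text))[0]
print(df[:10])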
Result printed in terminal:
| | Rank | Title | Year | StatusEpsRating |
|---|---|---|---|---|
| 0 | 1 | The Beginning After the End | 2018 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 1 | 2 | Damn Reincarnation | 2022 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 2 | 3 | One Piece | 1997 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 3 | 4 | Omniscient Reader | 2020 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 4 | 5 | The Swordmaster’s Son | 2022 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 5 | 6 | My School Life Pretending To Be a Worthless Person | 2022 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 6 | 7 | Player Who Returned 10,000 Years Later | 2022 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 7 | 8 | Jujutsu Kaisen | 2018 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 8 | 9 | Grim Reaper’s Floating Moon | 2022 | Unread Reading Want to Read Stalled Dropped Won’t Read |
| 9 | 10 | Villains Are Destined to Die | 2020 | Unread Reading Want to Read Stalled Dropped Won’t Read |
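The StatusEpsRating column in the printed result above only carries the text of the site’s status widget, so it may be worth dropping. A minimal sketch of that step follows, using a small hand-built DataFrame with three rows copied from the output instead of a live request; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the printed result above, built by hand so the
# example runs without a network request.
df = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Title": ["The Beginning After the End", "Damn Reincarnation", "One Piece"],
        "Year": [2018, 2022, 1997],
        "StatusEpsRating": ["Unread Reading Want to Read Stalled Dropped Won’t Read"] * 3,
    }
)

# Keep only the informative columns and at most the first ten rows
top_ten = df[["Rank", "Title", "Year"]].head(10)
print(top_ten.to_string(index=False))
```

The same column selection should apply to the DataFrame returned by `pd.read_html`, assuming the column names match those shown in the printed result.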
Relevant pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
And for requests: https://requests.readthedocs.io/en/latest/