Python requests-html not returning the page content

Question:

I’m new to Python and would like your advice on an issue I’ve encountered recently. I’m doing a small project where I try to scrape a comic website to download a chapter (pictures). However, when I print out the page content for testing (because I tried BeautifulSoup’s select() and got no result), it only shows a single line of JavaScript:

‘document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);’

Any help would be really appreciated.

from requests_html import HTMLSession
session = HTMLSession()

res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.html.html)  # the rendered DOM lives in res.html.html; res.content is the raw response

I also tried this, but the result was the same.

import requests, bs4

url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)

Update: I installed Docker and Splash (on Windows 11) and it worked. I’ve included the updated code below. Thanks to Franz and the others for your help.

import os
import requests, bs4
os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
    imageLink = "https:" + element["data-cdn"]
    res = requests.get(imageLink, stream=True)  # stream so the image isn't held in memory at once
    res.raise_for_status()
    with open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb") as imageFile:
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
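For reference, the Splash service queried above on localhost:8050 is typically started as a Docker container; the image name and default port below follow the standard Splash distribution (adjust if your setup differs):

```shell
# Pull the Splash rendering service and run it in the background on port 8050
docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash
```

Once the container is up, http://localhost:8050/render.html accepts the url and wait parameters used in the script.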
Asked By: Jim


Answers:

import urllib.request
request_url = urllib.request.urlopen('https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html')
print(request_url.read())

This will return the HTML code of the page.
By the way, that HTML lazy-loads several images; you need to use a regex to track down those img URLs and download them.
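A sketch of that regex approach: the class name and data-cdn attribute follow the asker’s update, but the HTML snippet here is made up for illustration.

```python
import re

# A stand-in for the page HTML; the real page lazy-loads images like this
html = (
    '<img class="lazy" data-cdn="//cdn.example.com/1060/page-001.jpg">'
    '<img class="lazy" data-cdn="//cdn.example.com/1060/page-002.jpg">'
)

# Capture the data-cdn attribute of every lazy-loaded img tag
urls = ["https:" + src for src in re.findall(r'data-cdn="([^"]+)"', html)]
print(urls)
```

Each URL in the list can then be fetched and written to disk as in the asker’s update.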

Answered By: Abhi747

This response means the site requires a JavaScript renderer: the script sets a cookie and reloads the page. To get the content, some workaround must be added.
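One such workaround, if the token is not rotated per request (a big if — anti-bot shields often pair the cookie with other checks), is to parse the cookie out of the JavaScript and resend it. A minimal sketch using the response body quoted in the question:

```python
import re

# The anti-bot response body quoted in the question
body = ('document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"'
        '+"; path=/";window.location.reload(true);')

# Pull the cookie name and value out of the inline JavaScript
match = re.search(r'document\.cookie="([^=]+)=([^";]+)', body)
name, value = match.group(1), match.group(2)
print(name, value)

# With requests you could then replay it before re-requesting the page:
#   session = requests.Session()
#   session.cookies.set(name, value)
#   res = session.get(url)
```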


I commonly use Splash, Scrapinghub’s render engine, and adding a short wait lets the page render all of its content. Other tools that render the same way are Selenium for Python or Puppeteer in JS.

Links for Splash and Puppeteer


Answered By: Franz Kurt