Web scraping a JavaScript-rendered dynamic website with Python
Question:
I need to scrape every article, the article titles, and the paragraphs from this page: https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19
The problem is that when I try selectors such as div, h3, or p, nothing is returned.
import requests
from bs4 import BeautifulSoup

def parse_url(url):
    response = requests.get(url)
    content = response.content
    # "lxml" requires the lxml package to be installed
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response

url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"
soup = parse_url(url)
article = soup.find("div", {"class": "article-document"})
article  # -> None: the div is not in the HTML the server sends
The site appears to render its content with JavaScript, but I don't know how to get at it.
Answers:
The website makes 3 API calls to get its data.
The code below makes the same calls and retrieves the same data.
(In the browser, press F12 -> Network -> XHR to see the API calls.)
import requests

payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())

print('------------------')

payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())

print('------------------')

payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())
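The three requests differ only in endpoint and payload, so they can be folded into a small helper. A sketch under the same assumptions as the snippet above; the names `build_payloads` and `fetch_document` are my own, not part of the site's API:

```python
import requests

API_BASE = 'https://portaldogc.gencat.cat/eadop-rest/api/pjc'

# (endpoint, extra payload fields) for each of the three calls the page makes
CALLS = [
    ('getListTraceabilityStandard', {}),
    ('getListValidityByDocument', {'orderBy': 'DESC', 'traceability': '02'}),
    ('documentPJC', {'traceabilityStandard': '02'}),
]

def build_payloads(document_id, language='ca'):
    """Return (url, payload) pairs for the three API calls."""
    pairs = []
    for endpoint, extra in CALLS:
        payload = {'documentId': document_id, 'language': language, **extra}
        pairs.append((f'{API_BASE}/{endpoint}', payload))
    return pairs

def fetch_document(document_id, language='ca'):
    """POST each call in order; return the parsed JSON bodies."""
    results = []
    for url, payload in build_payloads(document_id, language):
        r = requests.post(url, data=payload)
        r.raise_for_status()
        results.append(r.json())
    return results
```

With this, `fetch_document(680124)` performs the same three POSTs as the script above and returns the three JSON responses as a list.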
You can use Selenium to automate browser interaction: it drives a real browser, so you can wait until the JavaScript-rendered content has finished loading.
It also supports headless Chrome, as in the example below.
The following script scrapes the title and all the paragraphs from the URL and saves them to a text file.
import time

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = Chrome(options=chrome_options)

url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"
driver.get(url)
time.sleep(5)  # crude wait for the JavaScript-rendered content to appear

title = driver.find_element(By.CSS_SELECTOR, ".titol-document").text
print(title)

# Skip the first child div (the article heading) and keep the paragraphs
paragraphs = driver.find_element(By.CSS_SELECTOR, "div#art-1") \
                   .find_elements(By.CSS_SELECTOR, "div")[1:]

with open("article.txt", "w", encoding="utf-8") as file:
    for paragraph in paragraphs:
        file.write(paragraph.text + "\n")

driver.quit()
You can adjust the time.sleep delay to suit your network speed, or replace it with one of Selenium's explicit waits (WebDriverWait), which poll for an element instead of sleeping for a fixed time.
You can read more in the Selenium documentation.
Also, as mentioned in the comments on the previous answer, this approach extracts the content with all special characters already decoded.
If you don't want to drive a browser with Selenium or its counterparts (e.g. Puppeteer, Playwright), you can use services that offer JavaScript rendering, such as some of the web scraping APIs recommended in this article.
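To see why the original BeautifulSoup attempt found nothing: requests only receives the initial HTML the server sends, not the DOM after the page's JavaScript has run. A minimal offline illustration; both HTML snippets are invented for the demo, not taken from the real site:

```python
from bs4 import BeautifulSoup

# What the server actually sends: a shell page plus scripts (invented snippet).
served_html = """
<html><body>
  <div id="app"></div>
  <script src="/bundle.js"></script>
</body></html>
"""

# What the browser's DOM looks like after the scripts run (invented snippet).
rendered_html = """
<html><body>
  <div id="app">
    <div class="article-document"><h3>Article 1</h3><p>Text...</p></div>
  </div>
</body></html>
"""

# html.parser is used here to avoid the extra lxml dependency for this demo.
print(BeautifulSoup(served_html, "html.parser").find("div", {"class": "article-document"}))
# None: the div simply is not in the HTML that requests receives
print(BeautifulSoup(rendered_html, "html.parser").find("div", {"class": "article-document"}).h3.text)
# Article 1: the same search succeeds once JavaScript has built the DOM
```

This is why the fix is either to call the underlying APIs directly (first answer) or to run the JavaScript in a browser (second answer).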