How to use BeautifulSoup to find "Description" by div class_=css-gz8dae?
Question:
I am new to Python, which I am learning for scraping purposes. I am using BeautifulSoup to collect descriptions from job offers at: https://justjoin.it/offers/itds-net-fullstack-developer-angular
On another job-offer site, the same code with different div classes finds what I need. I wrote this piece of code for justjoin.it:
import requests
from bs4 import BeautifulSoup

link = "https://justjoin.it/offers/jungle-devops-engineer"
response_IDs = requests.get(link)
soup = BeautifulSoup(response_IDs.text, 'html.parser')
Search_part = soup.find(id='root')
description = Search_part.find_all('div', class_='css-gz8dae')
for i in description:
    print(i)
Please help me write working code.
Answers:
As mentioned in the comments, the issue is that the content on this site is rendered with JavaScript, so requests cannot see the dynamically rendered content. Selenium fixes this because it drives a real browser that executes the JavaScript.
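A quick way to confirm this diagnosis before reaching for Selenium is to check whether the class you are targeting appears in the raw HTML at all. A minimal sketch; the static_html string here is a stand-in for response.text from the original code:

```python
from bs4 import BeautifulSoup

def class_present(html: str, css_class: str) -> bool:
    """Return True if any element with the given class exists in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find(class_=css_class) is not None

# A JavaScript-rendered single-page app typically serves little more than
# an empty mount point, so the target class is absent from the raw HTML:
static_html = '<html><body><div id="root"></div></body></html>'
print(class_present(static_html, 'css-gz8dae'))
```

If this prints False for HTML fetched with requests, the content is added client-side and you need a browser-based approach.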
First, make sure you have installed Selenium:
pip install selenium
For Google Colab, add a ! in front of pip install (see below).
As mentioned, I run all my Python on Google Colab, which uses Firefox. This works for me:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

link = "https://justjoin.it/offers/jungle-devops-engineer"

# Set up headless browser (no GUI)
options = Options()
options.headless = True
browser = webdriver.Firefox(options=options)

# Use Selenium to get the page source after JavaScript has executed
browser.get(link)
page_source = browser.page_source
browser.quit()

# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(page_source, 'html.parser')
description = soup.find_all('div', class_='css-gz8dae')
for i in description:
    print(i.text)
This is the output:
Running a flexible Machine Learning engine at scale is hard.
We must ingest and process large volumes of data
uninterruptedly and store it in a scalable manner.
The data needs to be prepared and served to hundreds of
models constantly. All the predictions of the models, as well as other data pipelines, ...
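One caveat worth knowing: class names like css-gz8dae are generated by CSS-in-JS tooling and can change whenever the site rebuilds its styles. Where possible, anchor on something more stable. A sketch of the idea, using a hypothetical HTML fragment shaped like the rendered page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered page source
html = '''
<div id="root">
  <div class="css-gz8dae"><p>Job description text.</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match direct child divs of the stable #root container instead of the
# volatile generated class name
description = soup.select('div#root > div')
for d in description:
    print(d.get_text(strip=True))
```

This way the scraper survives a style rebuild as long as the page structure stays the same.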
If you use Chrome, replace this line:
browser = webdriver.Firefox(options=options)
with this:
browser = webdriver.Chrome(options=options)
To run the whole thing on google colab you need to install selenium and firefox like this first:
!pip install selenium
!apt-get update
!apt install -y firefox
!apt install -y wget
!apt install -y unzip
Then you will also need GeckoDriver, which must be on the system's PATH:
!wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz
!tar -xvf geckodriver-v0.30.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/
After these installations, run the code above.
As Pawel Kam and cconsta1 have explained, a fair amount of JavaScript needs to execute before the website fully renders. If you want the entire HTML of the page, use Selenium (as cconsta1 details in their answer). But if you only want the info in the Description section of the job posting, the following solution is arguably more appropriate.
Getting the JSON file that contains the job Description info.
Using my browser’s Dev Tools, I found that the website makes a GET request to its API (the URL used in the script below) to fetch everything shown on the job posting. Specifically, the response to that request is JSON.
Thus, if you only want the data shown in the job posting, all you have to do is request that JSON and then use BeautifulSoup to parse the HTML fragments inside it for the specific data you want.
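The pattern is: fetch the JSON, pull out the field holding HTML, then hand that fragment to BeautifulSoup. A minimal offline sketch; raw_json is a made-up stand-in for what response.json() would return from the real endpoint:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the API response body; the real payload from
# https://justjoin.it/api/offers/<slug> has the same kind of 'body' field
raw_json = '{"title": "DevOps Engineer", "body": "<div><p>Run ML at scale.</p></div>"}'

payload = json.loads(raw_json)          # what response.json() would give you
soup = BeautifulSoup(payload['body'], 'html.parser')
print(soup.get_text(strip=True))
```

The key point is that the HTML you care about arrives as a string inside a JSON field, not as the page itself.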
I found this article helpful when I was first learning about web scraping by "reverse engineering" a website’s requests.
The following script can be used to get the JSON file and parse the HTML of the Description section:
import requests
import json
from bs4 import BeautifulSoup

def pretty_print_json(json_obj):
    json_string = json.dumps(json_obj, indent=4)
    print(json_string)

def get_json(url, req_headers):
    response = requests.get(url, headers=req_headers)
    # parse the JSON response into a dict
    return response.json()

def find_first_element(html, tag):
    soup = BeautifulSoup(html, 'html.parser')
    # find the first occurrence of the given element
    element = soup.find(tag)
    return element

def pretty_print_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

if __name__ == "__main__":
    url = "https://justjoin.it/api/offers/itds-net-fullstack-developer-angular"
    api_headers = {
        "X-CSRF-Token": "/w2ocZnRs5LN43gzQsi8zWYcdAOVmhjBEpB/dduBn5rnhzjqOnvlo7SsrEdf5Rht3Aa2x/+/00OZJuh3tgmaDA=="
    }

    json_obj = get_json(url, api_headers)

    # view the entire JSON file (in a readable format)
    # to familiarize yourself with its structure
    pretty_print_json(json_obj)

    # access the HTML that makes up the Description section of the job posting
    job_description_html = json_obj['body']

    # look at the job description HTML
    pretty_print_html(job_description_html)

    # get the job summary (i.e. the opening paragraph of the Description section)
    job_summary = find_first_element(job_description_html, 'div').text
    print(job_summary)
The other print outputs are fairly large, so I’ll only show the output of print(job_summary):
As a .NET FullStack Developer (Angular) you will be working on implementing innovative
architectural solutions for our client in the banking sector. Our client is the first
fully online bank in Poland, setting directions for the development of mobile and online
banking. It is one of the strongest and fastest growing financial brands in Poland. Your
key responsibilities:
You’ll have to play around with it to get the exact info you want. Let me know if you need me to clarify anything.
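For instance, once you have the description HTML, you can drill into specific elements such as the bullet list of responsibilities. A sketch using a made-up fragment in place of the real json_obj['body']:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for json_obj['body'] from the API response
job_description_html = '''
<div>
  <p>Your key responsibilities:</p>
  <ul>
    <li>Build Angular components</li>
    <li>Maintain .NET services</li>
  </ul>
</div>
'''
soup = BeautifulSoup(job_description_html, 'html.parser')

# collect the text of every bullet point in the description
responsibilities = [li.get_text(strip=True) for li in soup.find_all('li')]
print(responsibilities)
```

The same find_all approach works for headings, paragraphs, or any other tag the real payload contains.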