Scrape and extract job data from Google Jobs using Selenium and store it in a pandas DataFrame

Question:

Hi, I’m new to Stack Overflow. Apologies in advance if the post is not well structured.

I have been learning web scraping in Python, and as part of a hobby project I am trying to scrape Google Jobs and extract specific data to store in a pandas DataFrame.
I’m using Selenium in Python to achieve this.

The main challenge was figuring out how to scrape all the job records returned by the search query (url = Google Jobs). This is difficult because Google Jobs loads results dynamically (infinite scrolling): the page initially shows only 10 results in the side panel, and each scroll down loads just 10 more.

Website preview

I used Selenium for this. I figured I could automate the scrolling by instructing Selenium to scroll the list element (<li>) of the last job entry in the side panel into view, and repeat that in a loop until all results are loaded onto the page.

Then I just had to extract the list elements and store their text in a DataFrame.

The problem is that each job entry has anywhere between 3 and 6 lines of text, with each line representing an attribute such as job title, company name, or location. Since the number of lines differs between entries, some entries have more lines than others.

Different number of lines for each job entry

So when I split the text into a Python list using '\n' as the separator, I get lists of different lengths.
This becomes a problem when I use pd.DataFrame(list) to generate a DataFrame, resulting in records with the fields in a jumbled order (see the sketch below).

Different Length Lists
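For example, here is a minimal made-up illustration of what happens when rows with different numbers of lines are passed to pd.DataFrame:

import pandas as pd

# Two made-up job entries: the first has 6 lines, the second is missing
# the location and posted-date lines, so it only has 4.
rows = [
    "Data Analyst\nAcme Corp\nNew York, NY\nvia LinkedIn\n2 days ago\nFull-time".split('\n'),
    "BI Analyst\nGlobex\nvia Indeed\nFull-time".split('\n'),
]
print(pd.DataFrame(rows))
# pd.DataFrame aligns values purely by position, so "via Indeed" ends up
# under the Location column and "Full-time" under the Source column.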

Below is the code I have come up with:

#imports
import pandas as pd
import numpy as np
from serpapi import GoogleSearch
import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

#using selenium to launch and scroll through the Google Jobs page
url = "https://www.google.com/search?q=google+jobs+data+analyst&oq=google+jobs+data+analyst&aqs=chrome..69i57j69i59j0i512j0i22i30i625l4j69i60.4543j0j7&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&ved=2ahUKEwjXsv-_iZP9AhVPRmwGHX5xDEsQutcGKAF6BAgPEAU&sxsrf=AJOqlzWGHNISzgpAUCZBmQA1mWXXt3I7gA:1676311105893#htivrt=jobs&htidocid=GS94rKdYQqQAAAAAAAAAAA%3D%3D&fpstate=tldetail"
driver = webdriver.Chrome()
driver.get(url)
joblist =[]

#pointing to the html element to scroll to
elementxpath = '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
#capturing all the job list objects in the first page
datas = driver.find_elements(By.XPATH,'//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li')
joblist.append([da.text for da in datas])

#adding 3s delay for website to load after scrolling before executing code
time.sleep(3)

#capturing all the job list objects in the second set of 10 results loaded after 1st scroll down
elementxpath = '//*[@id="VoQFxe"]/div/div/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div/div/ul/li')
joblist.append([da.text for da in datas])
x=2
time.sleep(3)

#using a while loop to scroll and capture the remaining results, as the element xpath is in an iterable format unlike the previous 2 xpaths
while True:
    elementxpath = '//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li[10]'
    element = driver.find_element(By.XPATH,elementxpath)
    driver.execute_script('arguments[0].scrollIntoView(true)',element)
    x+=1
    time.sleep(3)
    datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li')
    joblist.append([da.text for da in datas])
    if x>1000:
        break
    else:
        continue

#unpacking and cleaning captured values from joblist to a newlist of lists in the desired format for creating a dataframe
jlist = []
for n in joblist:
    for a in range(0,len(n)-1):
        if n[a]!='':
            jlist.append(n[a].split('\n'))

jobdf = pd.DataFrame(jlist)
jobdf.columns = ['Logo','Role', 'Company', 'Source','Posted','Full / Part Time', 'Waste']
jobdf

This is the output data frame:

Jumbled mess

Men and women of culture, I implore your help to get an ordered DataFrame that makes sense. Thank you!

Asked By: Richard T Vetticad


Answers:

Usually you can use .split('\n') only in simple cases, but here it is a bad idea. A better practice is to use a unique XPath for each element you want to scrape: one for the logo, one for the role, etc.

Another good practice is to initialize a dictionary at the beginning with one key for each element you want to scrape, and then append data as you loop over the jobs.

The following code does exactly this. It is not optimized for speed: it scrolls to each job and scrapes it, whereas a faster approach would be to scrape all the currently displayed jobs, scroll to the bottom, scrape the newly loaded jobs, and so on.

# imports (as in the question, plus NoSuchElementException)
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# load the Google Jobs webpage (same `url` as in the question)
driver = webdriver.Chrome()
driver.get(url)

xpaths = {
 'Logo'            :"./div[1]//img",
 'Role'            :"./div[2]",
 'Company'         :"./div[4]/div/div[1]",
 'Location'        :"./div[4]/div/div[2]",
 'Source'          :"./div[4]/div/div[3]",
 'Posted'          :"./div[4]/div/div[4]/div[1]",
 'Full / Part Time':"./div[4]/div/div[4]/div[2]",
}
data = {key:[] for key in xpaths}
jobs_to_do = 100
jobs_done = 0

while jobs_done < jobs_to_do:
    lis = driver.find_elements(By.XPATH, "//li[@data-ved]//div[@role='treeitem']/div/div")
    
    for li in lis[jobs_done:]:
        driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', li)
        
        for key in xpaths:
            try:
                t = li.find_element(By.XPATH, xpaths[key]).get_attribute('src' if key=='Logo' else 'innerText')
            except NoSuchElementException:
                t = '*missing data*'
            data[key].append(t)
        
        jobs_done += 1
        print(f'{jobs_done=}', end='\r')
        time.sleep(.2)

Then, by running pd.DataFrame(data), you get something like this:

[screenshot of the resulting DataFrame]
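For reference, building the DataFrame from the collected dictionary is a one-liner; a minimal sketch:

import pandas as pd

# every list in `data` has the same length, because '*missing data*' is
# appended whenever an element is absent, so the columns stay aligned
jobdf = pd.DataFrame(data)
print(jobdf.head())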

As you can see from the image, some values in the "Posted" column should instead be in the "Full / Part Time" column. This happens because some jobs have no information about the posted time. I also noticed that some jobs have not only "posted" and "full/part time" data but also a "salary". So you should adjust the code to account for these cases. It is not easy, because the HTML elements don’t have specific classes for each field, so I think you have to exploit the svg symbols (clock, bag and banknote) shown in this image

[screenshot of the clock, bag and banknote svg icons]

UPDATE

I tried using the svg paths to correctly scrape "posted", "full/part time" and "salary", and it works! Here are the paths:

xpaths = {
 'Logo'            :"./div[1]//img",
 'Role'            :"./div[2]",
 'Company'         :"./div[4]/div/div[1]",
 'Location'        :"./div[4]/div/div[2]",
 'Source'          :"./div[4]/div/div[3]",
 'Posted'          :".//*[name()='path'][contains(@d,'M11.99')]/ancestor::div[1]",
 'Full / Part Time':".//*[name()='path'][contains(@d,'M20 6')]/ancestor::div[1]",
 'Salary'          :".//*[name()='path'][@fill-rule='evenodd']/ancestor::div[1]"
}

Replace the old paths with the new ones and it will work as expected, as shown in the picture below

[screenshot of the corrected DataFrame]

Answered By: sound wave

You imported GoogleSearch from SerpApi, but you don’t use it in the code you’ve shown.

Since you can extract the data with the Google Jobs API, this answer will focus on just that: extracting information with pagination.

Pagination can be implemented using a while loop:

while True:
    search = GoogleSearch(params)               # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()             # JSON -> Python dict

    if 'error' in result_dict:
        break

    # data extraction will be here

    params['start'] += 10

Check the full code in the online IDE.

from serpapi import GoogleSearch
import json

params = {
    'api_key': "...",                           # https://serpapi.com/manage-api-key
    'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',     # encoded location (USA), https://site-analyzer.pro/services-seo/uule/
    'q': 'google jobs data analyst',            # search query
    'hl': 'en',                                 # language of the search
    'gl': 'us',                                 # country of the search
    'engine': 'google_jobs',                    # SerpApi search engine
    'start': 0                                  # pagination
}

google_jobs_results = []

while True:
    search = GoogleSearch(params)               # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()             # JSON -> Python dict

    if 'error' in result_dict:
        break
    
    for result in result_dict['jobs_results']:
        google_jobs_results.append(result)

    # increment the `start` parameter by 10, which triggers Google Jobs to paginate to the next page
    params['start'] += 10

print(json.dumps(google_jobs_results, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Senior Web Analytics Developer",
    "company_name": "Syndicatebleu",
    "location": "Los Angeles, CA",
    "via": "via ZipRecruiter",
    "description": "It’s a hybrid role based out of Santa Monica, most likely 1-2 days in office per week!nnPay: $41 - $61/hour...nnSenior Web Analytics Developer (TEMP)nnJob Description:nnThis position is responsible for championing and supporting the analytics needs by providing high quality web analytics implementation, analysis and training for clickstream tools like Google Analytics and Google Analytics 4 and Google Tag Manager within a Salesforce Commerce Cloud environmentn• Evaluate the business goals and objectives from multiple eCommerce channels and develop tracking/tagging strategies to allow the teams to measure success of marketing activity.n• Design, develop, configure, and support the web analytics instrumentation architecture including Google Analytics, Google Analytics 4 (GA4), Google Tag Manager, etc. environment and processesn• Work with the marketing and product team to install correct site tags and data layers as well as assist migrating the data out of Tealium onto Google Analytics and Google Analytics 4.n• Serve as the primary lead on the development of implementation documentation our developers that details page code requirements and data layer details to support stated business analysis needs.n• Lead and/or perform quality assurance tests on tracking implementationn• Assist with tracking and improving results for marketing campaignsn• Partner with multiple business units within the Digital Marketing team as well as their Marketing Agenciesn• Conduct training and knowledge transfer to the Digital Marketing teamnnRequirements:n• Minimum 3-4 years on-the-job experience implementing web analytics tools using Google Analyticsn• Minimum 7 years’ experience with web developmentn• Advanced understanding of HTML, CSS, JavaScriptn• Must have strong knowledge in Google Analytics and Google Analytics 4 Data Layeringn• Knowledge of Salesforce Commerce Cloud and Tableaun• Highly skilled in analyzing complex data set for analytics, digital analytics, remarketing, conversion tracking, and custom reportingn• Proven record of data integrity and accuracy of tagging environment by continually monitoring tags, both manually and with automated toolsn• Maintain excellent tagging documentation for all analytics platformsn• Strong English communication skills (written and verbal)n• Comfortable presenting findings to a large audiencennCompany DescriptionThe name Syndicatebleu is synonymous with passion, innovation, and expertise. As one of the top creative staffing agencies in the country, we place highly skilled professionals in direct-hire, temp-to-hire, and contract roles at some of the best companies nationwide across a diverse range of industries. We specialize in staffing creative, digital, marketing, tech, and sales talent, and our recruiters are experts at understanding the unique needs and nuances of the dynamic creative job market. No matter your industry or discipline, let us design your perfect match",
    "job_highlights": [
      {
        "title": "Qualifications",
        "items": [
          "Minimum 3-4 years on-the-job experience implementing web analytics tools using Google Analytics",
          "Minimum 7 years’ experience with web development",
          "Advanced understanding of HTML, CSS, JavaScript",
          "Must have strong knowledge in Google Analytics and Google Analytics 4 Data Layering",
          "Knowledge of Salesforce Commerce Cloud and Tableau",
          "Highly skilled in analyzing complex data set for analytics, digital analytics, remarketing, conversion tracking, and custom reporting",
          "Proven record of data integrity and accuracy of tagging environment by continually monitoring tags, both manually and with automated tools",
          "Maintain excellent tagging documentation for all analytics platforms",
          "Strong English communication skills (written and verbal)",
          "Comfortable presenting findings to a large audience"
        ]
      },
      # ... other highlight sections
    ],
    # ... other fields
  },
  # ... other results
]
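Since the original goal was a pandas DataFrame, the collected list of dictionaries can be converted directly. A minimal sketch, keeping only the columns that match the question (the field names come from the example output above):

import pandas as pd

jobs_df = pd.DataFrame(google_jobs_results)
# each dictionary key becomes a column; select the fields of interest
print(jobs_df[['title', 'company_name', 'location', 'via']].head())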

You can check out the Scrape Google Jobs organic results with Python blog post if you need more code explanation.

Disclaimer: I work for SerpApi.

Answered By: Denis Skopa