Web scraping a p tag without a class using Bs4 and Selenium
Question:
I’m trying to web scrape this ->
The HTML has a div tag with a class. Inside this div there is another div, and inside that there is a p tag with no class. My goal is to get specifically that lone p tag without a class and extract its text.
So far this is my code ->
I did not include some imports and other parts of my code.
html = driver.page_source
time.sleep(.1)
soup = bs.BeautifulSoup(html, 'lxml')
time.sleep(.1)
Class_Details = soup.find_all("div", {"class":"row-fluid data_row primary-row class-info class-not-checked"})
for class_detail in Class_Details:
    Class_status = class_detail.find_all("div", {"class":"statusColumn"})
    Status = Class_status[0].text
    class_date = class_detail.find_all("p",{"class":"hide-above-small beforeCollapseShow"})
    class_time = class_date[0].text
The 4 lines above can be ignored; they work and accomplish their tasks. The lines below, which are still inside the loop, do not work, and they are what I am asking about.
    cla = class_detail.find_all("p",{"class":"timeColumn"})
    print(cla)
The output of print(cla) is:
[]
[]
[]
[]
[]
[]
[]
The good thing is that there are 7 empty lists, which matches the number of entries on the website, so the code is clearly detecting the part I am scraping. However, I need the output to be text.
I hope I have been clear in my question and thank you for your time.
Answers:
Your output does not show text because you are printing elements, not the elements' text. You should change your code to the following:
cla = class_detail.find_all("p",{"class":"timeColumn"})
for item in cla:
    print(item.text)
I know you are using BeautifulSoup, but I will also provide a solution using Selenium / XPath in case you do not find a BS implementation to your liking:
elements_list = driver.find_elements_by_xpath("//div[@class='timeColumn']/p")
for element in elements_list:
    print(element.text)
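The XPath idea can be tried without a browser. What follows is only a sketch under assumptions: the markup is hypothetical (the real page is not shown in the question), and it uses the standard library's xml.etree.ElementTree rather than Selenium. ElementTree supports only a subset of XPath, so the "no class attribute" check is done in Python; with a full XPath engine such as the one Selenium uses, the equivalent expression would be //div[@class='timeColumn']//p[not(@class)].

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the nesting is an assumption based on
# the question's description (div with a class > div > p without a class).
SAMPLE_XML = """
<div class="class-info">
  <div class="timeColumn">
    <div id="days_data_1">
      <p class="hide-above-small beforeCollapseShow">F</p>
      <p>7:45am-10:50am</p>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE_XML)

# "/p" only matches p elements that are direct children of div.timeColumn.
# In this markup the p tags sit one level deeper, so nothing is found.
direct_children = root.findall(".//div[@class='timeColumn']/p")

# "//p" matches p elements at any depth below div.timeColumn.
all_p = root.findall(".//div[@class='timeColumn']//p")

# Keep only the p elements that carry no class attribute.
classless = [p for p in all_p if p.get("class") is None]
texts = [p.text.strip() for p in classless]

print(direct_children)
print(texts)
```

If the p tag is nested inside an inner div, as the question describes, the descendant axis (//p) is needed rather than the child axis (/p).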
To get a p tag without a class, use a CSS selector for p combined with the negation pseudo-class :not().
Here, the CSS selector could be .timeColumn p:not([class]):
# select_one to get first one
p_no_class = class_detail.select_one(".timeColumn p:not([class])").text
print(p_no_class)
# select to get all
all_p_no_class = class_detail.select(".timeColumn p:not([class])")
for p in all_p_no_class:
    print(p.text)
See also CSS selector for not having classes.
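Here is a small self-contained sketch of that selector. The HTML is hypothetical, pieced together from the class names used in this thread, since the real page is not shown. It assumes that timeColumn is a class on a div rather than on a p, which would explain the empty lists in the question. It uses the built-in "html.parser" instead of lxml only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption based on the selectors in this thread.
SAMPLE_HTML = """
<div class="row-fluid data_row primary-row class-info class-not-checked">
  <div class="statusColumn">Open</div>
  <div class="timeColumn">
    <div id="days_data_1">
      <p class="hide-above-small beforeCollapseShow">F</p>
      <p>7:45am-10:50am</p>
    </div>
  </div>
</div>
<div class="row-fluid data_row primary-row class-info class-not-checked">
  <div class="statusColumn">Closed</div>
  <div class="timeColumn">
    <div id="days_data_2">
      <p class="hide-above-small beforeCollapseShow">TBA</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

times = []
wrong_matches = []
for row in soup.select("div.class-info"):
    # No p tag carries the class "timeColumn" in this markup, so this is empty.
    wrong_matches.append(row.find_all("p", {"class": "timeColumn"}))

    # A p below .timeColumn that has no class attribute at all.
    p = row.select_one(".timeColumn p:not([class])")

    # select_one returns None when nothing matches, so guard before reading text.
    times.append(p.get_text(strip=True) if p is not None else None)

print(times)
```

The second row has no class-less p, so it yields None instead of raising an AttributeError, which is what calling .text directly on the result of select_one would do.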
The desired element is rendered by JavaScript, so to extract the text 7:45am-10:50am from it you have to induce WebDriverWait for visibility_of_element_located(). You can use the following locator strategy:
Using XPATH:
print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='timeColumn']/div[contains(@id, 'days_data')]/p/a[@class='popover-bottom' and text()='F']//following::p[1]"))).text)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC