Python: BeautifulSoup select_one cannot find the tag

Question:

English is my second language, so please excuse my poor English.

The following is a simple script that fetches tag information using requests and bs4.
The problem is that it returns None.

import requests
from bs4 import BeautifulSoup

url = 'http://ch1.skbroadband.com/content/view?parent_no=24&content_no=57&p_no=154494'

web = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
source = BeautifulSoup(web.text,'html.parser')

source = source.select_one('body > div.wrapper > div.container > div.contentBox > div > div.wrap_content_view > div.content_s')

print(source)

To be specific, if I change the selector to this (div.content_s removed):

'body > div.wrapper > div.container > div.contentBox > div > div.wrap_content_view'

this selector returns:

<div class="content_metadata">
<dl class="dl_meatadata_wrap">
<!--1개-->
<dd class="content-single content-news">        
<div class="content_title">
<h4 class="h4_title">[B tv 서울뉴스] 한강 자전거도로 78km 전면 개선…2024년 완료 목표 </h4>      
<span class="date">2023-03-23 18:01:16</span>   
<button class="btn-share"></button>
<div class="sns_layer" style="">
<ul>
<li><a class="sns_facebook" href="javascript:;" onclick="product.snsShareFacebook();">페이스북  공유</a></li>
<li><a class="sns_twitter" href="javascript:;" onclick="product.snsShareTwitter();">트위터 공유</a></li>
<li style="width:20px;"><a class="sns_naver_blog" href="javascript:;" onclick="product.snsShareNaverblog();"></a></li>
</ul>
</div>
</div>
</dd>
<!-- 2019-05-09: youtube 분기 코드반영 -->      
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="wcd_p" frameborder="0" height="439" src="https://www.youtube.com/embed/UkZcGsiVTWo" width="780"></iframe>
<h5>더 많은 우리동네 B tv 소식은 “ch1.skbroadband.com”에서 보실수 있습니다</h5>
<div class="middle_btn">
<button class="tbtn font-s" id="btn_f_size" style="float:right;margin-top:15px;">+ 크게</button>
</div>
<!--시리즈, 프로그램-->
</dl>
<div style="clear:both;"></div>
</div>
<p>
<div class="content_s">
            [B tv 서울뉴스 김진중 기자]<br/><br/>[기사내용]<br/>안전사고 예방과 이용자 편의를 위해<br/>서울 한강 자전거 도로의 전면 개선이 추진 됩니다. <br/><br/>서울시한강사업본부는 <br/>강· 남북 78km길이의 한강 자전거 도로를 <br/>보행자와 자전거 이용자 안전성을 높이는 방향으로 개선합니다. <br/><br/>또, 자전거 쉼터와 노을 전망대 등의 시설을 확충합니다.<br/><br/>올해에는 11개 한강 공원 중 <br/>강서, 양화, 여의도, 잠실, 잠원 등 <br/>5곳의 자전거도로가<br/><br/>내년에는 반포와 광나루, <br/>난지, 망원과 이촌, 뚝섬의 <br/>자전거 도로 개선 공사가 이뤄집니다.<br/><br/>(김진중 기자ㅣ[email protected])<br/>(영상편집ㅣ이기태  기자)<br/><br/><br/>(2023년 3월 23일 방송분)<br/><br/>▣ B tv 서울뉴스 제보하기<br/>채널ID: 'btv 서울제보' 추가하여 채팅<br/>페이스북: 'SK브로드 밴드 서울방송' 검색하여 메시지 전송<br/>전화: 1670-0035<br/><br/>▣ 뉴스 시간 안내<br/>[뉴스특보 / B tv 서울뉴스]<br/>평일 7시 / 11시 / 15시 / 19시 / 21시 / 23시<br/><br/>[주간종합뉴스]<br/>주 말 7시 / 11시 / 15시 / 19시 / 24시         </div>
</p>
</div>

As you can see above, there is a tag with the class ‘content_s’.
I have no idea why I get ‘None’ when I use the selector from my first snippet.

The page is not rendered with JavaScript, it contains only one ‘content_s’ tag, and of course using select instead of select_one does not work either.
Is there a solution to this kind of problem?

Asked By: mmd

||

Answers:

edit: Since it is not a JavaScript issue, the likely cause is that the desired element sits inside a <p> tag. A <p> element is only supposed to contain inline content, so a <div> inside it is invalid HTML. A browser repairs this by closing the <p> before the <div>, which is presumably why a selector built from the browser's view of the page treats div.content_s as a direct child of div.wrap_content_view. html.parser does not make that repair: it keeps the <div> nested inside the <p>, exactly as in the output you posted. The child combinator (>) only matches direct children, so div.wrap_content_view > div.content_s finds nothing in BeautifulSoup's parse tree.
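A minimal sketch of the nesting problem, using a hypothetical trimmed-down copy of the markup from the question (the HTML string below is an assumption, not the real page). It compares the child combinator with the descendant combinator under html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down version of the structure shown in the question.
html = """
<div class="wrap_content_view">
  <div class="content_metadata">metadata</div>
  <p>
    <div class="content_s">article text</div>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator: div.content_s must be a direct child of div.wrap_content_view.
direct_child = soup.select_one("div.wrap_content_view > div.content_s")

# Descendant combinator: div.content_s may sit at any depth below it.
descendant = soup.select_one("div.wrap_content_view div.content_s")

# The tag that html.parser recorded as the parent of div.content_s.
parent_name = descendant.parent.name

print(direct_child)                      # None
print(parent_name)                       # p
print(descendant.get_text(strip=True))   # article text
```

Replacing the final > with a space in the original selector should therefore be enough, since the rest of the path already matches.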

Let's modify the code to find `div.wrap_content_view` first, then search inside it for `div.content_s`:

import requests
from bs4 import BeautifulSoup

url = 'http://ch1.skbroadband.com/content/view?parent_no=24&content_no=57&p_no=154494'

web = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
source = BeautifulSoup(web.text, 'html.parser')

# Find the div.wrap_content_view first
wrap_content_view = source.select_one('body > div.wrapper > div.container > div.contentBox > div > div.wrap_content_view')

# Then search within wrap_content_view for div.content_s
content_s = wrap_content_view.find('div', class_='content_s')

print(content_s)
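Note that select_one returns None when nothing matches, so calling .find on the result can raise an AttributeError. A small sketch of a guarded version follows; the helper name extract_content_text and the sample HTML are made up for illustration, and the descendant selector is one possible choice rather than the only one:

```python
from bs4 import BeautifulSoup


def extract_content_text(html):
    """Return the text of div.content_s, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Descendant combinator (a space), so the intermediate <p> does not matter.
    node = soup.select_one("div.wrap_content_view div.content_s")
    if node is None:
        return None
    # Join the text fragments between <br/> tags with newlines,
    # stripping surrounding whitespace from each fragment.
    return node.get_text(separator="\n", strip=True)


# Hypothetical sample mimicking the <p><div> nesting from the question.
sample = (
    '<div class="wrap_content_view"><p>'
    '<div class="content_s"> first line<br/><br/>second line </div>'
    "</p></div>"
)

print(extract_content_text(sample))
print(extract_content_text("<div>no match here</div>"))
```

With the code above you would call it as extract_content_text(web.text).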

If the page does turn out to be generated by JavaScript after all, you will need a library such as Selenium, which renders the page with JavaScript before you parse the content with BeautifulSoup.
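Before installing anything, one rough way to tell the two cases apart is to look for the element in the raw response with a loose selector that ignores nesting. The sketch below assumes a hypothetical helper named present_in_static_html and uses two invented HTML strings in place of real responses:

```python
from bs4 import BeautifulSoup


def present_in_static_html(html, selector):
    """Return True if the selector matches the HTML as the server delivered it."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None


# Invented example: the content is already in the server response.
static_page = '<div class="wrap_content_view"><p><div class="content_s">text</div></p></div>'

# Invented example: an empty shell that a script would fill in later.
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(present_in_static_html(static_page, "div.content_s"))  # True
print(present_in_static_html(js_shell, "div.content_s"))     # False
```

If present_in_static_html(web.text, 'div.content_s') is True for the real page, the element is in the static HTML and the selector is the thing to fix; if it is False, JavaScript rendering becomes the more likely explanation.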

First install it with pip install selenium, then add the appropriate WebDriver for the browser you want to use. Instructions are here: https://www.selenium.dev/documentation/en/webdriver/driver_requirements/

Then you can use it like this, for example with Chrome:

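from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'http://ch1.skbroadband.com/content/view?parent_no=24&content_no=57&p_no=154494'

# Set up the Selenium WebDriver (make sure to set the correct path to your WebDriver).
# Selenium 4 takes the driver path through a Service object; the older
# executable_path argument is deprecated and removed in recent releases.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))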

driver.get(url)

# Get the rendered HTML content after JavaScript has run
source = driver.page_source

# Parse the content with BeautifulSoup
soup = BeautifulSoup(source, 'html.parser')

# Find the desired element
element = soup.select_one('body > div.wrapper > div.container > div.contentBox > div > div.wrap_content_view > div.content_s')

print(element)

# Clean up and close the Selenium WebDriver
driver.quit()

Answered By: Saxtheowl