Why does element.txt return None when I try to get the HTML content?
Question:
The goal of the program is simply to get the headlines from tagesschau.de.
It seemed to work at first, but after a few runs it no longer returns anything.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                  'AppleWebKit/537.36 (KHTML, like Gecko)'
                  'Chrome/86.0.4240.111 Safari/537.36',
    'Host': 'www.tagesschau.de',
    'Referer': 'https://www.tagesschau.de/'
}

# get and parse the HTML of tagesschau.de
URL = 'https://www.tagesschau.de/'
html = requests.get(URL, headers=headers)
html_parse = BeautifulSoup(html.content, 'lxml')

# find all headlines on the homepage
elements = html_parse.find_all('h4', {'class': 'headline'})
for element in elements:
    print(element.txt)
It prints nothing but None:
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
But when I print element instead of element.txt, I get the expected output:
<h4 class="headline"><a href="/multimedia/livestreams/livestream3/">Live: tagesschau24</a></h4>
<h4 class="headline"><a href="/100sekunden/">100 Sekunden</a></h4>
<h4 class="headline"><a href="/multimedia/sendung/ts-39833.html">tagesschau 20 Uhr</a></h4>
<h4 class="headline"><a href="/multimedia/sendung/ts-39841.html">Letzte Sendung</a></h4>
<h4 class="headline">++ Fauci warnt vor "einer Menge Leid" ++</h4>
<h4 class="headline">Weniger Party, mehr Wellness</h4>
<h4 class="headline">November-Lockdown kostet 19 Milliarden</h4>
This confuses me. Why does it happen?
Answers:
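element.txt is not a text accessor. Beautiful Soup treats an unknown attribute name as a search for a child tag with that name, so element.txt is shorthand for element.find('txt') and returns None because no <txt> tag exists. If you want the inner text of an element, use .text:
for element in elements:
    print(element.text)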
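For the inner HTML, use .decode_contents() (there is no .html attribute; element.html would again be a child-tag lookup and return None):
for element in elements:
    print(element.decode_contents())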
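The difference between the accessors can be checked without hitting the site at all. The following is a minimal sketch, assuming two headline tags copied from the output in the question and Python's built-in html.parser instead of lxml; SAMPLE_HTML and describe are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question (assumption:
# the live page still uses <h4 class="headline"> for its headlines).
SAMPLE_HTML = (
    '<h4 class="headline"><a href="/100sekunden/">100 Sekunden</a></h4>'
    '<h4 class="headline">Weniger Party, mehr Wellness</h4>'
)

# html.parser ships with Python, so no extra parser has to be installed.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
elements = soup.find_all('h4', {'class': 'headline'})


def describe(element):
    """Return (txt lookup, inner text, inner HTML) for one tag."""
    return (
        element.txt,                   # child-tag lookup, same as element.find('txt')
        element.get_text(strip=True),  # inner text, like .text with whitespace trimmed
        element.decode_contents(),     # inner HTML as a string
    )


results = [describe(element) for element in elements]
for row in results:
    print(row)
```

The first item of every tuple is None, which matches the column of None values in the question, while the second and third items hold the headline text and its inner markup.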