How to filter tags without an attribute in the find_all() function in BeautifulSoup?
Question:
Below is the simple HTML source code I'm working with:
<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
Below is my code, which tries to get the <td>Melodie</td> line:
from bs4 import BeautifulSoup

html = 'html text file above'  # placeholder for the HTML source shown above
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('td'):
    print(tag)
    print('----')
# Result:
#===============================================================================
# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Machaela</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Rhoan</td>
# ----
#.........
#===============================================================================
Now I want to get only the <td>name</td> lines, not the lines containing 'span' and 'class'. I tried two filters, soup.find_all('td' and not 'span') and soup.find_all('td', attrs={'class':None}), but neither of them works. I know there are workarounds, but I want to use a filter in soup.find_all().
My expected output (my final goal is actually to get the person's name between <td> and </td>):
# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td>Machaela</td>
# ----
# <td>Rhoan</td>
# ----
Answers:
Select your elements via CSS selectors, e.g. by nesting the pseudo-classes :has() and :not():
soup.select('td:not(:has(span))')
or
soup.select('td:not(:has(.comments))')
Example
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('td:not(:has(span))'):
    print(e)
Output
<td>Name</td>
<td>Comments</td>
<td>Melodie</td>
<td>Machaela</td>
<td>Rhoan</td>
<td>Murrough</td>
<td>Lilygrace</td>
...
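The example above fetches the page over the network. As a self-contained sketch, the same selector can be run against the table from the question pasted inline. This assumes a bs4 version whose select() supports :has() (4.7 or later, where CSS selection is handled by the Soup Sieve package). The variable names cells and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the table from the question, so no network access is needed.
html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <td> cells that have no <span> descendant.
cells = soup.select("td:not(:has(span))")

# get_text() returns the text between <td> and </td>,
# which is the final goal stated in the question.
names = [cell.get_text(strip=True) for cell in cells]
print(names)
```

The first two entries are the header cells (Name, Comments); slicing with names[2:] would drop them if only the people are wanted.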
You can get the desired output with two separate selector calls:
from bs4 import BeautifulSoup
html = """
<body>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
"""
soup = BeautifulSoup(html, "lxml")
for elem in soup.select("td"):
    if not elem.select(".comments"):
        print(elem)
Output:
<td>Name</td>
<td>Comments</td>
<td>Melodie</td>
<td>Machaela</td>
<td>Rhoan</td>
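As an aside, prefer lxml to html.parser: it is faster and more lenient with malformed HTML. Note that lxml is a separate package and has to be installed (for example with pip install lxml), whereas html.parser ships with Python.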
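Since the question asks specifically for a filter inside find_all(), another option is to pass a function: find_all() accepts a callable that receives each tag and returns True for the ones to keep. The following is a minimal sketch of that idea, not the only way to write it; the helper name td_without_span is made up for this example.

```python
from bs4 import BeautifulSoup

# Inline copy of the table from the question.
html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def td_without_span(tag):
    """Hypothetical helper: True for <td> tags with no <span> descendant."""
    return tag.name == "td" and tag.find("span") is None


# The function is called once per tag; only tags returning True are kept.
matches = soup.find_all(td_without_span)
for tag in matches:
    print(tag)
    print("----")
```

This also hints at why the second attempt in the question could not work: the class attribute sits on the <span>, not on the <td>, so a filter on the <td>'s own attributes cannot tell the two kinds of cell apart.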
I know it has been 12 months since the question was posted, but I hope this can help those who will come after us. I have tried and tried to find the most concise code for a beginner like me. Here it is:
import re
from bs4 import BeautifulSoup

# Creating the variables
soup = BeautifulSoup(html, "html.parser")
my_list = list()
# Ask BeautifulSoup for all <td> tags whose string contains at least one letter (a-z, A-Z)
names = soup.find_all("td", string=re.compile("[a-zA-Z]"))
for name in names:
    my_list.append(name)
    print(name)
print(my_list)
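The regex approach works here because a tag's .string looks through a single child tag, so for the comment cells it is the number inside the <span>, which contains no letters. The sketch below makes that visible by printing each cell's .string next to the result of the same regex. The second row, with "n/a" as a comment, is a hypothetical addition that is not in the original page; it is there to show that a comment containing letters would pass the test as well.

```python
import re
from bs4 import BeautifulSoup

# Two rows: one from the question, one hypothetical row with a non-numeric comment.
html = """
<table>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">n/a</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
letters = re.compile("[a-zA-Z]")

results = []
for td in soup.find_all("td"):
    # .string descends into the lone <span>, so its text is what gets tested.
    text = td.string
    results.append((str(text), bool(letters.search(text))))

print(results)
```

So the string filter is concise, but it relies on the comments being purely numeric; the structural checks (no <span> inside the <td>) do not depend on the cell contents.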