BeautifulSoup: extracting both "td" objects without a class (class_=None or False) and "td" objects with a specific class
Question:
I am trying to scrape a website that has td objects. Some of those have no class, which I can extract with
object.find_all("td", class_=None)
And others have a class called sem_dados, which I can extract using
object.find_all("td", class_="sem_dados")
The main issue is that I can’t do both at the same time. For instance,
object.find_all("td", class_=[None, "sem_dados"])
will not return the td objects that have no class. This seems to be a problem with how None or False behaves inside a list, since
object.find_all("td", class_=[None])
will also return an empty list.
Does anyone know how to change the syntax so I can match both in one call? The order of the extracted elements is important. I could reorder them manually, but I believe there must be a syntax for what I am trying to do.
I tried many different syntaxes, but still couldn’t get anything working.
Answers:
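Maybe you can use a custom lambda function as the class_ filter. It receives the class value, which is None when the tag has no class attribute: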
from bs4 import BeautifulSoup
html_doc = '''
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>'''
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('td', class_=lambda c: not c or 'sem_dados' == c))
Prints:
[<td class="sem_dados">I want this 1</td>, <td>I want this 2</td>]
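Two alternatives, sketched under assumptions rather than taken from the answer above: a tag-level filter function passed to find_all, and a CSS selector passed to select. The helper name wanted_td and the extra "sem_dados extra" cell are hypothetical, added only for illustration. To my understanding the class_ lambda is tried against each class name separately, so a td with class="sem_dados extra" would likely match it too. The tag-level function below compares the whole class list instead, while the selector matches any td that carries sem_dados.

```python
from bs4 import BeautifulSoup

html_doc = """
<table><tr>
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>
<td class="sem_dados extra">two classes</td>
</tr></table>"""

soup = BeautifulSoup(html_doc, "html.parser")


def wanted_td(tag):
    """Hypothetical helper: accept <td> tags that have no class attribute
    or whose class list is exactly ["sem_dados"]."""
    if tag.name != "td":
        return False
    classes = tag.get("class")  # list of class names, or None when absent
    return classes is None or classes == ["sem_dados"]


# Option 1: pass a function that inspects the whole tag.
by_function = [td.get_text() for td in soup.find_all(wanted_td)]

# Option 2: CSS selector list; "td.sem_dados" also matches multi-class cells.
by_selector = [td.get_text() for td in soup.select("td:not([class]), td.sem_dados")]

print(by_function)
print(by_selector)
```

Both calls walk the tree once, so the matches come back in document order and no manual reordering should be needed.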