Extracting an attribute value with beautifulsoup
Question:
I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
Even though, from the Beautifulsoup documentation, I understand that strings should not be a problem here… but I am no specialist, and I may have misunderstood.
Any suggestion is greatly appreciated!
Answers:
.find_all()
returns list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag
is a list (probably containing only one element). Depending on what you want exactly you either should do:
output = input_tag[0]['value']
or use .find()
method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
I would actually suggest you a time saving way to go with this assuming that you know what kind of tags have those attributes.
suppose say a tag xyz has that attritube named “staininfo”..
full_tag = soup.findAll("xyz")
And i wan’t you to understand that full_tag is a list
for each_tag in full_tag:
staininfo_attrb_value = each_tag["staininfo"]
print staininfo_attrb_value
Thus you can get all the attrb values of staininfo for all the tags xyz
If you want to retrieve multiple values of attributes from the source above, you can use findAll
and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["stainfo"] for x in inputTags]
print output
### This will print a list of the values.
In Python 3.x
, simply use get(attr_name)
on your tag object that you get using find_all
:
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
xmlData = xmlFile.read()
xmlDecoded = xmlData
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
print("Processing repElem...")
repElemID = repElem.get('id')
repElemName = repElem.get('name')
print("Attribute id = %s" % repElemID)
print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml
that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
you can also use this :
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
get_val = val["value"]
print(get_val)
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
if td.has_attr('class'):
print(td['class'][0])
Its important to note that the attribute key retrieves a list even when the attribute has only a single value.
For me:
<input id="color" value="Blue"/>
This can be fetched by below snippet.
page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
You could try to use the new powerful package called requests_html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date) # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'
You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup
using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'}) # Find all the input tags
if inputs:
if type(inputs) is list:
for input in inputs:
print(input.attr.get('value'))
else:
print(inputs.attr.get('value'))
else:
print('No <input> tag found with the attribute name="stainfo")
Here is an example for how to extract the href
attrbiutes of all a
tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
# print(href.get("href"))
links = href.get("href")
all_hrefs.append(links)
print(all_hrefs)
I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
Even though, from the Beautifulsoup documentation, I understand that strings should not be a problem here… but I am no specialist, and I may have misunderstood.
Any suggestion is greatly appreciated!
.find_all()
returns list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag
is a list (probably containing only one element). Depending on what you want exactly you either should do:
output = input_tag[0]['value']
or use .find()
method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
I would actually suggest you a time saving way to go with this assuming that you know what kind of tags have those attributes.
suppose say a tag xyz has that attritube named “staininfo”..
full_tag = soup.findAll("xyz")
And i wan’t you to understand that full_tag is a list
for each_tag in full_tag:
staininfo_attrb_value = each_tag["staininfo"]
print staininfo_attrb_value
Thus you can get all the attrb values of staininfo for all the tags xyz
If you want to retrieve multiple values of attributes from the source above, you can use findAll
and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["stainfo"] for x in inputTags]
print output
### This will print a list of the values.
In Python 3.x
, simply use get(attr_name)
on your tag object that you get using find_all
:
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
xmlData = xmlFile.read()
xmlDecoded = xmlData
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
print("Processing repElem...")
repElemID = repElem.get('id')
repElemName = repElem.get('name')
print("Attribute id = %s" % repElemID)
print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml
that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
you can also use this :
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
get_val = val["value"]
print(get_val)
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
if td.has_attr('class'):
print(td['class'][0])
Its important to note that the attribute key retrieves a list even when the attribute has only a single value.
For me:
<input id="color" value="Blue"/>
This can be fetched by below snippet.
page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
You could try to use the new powerful package called requests_html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date) # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'
You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup
using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'}) # Find all the input tags
if inputs:
if type(inputs) is list:
for input in inputs:
print(input.attr.get('value'))
else:
print(inputs.attr.get('value'))
else:
print('No <input> tag found with the attribute name="stainfo")
Here is an example for how to extract the href
attrbiutes of all a
tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
# print(href.get("href"))
links = href.get("href")
all_hrefs.append(links)
print(all_hrefs)