Convert BeautifulSoup tag to JSON
Question:
Part of the webpage HTML (a bs4.element.Tag) looks like this:
<script>
var id46 = {"columns": [{"id":"column0","label":"date","type":"string","role":"domain","pattern":""},{"id":"column1","label":"","type":"number","role":"data","pattern":"#0.###############"}],"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}],"p":{}}
;
</script>
I want to extract its content and convert it to JSON. I've tried both of the following, but neither works:
jsonData = json.loads(sss.attrs["var id46"])
jsonData = json.loads(sss.text)
What’s the right way to convert it to JSON?
Answers:
Try this, although it's not clear it needs to be done this way.
If I assign your example to data and parse it as follows, you'll see that .attrs is an empty dictionary {}, both on the soup and on the script tag.
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
soup.attrs                 # {}
soup.find('script').attrs  # {} (the <script> tag has no attributes)
But if I pull out the text as a string, I can isolate the dictionary literal by itself:
import re

s = soup.find(string=re.compile('columns'))
s_dict = s.split('var id46 = ')[1].strip().rstrip(';').strip()
Now, when I tried json.loads(s_dict), it gave me an error, and so did eval(s_dict). There seems to be an unmatched bracket ].
Hopefully you just copied and pasted it incorrectly, but if you didn't, you can still use string methods to pull out the data you are looking for. I've used this method on some of my scrapes, so I know it can work.
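A minimal sketch of the string-method approach, assuming the script text has already been pulled out of the tag (for example with soup.find('script').string). The helper name extract_js_object and the shortened sample strings are made up for illustration. It strips only the trailing semicolon and lets json.loads report where malformed input stops being valid:

```python
import json


def extract_js_object(script_text, var_name):
    """Parse the value assigned by `var <var_name> = ...;` in script_text.

    Hypothetical helper: assumes one assignment whose right-hand side
    is valid JSON.
    """
    marker = "var " + var_name + " = "
    _, found, rest = script_text.partition(marker)
    if not found:
        raise ValueError("assignment to %r not found" % var_name)
    # Remove surrounding whitespace and the trailing semicolon only,
    # so semicolons inside string values are left untouched.
    payload = rest.strip().rstrip(";").strip()
    return json.loads(payload)


# Well-formed input (shortened sample data, not the real page).
good = 'var id46 = {"rows": [{"c": [{"v": "August-12", "f": null}, {"v": 7, "f": null}]}]}\n;\n'
data = extract_js_object(good, "id46")
print(data["rows"][0]["c"][0]["v"])  # August-12

# Same shape as the snippet in the question: a stray "]" after the
# object closes makes the text invalid JSON.
bad = 'var id46 = {"rows": [{"c": [{"v": 7, "f": null}]}]}],"p":{}}\n;\n'
try:
    extract_js_object(bad, "id46")
except json.JSONDecodeError as err:
    print("invalid JSON at character", err.pos)
```

If the error position points at a stray bracket, as it does for the snippet in the question, the source text itself is malformed and has to be corrected (or re-copied) before it can be parsed.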
You can try using a regular expression, although I suspect the value of var id46 as posted is not a valid dictionary object. With the brackets balanced, the script looks like this:
<script type="text/javascript">
var id46 = {"columns": [{"id":"column0","label":"date","type":"string","role":"domain","pattern":""},{"id":"column1","label":"","type":"number","role":"data","pattern":"#0.###############"}],"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}
</script>
Find your script tag and apply re to extract the dictionary object.
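import json
import re

script = soup.find('script')
# Greedy match from the first "{" to the last "}"; DOTALL lets it span line breaks.
pattern = re.compile(r'(\{.+\})', re.DOTALL)
result = pattern.findall(str(script))
jsonData = json.loads(result[0])
print(jsonData)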
Your jsonData should look like this (shown pretty-printed). Hopefully this helps.
{'columns': [{'id': 'column0',
              'label': 'date',
              'pattern': '',
              'role': 'domain',
              'type': 'string'},
             {'id': 'column1',
              'label': '',
              'pattern': '#0.###############',
              'role': 'data',
              'type': 'number'}],
 'rows': [{'c': [{'f': None, 'v': 'August-12'}, {'f': None, 'v': 7}]},
          {'c': [{'f': None, 'v': 'September-12'}, {'f': None, 'v': 10}]}]}
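As an alternative to the regular expression, here is a rough sketch that lets the json module find the end of the object itself, using json.JSONDecoder.raw_decode. This swaps in a different technique from the answer above. The function name parse_first_object and the shortened sample text are assumptions for illustration, and the script text is assumed to have been extracted from the tag already:

```python
import json


def parse_first_object(text):
    """Parse the first JSON object in text and return (object, remainder).

    Hypothetical helper: assumes the first "{" in text starts the object.
    """
    start = text.index("{")  # raises ValueError if there is no "{"
    # raw_decode parses one JSON value starting at `start` and returns
    # the value together with the index just past its end.
    obj, end = json.JSONDecoder().raw_decode(text, start)
    return obj, text[end:]


# Shortened sample data, not the real page.
script_text = 'var id46 = {"columns": [{"id": "column0", "label": "date"}], "rows": []}\n;\n'
obj, leftover = parse_first_object(script_text)
print(obj["columns"][0]["label"])  # date
print(repr(leftover.strip()))      # ';'
```

Because parsing stops at the end of the first complete object, the trailing semicolon never reaches the parser. It is worth checking the remainder: anything other than whitespace and a semicolon suggests the source text has extra or unbalanced brackets.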