Convert BeautifulSoup tag to JSON

Question:

Part of the webpage's HTML, held in a <class 'bs4.element.Tag'> object, looks like this:

<script>

                    var id46 = {"columns":    [{"id":"column0","label":"date","type":"string","role":"domain","pattern":""},{"id":"column1","label":"","type":"number","role":"data","pattern":"#0.###############"}],"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}],"p":{}}

;

                  </script>

I want to extract its content and convert it to JSON. I’ve tried both of the following, but neither works:

jsonData = json.loads(sss.attrs["var id46"])
jsonData = json.loads(sss.text)

What’s the right way to convert it to JSON?

Asked By: Mark K


Answers:

Here is one approach, though there may well be a simpler one.

If I assign your example to data and parse it as shown here, .attrs turns out to be an empty dictionary {}, because the script tag carries no HTML attributes.

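from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
soup.attrs                  # {} for the parsed document as a whole
soup.find('script').attrs   # {} because <script> has no attributes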

But if I pull out the script text as a string, I can isolate the dictionary literal:

import re

s = soup.find(string=re.compile('columns'))
s_dict = s.split('var id46 = ')[1].strip().rstrip(';').rstrip()  # drop only the trailing ';'

When I tried json.loads(s_dict), it raised an error, and so did eval(s_dict). There seems to be an unmatched bracket ] in the data.
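To see exactly where parsing stops, one option is to catch the exception and inspect it. The sketch below assumes s_dict holds the object text from the question verbatim, with the "var id46 = " prefix and the trailing ";" already removed. json.JSONDecodeError and its msg and pos attributes come from the standard library; the variable names are only illustrative.

```python
import json

# Object text copied from the question, without "var id46 = " and ";"
s_dict = (
    '{"columns":    [{"id":"column0","label":"date","type":"string",'
    '"role":"domain","pattern":""},{"id":"column1","label":"",'
    '"type":"number","role":"data","pattern":"#0.###############"}],'
    '"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},'
    '{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}],"p":{}}'
)

error = None
try:
    json.loads(s_dict)
except json.JSONDecodeError as exc:
    error = exc

if error is not None:
    # pos is the index at which the decoder stopped accepting input
    print(error.msg, "at index", error.pos)
    print("remaining text:", s_dict[error.pos:])
```

The text from error.pos onward is the part the decoder could not fit into a single JSON value, which helps tell a copy-and-paste slip from a genuinely malformed page.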

Hopefully you just copied and pasted incorrectly. If not, you can use string methods to pull out the data you are looking for. I’ve used this method on some of my scrapes, so I know it can work.
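If the leading object is all that is needed, one string-level option is json.JSONDecoder.raw_decode. It parses a single JSON value starting at a given index and reports where that value ended, rather than rejecting trailing text. This is a minimal sketch, assuming script_text stands in for the script tag's contents exactly as posted in the question. It discards everything after the first complete object (here the stray `],"p":{}}`), which may or may not be acceptable for the real page.

```python
import json

# Stand-in for the <script> contents in the question (assumed verbatim)
script_text = (
    'var id46 = {"columns":    [{"id":"column0","label":"date",'
    '"type":"string","role":"domain","pattern":""},{"id":"column1",'
    '"label":"","type":"number","role":"data",'
    '"pattern":"#0.###############"}],'
    '"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},'
    '{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}],"p":{}}'
    '\n\n;\n'
)

# raw_decode does not skip leading text, so start at the first '{'
start = script_text.index("{")
data, end = json.JSONDecoder().raw_decode(script_text, start)

# Whatever follows the first complete JSON value is left unparsed
leftover = script_text[end:].strip()

print(data["rows"][0]["c"][0]["v"])
print(repr(leftover))
```

Checking leftover before trusting data is a cheap way to notice when part of the input was ignored.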

Answered By: Jonathan Leon

You can try a regular expression, although I suspect var id46 is not a proper dictionary object as posted.

<script type="text/javascript">
    var id46 = {"columns": [{"id":"column0","label":"date","type":"string","role":"domain","pattern":""},{"id":"column1","label":"","type":"number","role":"data","pattern":"#0.###############"}],"rows":[{"c":[{"v":"August-12","f":null},{"v":7,"f":null}]},{"c":[{"v":"September-12","f":null},{"v":10,"f":null}]}]}
</script>

Find your script tag and apply re to extract the dictionary object.

import json
import re

script = soup.find('script')

# Greedy match from the first '{' to the last '}' on the same line
pattern = re.compile(r'(\{.+\})')
result = pattern.findall(str(script))

jsonData = json.loads(result[0])

print(jsonData)

Your jsonData should look like this. Hopefully this helps.


{'columns': [{'id': 'column0',
              'label': 'date',
              'pattern': '',
              'role': 'domain',
              'type': 'string'},
             {'id': 'column1',
              'label': '',
              'pattern': '#0.###############',
              'role': 'data',
              'type': 'number'}],
 'rows': [{'c': [{'f': None, 'v': 'August-12'}, {'f': None, 'v': 7}]},
          {'c': [{'f': None, 'v': 'September-12'}, {'f': None, 'v': 10}]}]}
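One caveat with the pattern above: `.` does not match newlines by default, so an object literal that spans several lines would not be captured. Here is a sketch of a variant, assuming the script text is already available as a string and the data is well formed. The sample is an abbreviated, reformatted version of the data in the question, not the original page, and the names script_text, found and json_data are illustrative.

```python
import json
import re

# Abbreviated, multi-line stand-in for the script contents (hypothetical layout)
script_text = """
    var id46 = {"columns": [{"id": "column0", "label": "date"}],
                "rows": [{"c": [{"v": "August-12", "f": null},
                                {"v": 7, "f": null}]}]}
;
"""

# re.DOTALL lets '.' cross line breaks; anchoring on the variable name
# avoids picking up braces from unrelated code in the same script
pattern = re.compile(r"var\s+id46\s*=\s*(\{.*\})", re.DOTALL)
found = pattern.search(script_text)
if found is None:
    raise ValueError("var id46 assignment not found")

json_data = json.loads(found.group(1))
print(json_data["rows"][0]["c"][1]["v"])
```

Because the match is greedy, it runs to the last `}` in the text. That suits a script holding a single assignment but would over-capture if more braces follow.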
Answered By: Suleiman