Parse JavaScript array with empty elements using bs4
Question:
I am trying to parse this JavaScript element using BS4.
I want to get that input array into a usable format.
<script type="text/javascript">
require.config.params['matchheader'] = {
input: [162,13,'Crystal Palace','Arsenal','05/08/2022 20:00:00','05/08/2022 00:00:00',6,'FT','0 : 1','0 : 2',,,'0 : 2','England','England']
,
matchId: 1640674
};
</script>
To get the text inside the global variable, I used the following regex:
re.search(r"input: \[.*?\]", script_element.string).group(0)
which returns:
"input: [162,13,'Crystal Palace','Arsenal','05/08/2022 20:00:00','05/08/2022 00:00:00',6,'FT','0 : 1','0 : 2',,,'0 : 2','England','England']"
I am having some trouble parsing this array because of the empty elements (literal_eval
does not work).
Any idea on how to accomplish this? Is there an easier way to do it?
Regards
Answers:
One solution is to insert None between the consecutive commas that mark the empty elements, and then parse the result:
import re
from ast import literal_eval
data = re.search(r"input:\s*(.*)", s).group(1) # <-- `s` is your string from the question
data = re.sub(r"(?<=,)\s*(?=,)", "None", data)
data = literal_eval(data)
print(data)
Prints:
[162, 13, 'Crystal Palace', 'Arsenal', '05/08/2022 20:00:00', '05/08/2022 00:00:00', 6, 'FT', '0 : 1', '0 : 2', None, None, '0 : 2', 'England', 'England']
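The same idea can be wrapped in a small helper that takes the script text directly. This is only a sketch: `parse_input_array` and `script_text` are hypothetical names, with `script_text` standing in for whatever `script_element.string` returns. It also assumes that no string inside the array contains a `]` or two adjacent commas.

```python
import re
from ast import literal_eval

# Hypothetical stand-in for `script_element.string` from the question.
script_text = """
require.config.params['matchheader'] = {
input: [162,13,'Crystal Palace','Arsenal','05/08/2022 20:00:00','05/08/2022 00:00:00',6,'FT','0 : 1','0 : 2',,,'0 : 2','England','England']
,
matchId: 1640674
};
"""


def parse_input_array(text):
    """Return the `input` array as a Python list, or None if it is not found."""
    # Capture from the opening bracket to the first closing bracket.
    match = re.search(r"input:\s*(\[.*?\])", text, flags=re.DOTALL)
    if match is None:
        return None
    array_src = match.group(1)
    # Fill every empty slot between two commas with None so that
    # literal_eval sees a valid Python list literal.
    array_src = re.sub(r"(?<=,)\s*(?=,)", "None", array_src)
    return literal_eval(array_src)


data = parse_input_array(script_text)
print(data[2], data[10], len(data))
```

Returning None when the pattern is missing avoids the AttributeError that `.group(1)` raises on a failed search. An empty slot directly after the opening bracket is not handled by this substitution.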
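You could also do some string manipulation and call .split(',') on the string to create a list. Note that every element comes back as a string (quoted values keep their quotes and empty elements become ''), and a value that itself contains a comma would be split incorrectly.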
import re
var_to_parse = """
<script type="text/javascript">
require.config.params['matchheader'] = {
input: [162,13,'Crystal Palace','Arsenal','05/08/2022 20:00:00','05/08/2022 00:00:00',6,'FT','0 : 1','0 : 2',,,'0 : 2','England','England']
,
matchId: 1640674
};
</script>
"""
parse1 = re.search(r"input: \[.*?\]", var_to_parse).group(0)
parse2 = parse1.split("input: ")[1]
parse3 = parse2[1:-1]
parse4 = parse3.split(',')
print("parse4:", parse4)
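Because .split(',') leaves every element as a raw string, a small post-processing step may be useful. The sketch below uses a hypothetical helper, `clean_token`, and a shortened sample string rather than the full array. It assumes values are either integers, single-quoted strings, or empty.

```python
def clean_token(token):
    """Convert one raw token from .split(',') into None, int, or str."""
    token = token.strip()
    if token == "":
        return None  # empty slot in the JavaScript array
    if len(token) >= 2 and token[0] == token[-1] == "'":
        return token[1:-1]  # single-quoted JavaScript string
    try:
        return int(token)
    except ValueError:
        return token  # leave anything unrecognised untouched


# Shortened sample in the same shape as the array from the question.
raw = "162,13,'Crystal Palace',,,'0 : 2'".split(",")
cleaned = [clean_token(t) for t in raw]
print(cleaned)
```

This still breaks if a quoted value contains a comma, so it is only suitable when the data is known not to.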