How to use regex to parse multiline text between two braces?
Question:
I am new to Python and trying to learn regex by example.
In this example I am trying to extract the dictionary parts from a multiline text.
How can I extract the parts between the two braces in the following example?
MWE: How can I get a pandas dataframe from this data?
import re
import pandas as pd
s = """
[
{
specialty: "Anatomic/Clinical Pathology",
one: " 12,643 ",
two: " 8,711 ",
three: " 385 ",
four: " 520 ",
five: " 3,027 ",
},
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
three: " 140 ",
four: " 316 ",
five: " 987 ",
},
{
specialty: "Vascular Surgery",
one: " 3,943 ",
two: " 3,586 ",
three: " 48 ",
four: " 13 ",
five: " 296 ",
},
]
"""
m = re.match('({.*})', s, flags=re.S)
data = m.groups()
df = pd.DataFrame(data)
Answers:
I suggest adding double quotes around the keys, then casting the string to a list of dictionaries, and then reading that structure into a pandas dataframe using pd.DataFrame.from_dict:
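import re
from ast import literal_eval
import pandas as pd

s = "YOUR STRING HERE"
# Wrap each bare key at the start of a line in double quotes
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)
# Parse the now-valid literal into a list of dicts and load it
df = pd.DataFrame.from_dict(literal_eval(fixed_s))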
The ^(\s*)(\w+): regex matches zero or more whitespace characters at the start of any line (the flags=re.M option makes ^ match at the start of every line), capturing them into Group 1, then matches one or more word characters, capturing them into Group 2, and then matches a colon. The match is replaced with Group 1 + " + Group 2 + ": (the \1"\2": replacement pattern).
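To see what the substitution does in isolation, here is a minimal sketch that runs it on a single record. The `record` and `quoted` names are only illustrative, and the indentation is added on purpose to show that Group 1 preserves it.

```python
import re

# One record in the question's format: unquoted keys, quoted values.
record = '''{
    specialty: "Nephrology",
    one: " 11,407 ",
}'''

# With re.M, ^ anchors at every line start. Group 1 keeps the indentation,
# Group 2 is the bare key, and the replacement wraps the key in double quotes.
quoted = re.sub(r'^(\s*)(\w+):', r'\1"\2":', record, flags=re.M)
print(quoted)
```

Lines that do not start with a word followed by a colon, such as the lone `{` and `}` lines, are left untouched, and the values are not affected because only the start of each line is examined.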
The result is cast to a list of dictionaries using ast.literal_eval.
Then, the list is used to initialize the dataframe.
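The whole approach can be sketched end to end on a shortened copy of the question's data (two records, three fields). The `strip()` call and the numeric clean-up loop are my own additions rather than part of the answer above, and `records` is just an illustrative name.

```python
import re
from ast import literal_eval

import pandas as pd

# Shortened sample taken from the question.
s = """
[
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
},
{
specialty: "Vascular Surgery",
one: " 3,943 ",
two: " 3,586 ",
},
]
"""

# Quote the bare keys so the text becomes a valid Python literal.
fixed_s = re.sub(r'^(\s*)(\w+):', r'\1"\2":', s, flags=re.M)

# strip() removes the surrounding newlines before parsing; the trailing
# commas inside the list and the dicts are legal in Python literals.
records = literal_eval(fixed_s.strip())

df = pd.DataFrame(records)

# Optional clean-up (an assumption about what is wanted, not part of the
# answer): turn strings such as " 11,407 " into the integer 11407.
for col in df.columns.drop("specialty"):
    df[col] = df[col].str.replace(",", "").str.strip().astype(int)

print(df)
```

As a side note on the attempt in the question: re.match only matches at the very beginning of the string, and s starts with a newline followed by [, so m is None and m.groups() raises an AttributeError. Even with re.search, the greedy {.*} under re.S would capture everything from the first { to the last } as a single string rather than one block per record.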