How to use a regex to extract the text between two braces from a multiline string?

Question:

I am new to Python and am trying to learn regex by example.
In this example I am trying to extract the dictionary parts from a multiline text.
How can I extract the parts between the two braces in the following example?

MWE: How can I get a pandas DataFrame from this data?

import re
import pandas as pd

s = """
[
          {
            specialty: "Anatomic/Clinical Pathology",
            one: " 12,643 ",
            two: " 8,711 ",
            three: " 385 ",
            four: " 520 ",
            five: " 3,027 ",
          },
          {
            specialty: "Nephrology",
            one: " 11,407 ",
            two: " 9,964 ",
            three: " 140 ",
            four: " 316 ",
            five: " 987 ",
          },
          {
            specialty: "Vascular Surgery",
            one: " 3,943 ",
            two: " 3,586 ",
            three: " 48 ",
            four: " 13 ",
            five: " 296 ",
          },
        ]
"""

m = re.match('({.*})', s, flags=re.S)
data = m.groups()
df = pd.DataFrame(data)
Asked By: dallascow


Answers:

I suggest adding double quotes around the keys, then parsing the string into a list of dictionaries, and finally reading that structure into a pandas DataFrame using pd.DataFrame.from_dict:

import pandas as pd
from ast import literal_eval
import re

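s = "YOUR STRING HERE"
# Quote each bare key at the start of a line: key: becomes "key":
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)
# literal_eval parses the resulting list-of-dicts literal
df = pd.DataFrame.from_dict(literal_eval(fixed_s))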

The ^(\s*)(\w+): regex matches zero or more whitespace characters at the start of any line (flags=re.M makes ^ match at the start of every line, not only at the start of the string) and captures them into Group 1. It then matches one or more word characters, captured into Group 2, followed by a :. The match is replaced with Group 1 + " + Group 2 + ":, which wraps the key in double quotes and keeps the indentation.
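To see the substitution in isolation, here is a minimal sketch on a two-line sample. The sample string and the variable names are made up for illustration; only the pattern and the replacement come from the code above.

```python
import re

# Hypothetical two-line sample in the same shape as the question's data.
sample = '  specialty: "Nephrology",\n  one: " 11,407 ",'

# Group 1 keeps the indentation, group 2 is the bare key.
# re.M lets ^ match at the start of each line.
fixed = re.sub(r'^(\s*)(\w+):', r'\1"\2":', sample, flags=re.M)

print(fixed)
```

Both keys come out wrapped in double quotes, while the values and the leading spaces are left untouched.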

The result is cast to a list of dictionaries using ast.literal_eval.
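As a small illustration of this step, the sketch below parses an already-fixed string. The shortened input is an assumption made for brevity; the real text has more keys per entry. Trailing commas inside the list and the dictionaries are valid Python literal syntax, so they need no extra cleanup.

```python
from ast import literal_eval

# Hypothetical input: the text after the keys have been quoted.
fixed_s = '''
[
  {
    "specialty": "Nephrology",
    "one": " 11,407 ",
  },
]
'''

# literal_eval accepts only Python literals (lists, dicts, strings, numbers...),
# so unlike eval it will not run arbitrary code found in the text.
records = literal_eval(fixed_s.strip())

print(records)
```

Note that the values keep their surrounding spaces; they are still strings such as " 11,407 " at this point.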

Then, the list is used to initialize the dataframe.
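If you would rather pull out the brace-delimited blocks with a regex directly, as the question attempts, the following is an alternative sketch and not the approach described above. The call re.match('({.*})', s, flags=re.S) returns None because re.match only matches at the very start of the string, and s starts with a newline and a [. Even with re.search, the greedy .* would span from the first { to the last }. The shortened data and the names blocks and records are assumptions for illustration.

```python
import re

# Shortened, hypothetical version of the question's data.
s = """
[
  {
    specialty: "Nephrology",
    one: " 11,407 ",
  },
  {
    specialty: "Vascular Surgery",
    one: " 3,943 ",
  },
]
"""

# Non-greedy .*? stops at the first closing brace; re.S lets . span newlines.
blocks = re.findall(r"\{(.*?)\}", s, flags=re.S)

# Each block holds key: "value" pairs; strip the padding around each value.
records = [
    {key: value.strip() for key, value in re.findall(r'(\w+):\s*"([^"]*)"', block)}
    for block in blocks
]

print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame(records) in the same way. This sketch assumes that no value contains a double quote or a closing brace; if that can happen, the quoting approach with literal_eval is the safer choice.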

Answered By: Wiktor Stribiżew