How to use REGEX in Pandas column to get a substring in the middle of a string

Question:

I need to extract chemical attributes from a Pandas column that was read in as a JSON-like string and store them as values in a dictionary. I eliminated most of the special characters, and I want to use a regular expression to fill 3 dictionaries that map casnumbers to chemical properties (for instance, I want a dictionary pair with the casnumber 1 as the key and 153 °C @ Press 12 Torr as the value).

This is the code I currently have. It correctly finds the casnumbers whose strings contain "Boiling Point", "Melting Point" and "Density", but I am messing up somewhere with the regex call that should get the string between ‘Boiling Point property’ and ‘sourceNumber’.

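import pandas as pd
import re as regex  # assumed alias; the third-party "regex" module offers the same match/search calls

data = [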
'''{name Boiling Point property 153 °C @ Press 12 Torr sourceNumber 1} {name Density property 0.9211 g/cm<sup>3</sup> @ Temp 20 °C sourceNumber 1}''',
'''{name Boiling Point property 58 °C @ Press 12 Torr sourceNumber 1} {name Density property 0.8753 g/cm<sup>3</sup> @ Temp 20 °C sourceNumber 1}''',
'''{name Boiling Point property 175.5-176 °C @ Press 763 Torr sourceNumber 1} {name Melting Point property -74.35 °C sourceNumber 1} {name Density property 0.8402 g/cm<sup>3</sup> @ Temp 25 °C sourceNumber 1}''',
'''{name Boiling Point property 103-105 °C @ Press 16 Torr sourceNumber 1} {name Melting Point property 51 °C sourceNumber 1}''']
  
casnumber = [
       "1",
       "2",
       "3",
       "4"]

# pair each casnumber with its property string (data alone has only one column)
df = pd.DataFrame({'casnumber': casnumber, 'experimental_properties': data})

#create dicts for attributes
boiling_points = {}
melting_points = {}
densities = {}
for index, row in df.iterrows():
    
    cas = str(row.casnumber)
    experimental_property = str(row.experimental_properties)
        
    if "Boiling Point" in experimental_property:
        boiling_point = regex.match('Boiling Point property (.*?)sourceNumber', experimental_property)
        boiling_points[cas] = boiling_point
        
    if "Melting Point" in experimental_property:
        melting_point = regex.match('Melting Point property (.*?)sourceNumber', experimental_property)
        melting_points[cas] = melting_point
        
    if "Density" in experimental_property:
        density = regex.match('Density property (.*?)sourceNumber', experimental_property)
        densities[cas] = density
 

This is what the DF looks like: (image: DF)

The current code is giving me this for the boiling_points dict: (image: boiling_points)

This spreadsheet is what I would want out of the sample code, i.e. what the regex should extract from the large string: (image: desired output)

I appreciate your help! This has been stumping me all day.

Asked By: CharlieBitMaFinga

Answers:

  1. Note that regex.match only tries to match at the start of the string.
    Since your strings start with {name, the pattern would have to account for that too.
    Alternatively, use regex.search('Boiling Point property (.*?) sourceNumber', experimental_property), which scans the whole string.

  2. The return value is a regex match object or None. You can check for None and raise an error or print a message.
    boiling_points[cas] = boiling_point.group(1) should then give you what you need. group(1) returns the captured text, whereas groups() returns a tuple of all groups. The space before sourceNumber in the pattern keeps the trailing space out of the captured text.

  3. You can improve it further by getting rid of your if statements,
    because when Boiling Point is not there the search returns None.

    result = regex.search('Boiling Point property (.*?) sourceNumber', experimental_property)
    if result is not None:
        boiling_points[cas] = result.group(1)
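
Putting the three points together, a minimal sketch could look like the following. It assumes the standard-library re module and plain lists named like the ones in the question (shortened to two rows); the results dict and the labels list are illustrative names, not part of the original code.

```python
import re

# Two entries shaped like the question's sample data (shortened for brevity).
casnumber = ["1", "4"]
data = [
    "{name Boiling Point property 153 °C @ Press 12 Torr sourceNumber 1} "
    "{name Density property 0.9211 g/cm<sup>3</sup> @ Temp 20 °C sourceNumber 1}",
    "{name Boiling Point property 103-105 °C @ Press 16 Torr sourceNumber 1} "
    "{name Melting Point property 51 °C sourceNumber 1}",
]

# One target dict per property label (illustrative structure).
labels = ["Boiling Point", "Melting Point", "Density"]
results = {label: {} for label in labels}

for cas, text in zip(casnumber, data):
    for label in labels:
        # search() scans the whole string; match() would only try position 0.
        found = re.search(re.escape(label) + r" property (.*?) sourceNumber", text)
        if found is not None:
            # group(1) is the text captured by the first parenthesised group.
            results[label][cas] = found.group(1)

boiling_points = results["Boiling Point"]
melting_points = results["Melting Point"]
densities = results["Density"]
print(boiling_points)
```

Entries without a given property are simply skipped, so melting_points only gets a key for casnumber 4 here.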
Answered By: Daraan

Is there any way to parse your data from the JSON-like string to a dict and use from_dict() to make your dataframe? Then you could have each of those properties in their own columns and access them as needed. If you have to use regex for this it would make sense to do that once in the beginning, where it could be helpful for streamlining data access later on.
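
A rough sketch of that idea, assuming every record has the shape {name <label> property <value> sourceNumber <n>} as in the sample data. The helper name parse_properties and the RECORD pattern are hypothetical; from_dict with orient="index" is the regular pandas constructor.

```python
import re

import pandas as pd

casnumber = ["3", "4"]
data = [
    "{name Boiling Point property 175.5-176 °C @ Press 763 Torr sourceNumber 1} "
    "{name Melting Point property -74.35 °C sourceNumber 1} "
    "{name Density property 0.8402 g/cm<sup>3</sup> @ Temp 25 °C sourceNumber 1}",
    "{name Boiling Point property 103-105 °C @ Press 16 Torr sourceNumber 1} "
    "{name Melting Point property 51 °C sourceNumber 1}",
]

# Assumed record shape: {name <label> property <value> sourceNumber <n>}
RECORD = re.compile(r"\{name (.+?) property (.+?) sourceNumber \d+\}")


def parse_properties(text):
    """Return {label: value} for every record found in one raw string."""
    return dict(RECORD.findall(text))


parsed = {cas: parse_properties(text) for cas, text in zip(casnumber, data)}

# orient="index" makes each casnumber a row and each property label a column.
props = pd.DataFrame.from_dict(parsed, orient="index")
props.index.name = "casnumber"
print(props)
```

Properties missing for a casnumber (Density for 4 in this sample) show up as NaN in the resulting frame.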

Answered By: leonious

To get the first property value of each row from this df (the boiling point in this sample):

df['value'] = df['data'].str.extract(r'(?<=property )(.+?)(?= sourceNumber)')

output:

casnumber data value
0 1 {name Boiling Point property 153 °C @ Press 12 Torr sourceNumber 1} {name Density property 0.9211 g/cm3 @ Temp 20 °C sourceNumber 1} 153 °C @ Press 12 Torr
1 2 {name Boiling Point property 58 °C @ Press 12 Torr sourceNumber 1} {name Density property 0.8753 g/cm3 @ Temp 20 °C sourceNumber 1} 58 °C @ Press 12 Torr
2 3 {name Boiling Point property 175.5-176 °C @ Press 763 Torr sourceNumber 1} {name Melting Point property -74.35 °C sourceNumber 1} {name Density property 0.8402 g/cm3 @ Temp 25 °C sourceNumber 1} 175.5-176 °C @ Press 763 Torr
3 4 {name Boiling Point property 103-105 °C @ Press 16 Torr sourceNumber 1} {name Melting Point property 51 °C sourceNumber 1} 103-105 °C @ Press 16 Torr
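
Because the lookaround pattern above returns only the first property in each row, a variant with one pattern per label may be closer to the three dictionaries the question asks for. This is a sketch under the assumption that the column is called data as in this answer; the new column names (boiling_point, melting_point, density) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "casnumber": ["1", "4"],
    "data": [
        "{name Boiling Point property 153 °C @ Press 12 Torr sourceNumber 1} "
        "{name Density property 0.9211 g/cm<sup>3</sup> @ Temp 20 °C sourceNumber 1}",
        "{name Boiling Point property 103-105 °C @ Press 16 Torr sourceNumber 1} "
        "{name Melting Point property 51 °C sourceNumber 1}",
    ],
})

# (label in the raw string, hypothetical name of the new column)
targets = [
    ("Boiling Point", "boiling_point"),
    ("Melting Point", "melting_point"),
    ("Density", "density"),
]

for label, column in targets:
    pattern = rf"{label} property (.+?) sourceNumber"
    # expand=False returns a Series for a single capture group; rows without a match get NaN.
    df[column] = df["data"].str.extract(pattern, expand=False)

# Turn one column into a {casnumber: value} dict, skipping rows with no match.
boiling_points = df.set_index("casnumber")["boiling_point"].dropna().to_dict()
print(boiling_points)
```

The same set_index / dropna / to_dict chain can be reused for melting_point and density.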
Answered By: MAFiA303