How to parse text and extract values of parameters in Python

Question:

I have a text file like this (it actually has 10000+ lines):

Generate placement
Place object 4 at (24,21)
Place object 21 at (89, 4)

Generate movement
At time 10, move object 4 to (3,65) with speed 10
At time 54, move object 21 to (43,6) with speed 4

Generate flux
Set intensity 10, simulation time 5

Here is what I want to get from this file:

{
'placement': [{'object_placement': 4, 'location': (24,21)}, 
              {'object_placement': 21, 'location': (89, 4)}],
'movement': [{'time': 10, 'object': 4, 'destination': (3,65), 'speed': 10},
             {'time': 54, 'object': 21, 'destination': (43,6), 'speed': 4}],
'flux': [{'intensity': 10, 'simulation_time': 5}]
}

I was looking at this question, but I am not sure whether it is possible to use the Template class or Jinja in my case.

Asked By: illuminato


Answers:

You can parse data in the desired format by using a set of regular expressions in your code. Here is a simple proof of concept.

import re

source = """Generate placement
Place object 4 at (24,21)
Place object 21 at (89, 4)

Generate movement
At time 10, move object 4 to (3,65) with speed 10
At time 54, move object 21 to (43,6) with speed 4

Generate flux
Set intensity 10, simulation time 5"""

results = {
    'placement': [],
    'movement': [],
    'flux': [],
}
state = ""

for line in source.splitlines():
    new_state = re.search("Generate ([a-zA-Z]+)", line)
    if new_state:
        state = new_state.groups()[0]
    else:
        # Parse the appropriate data for the current state
        if state == "placement":
            placement_data = re.search(r"Place object ([0-9]+) at \(([0-9]+),\s*([0-9]+)\)", line)
            if placement_data:
                placement_groups = placement_data.groups()
                results[state].append({'object_placement':placement_groups[0], 'location': (placement_groups[1],placement_groups[2])})
        elif state == "movement":
            movement_data = re.search(r"At time ([0-9]+), move object ([0-9]+) to \(([0-9]+),\s*([0-9]+)\) with speed ([0-9]+)", line)
            if movement_data:
                movement_groups = movement_data.groups()
                results[state].append({'time':movement_groups[0], 'object': movement_groups[1], 'destination': (movement_groups[2], movement_groups[3]), 'speed': movement_groups[4]})
        elif state == "flux":
            flux_data = re.search("Set intensity ([0-9]+), simulation time ([0-9]+)", line)
            if flux_data:
                flux_groups = flux_data.groups()
                results[state].append({'intensity': flux_groups[0], 'simulation_time': flux_groups[1]})

print(results)

In the example above, results is the object that includes your parsed results. I used a regular expression based on each of the states you described for placement, movement, and flux and how those data are formatted in the source text.

The idea is to iterate over each line of the source text, first checking whether the line is a "Generate ..." header that changes the state. If it is not, the line is expected to follow the format you described for the current state. Regular expressions are useful for capturing particular fields of a well-defined data format. Finally, the captured groups are used to populate the data structure that stores the results.
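The same header-then-lines idea can also be written as a table-driven loop, which may be easier to extend when new sections appear. This is only a sketch under assumptions: the names HEADER, SECTIONS, and parse_lines are hypothetical and not part of the code above, and the patterns assume every line follows exactly the formats shown in the question. Captured values are left as strings.

```python
import re

# Matches a section header such as "Generate placement".
HEADER = re.compile(r"Generate ([a-zA-Z]+)")

# Hypothetical lookup table: one (pattern, builder) pair per section.
# The builder turns the captured groups into the dict stored for that line.
SECTIONS = {
    "placement": (
        re.compile(r"Place object (\d+) at \((\d+),\s*(\d+)\)"),
        lambda g: {"object_placement": g[0], "location": (g[1], g[2])},
    ),
    "movement": (
        re.compile(
            r"At time (\d+), move object (\d+) to \((\d+),\s*(\d+)\) with speed (\d+)"
        ),
        lambda g: {
            "time": g[0],
            "object": g[1],
            "destination": (g[2], g[3]),
            "speed": g[4],
        },
    ),
    "flux": (
        re.compile(r"Set intensity (\d+), simulation time (\d+)"),
        lambda g: {"intensity": g[0], "simulation_time": g[1]},
    ),
}


def parse_lines(lines):
    """Parse an iterable of lines (a list or an open file) into a results dict."""
    results = {name: [] for name in SECTIONS}
    state = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            # A header line only switches the current section.
            state = header.group(1)
            continue
        if state in SECTIONS:
            pattern, build = SECTIONS[state]
            match = pattern.search(line)
            if match:
                results[state].append(build(match.groups()))
    return results


sample = [
    "Generate placement",
    "Place object 4 at (24,21)",
    "Generate flux",
    "Set intensity 10, simulation time 5",
]
print(parse_lines(sample))
```

Because parse_lines accepts any iterable of lines, an open file object can be passed in directly, so a 10000+ line file is processed one line at a time rather than being loaded into a single string first.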

Here are the results from running the code above (formatted):

{
  'placement': [
    {'object_placement': '4', 'location': ('24', '21')},
    {'object_placement': '21', 'location': ('89', '4')}
  ],
  'movement': [
    {'time': '10', 'object': '4', 'destination': ('3', '65'), 'speed': '10'},
    {'time': '54', 'object': '21', 'destination': ('43', '6'), 'speed': '4'}
  ],
  'flux': [{'intensity': '10', 'simulation_time': '5'}]
}

Note that I have not converted the parsed data to a numeric format, but this can be easily done with the int() function. I’ll leave that as an exercise for the OP to implement.
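For readers who want a starting point for that conversion, here is one possible sketch. The helper name to_int is hypothetical, and it assumes every leaf value in the parsed structure is a string of digits, which holds for the formats shown in the question; anything else would raise a ValueError.

```python
def to_int(value):
    """Recursively convert digit strings to int inside dicts, lists, and tuples."""
    if isinstance(value, dict):
        return {key: to_int(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type so tuples stay tuples.
        return type(value)(to_int(item) for item in value)
    return int(value)


# A small sample in the shape the question asks for, with string values.
parsed = {
    "placement": [{"object_placement": "4", "location": ("24", "21")}],
    "flux": [{"intensity": "10", "simulation_time": "5"}],
}

print(to_int(parsed))
# {'placement': [{'object_placement': 4, 'location': (24, 21)}], 'flux': [{'intensity': 10, 'simulation_time': 5}]}
```

An alternative is to call int() on each captured group at the point where the dictionaries are built, which avoids a second pass over the results.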

Answered By: h0r53