Python regex parse file name with underscore separated fields

Question:

I have the following format which parameterises a file name.

"{variable}_{domain}_{GCMsource}_{scenario}_{member}_{RCMsource}_{RCMversion}_{frequency}_{start}-{end}_{fid}.nc"

e.g.

"pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"

(Note that {start}-{end} is meant to be hyphen-separated rather than underscore-separated.)

The various fields are always separated by underscores and contain a predictable (but variable) format. In the example file name I have left out the final {fid} field as I would like that to be optional.

I’d like to use regex in Python to parse such a file name into a dict (or similar) whose keys are the field names from the format string and whose values are the corresponding parts of the parsed file name, e.g.

{
    "variable": "pr",
    "domain": "EUR-11",
    "GCMsource": "CNRM-CERFACS-CNRM-CM5",
    "scenario": "rcp45",
    "member": "r1i1p1",
    "RCMsource": "CLMcom-CCLM4-8-17",
    "RCMversion": "v1",
    "frequency": "day",
    "start": "20060101",
    "end": "20101231",
    "fid": None
}

The regex pattern for each field can be constrained depending on the field, e.g.

  • "domain" is always 3 letters – 2 numbers
  • "member" is always rWiXpY where W, X and Y are numbers.
  • "scenario" always contains the letters "rcp" followed by 2 numbers.
  • "start" and "end" are always 8 digit numbers (YYYYMMDD)

There are never underscores within a field; underscores are only used to separate fields.

Note that I have used https://github.com/r1chardj0n3s/parse with some success, but I don’t think it is flexible enough for my needs (when I try to parse other filenames with similar formats, the formats are often confused with one another).

It would be great if the answer can explain some regex principles which will allow me to do this.

Asked By: ogb119


Answers:

The Python regular expression HOWTO is a good starting point: https://docs.python.org/3/howto/regex.html#regex-howto

The key feature here is the named group, written (?P<name>...), which lets you retrieve each matched field by name via groupdict():
https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups
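As a small illustration of the principle, here is a minimal sketch that applies named groups to just the {start}-{end} part of the file name, with an optional trailing field. The variable names are only for illustration; wrapping the separator and the optional group together in a non-capturing group, (?:...)?, is one way to make a field optional so that it is reported as None when absent.

```python
import re

# Two named groups separated by a literal hyphen, followed by an
# optional "_<fid>" part wrapped in a non-capturing group.
date_range = re.compile(
    r"(?P<start>\d{8})-(?P<end>\d{8})(?:_(?P<fid>[A-Za-z0-9]+))?"
)

without_fid = date_range.fullmatch("20060101-20101231").groupdict()
with_fid = date_range.fullmatch("20060101-20101231_abc").groupdict()

print(without_fid)  # {'start': '20060101', 'end': '20101231', 'fid': None}
print(with_fid)     # {'start': '20060101', 'end': '20101231', 'fid': 'abc'}
```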

import re

test_string = """pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"""
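pattern = r"""
(?P<variable>\w+)_                    # e.g. pr
(?P<domain>[a-zA-Z]{3}-\d{2})_        # 3 letters, a hyphen, 2 digits
(?P<GCMsource>([A-Z0-9]+-?)+)_        # upper-case letters/digits, hyphen-separated
(?P<scenario>rcp\d{2})_               # "rcp" followed by 2 digits
(?P<member>r\d+i\d+p\d+)_             # rWiXpY, where W, X and Y are numbers
(?P<RCMsource>([a-zA-Z0-9]-?)+)_      # letters/digits, hyphen-separated
(?P<RCMversion>[a-zA-Z0-9]+)_
(?P<frequency>[a-zA-Z0-9-]+)_
(?P<start>\d{8})-                     # YYYYMMDD
(?P<end>\d{8})                        # YYYYMMDD
_?
(?P<fid>[a-zA-Z0-9]+)?                # optional; None when absent
\.nc                                  # literal ".nc" extension
"""

# VERBOSE ignores whitespace in the pattern and allows the # comments above
re_object = re.compile(pattern, re.VERBOSE)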

search_result = re_object.match(test_string)
print(search_result.groupdict())
# result:
"""
{'variable': 'pr', 'domain': 'EUR-11', 'GCMsource': 'CNRM-CERFACS-CNRM-CM5', 'scenario': 'rcp45', 'member': 'r1i1p1', 'RCMsource': 'CLMcom-CCLM4-8-17', 'RCMversion': 'v1', 'frequency': 'day', 'start': '20060101', 'end': '20101231', 'fid': None}
"""
Answered By: ali
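
As a variation on the approach above, the regex can also be generated from the format string in the question rather than written out by hand. The following is only a sketch under some assumptions: the names FIELD_PATTERNS, DEFAULT_PATTERN, build_pattern and parse_filename are hypothetical, the per-field sub-patterns are taken from the constraints listed in the question, and any field without a listed constraint falls back to "one or more characters that are not an underscore", which relies on the statement that fields never contain underscores.

```python
import re

# Hypothetical per-field sub-patterns based on the question's constraints.
FIELD_PATTERNS = {
    "domain": r"[A-Za-z]{3}-\d{2}",   # 3 letters, hyphen, 2 digits
    "scenario": r"rcp\d{2}",          # "rcp" followed by 2 digits
    "member": r"r\d+i\d+p\d+",        # rWiXpY
    "start": r"\d{8}",                # YYYYMMDD
    "end": r"\d{8}",                  # YYYYMMDD
}
DEFAULT_PATTERN = r"[^_]+"            # assumed fallback: no underscores

TEMPLATE = ("{variable}_{domain}_{GCMsource}_{scenario}_{member}_{RCMsource}"
            "_{RCMversion}_{frequency}_{start}-{end}_{fid}.nc")


def build_pattern(template, optional=()):
    """Turn a '{name}' style template into a compiled regex."""
    # With a capturing group, re.split returns literal text and field
    # names alternately: literal, name, literal, name, ..., literal.
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for i in range(0, len(parts) - 1, 2):
        literal, name = parts[i], parts[i + 1]
        sub = FIELD_PATTERNS.get(name, DEFAULT_PATTERN)
        piece = re.escape(literal) + f"(?P<{name}>{sub})"
        if name in optional:
            # Make the field and the separator before it optional together.
            piece = f"(?:{piece})?"
        regex += piece
    regex += re.escape(parts[-1])     # trailing literal, here ".nc"
    return re.compile(regex)


FILENAME_RE = build_pattern(TEMPLATE, optional=("fid",))


def parse_filename(name):
    """Return a dict of fields, or None if the name does not fit."""
    match = FILENAME_RE.fullmatch(name)
    return match.groupdict() if match else None


fields = parse_filename(
    "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_"
    "CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
)
print(fields["domain"], fields["member"], fields["fid"])
```

Using fullmatch means a file name with extra leading or trailing text is rejected rather than partially matched, which may help when several similar formats have to be told apart. Each format can have its own template and its own constraints.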