Python regex: parse a file name with underscore-separated fields
Question:
I have the following format which parameterises a file name.
"{variable}_{domain}_{GCMsource}_{scenario}_{member}_{RCMsource}_{RCMversion}_{frequency}_{start}-{end}_{fid}.nc"
e.g.
"pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
(Note that {start}-{end} is hyphen-separated rather than underscore-separated.)
The various fields are always separated by underscores and each has a predictable (but variable) format. In the example file name I have left out the final {fid} field, as I would like that to be optional.
I’d like to use a regex in Python to parse such a file name into a dict (or similar) whose keys are the field names in the format string and whose values are the corresponding parts of the parsed file name, e.g.
{
"variable": "pr",
"domain": "EUR-11",
"GCMsource": "CNRM-CERFACS-CNRM-CM5",
"scenario": "rcp45",
"member": "r1i1p1",
"RCMsource": "CLMcom-CCLM4-8-17",
"RCMversion": "v1",
"frequency": "day",
"start": "20060101",
"end": "20101231",
"fid": None
}
The regex pattern for each field can be constrained depending on the field, e.g.
- "domain" is always 3 letters, a hyphen, then 2 digits.
- "member" is always rWiXpY where W, X and Y are numbers.
- "scenario" is always the letters "rcp" followed by 2 digits.
- "start" and "end" are always 8-digit numbers (YYYYMMDD).
There are never underscores within a field; underscores are only used to separate fields.
Note that I have used https://github.com/r1chardj0n3s/parse with some success, but I don’t think it is flexible enough for my needs (when I parse other filenames with similar formats, the formats often get confused with one another).
It would be great if the answer can explain some regex principles which will allow me to do this.
Answers:
Documentation for regular expressions in Python: https://docs.python.org/3/howto/regex.html#regex-howto
Named groups in Python regular expressions:
https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups
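As a small illustration of the named-group idea before the full pattern, here is a minimal sketch on a shortened, made-up file name ("pr_rcp45" is only an example, not a real file). Each (?P<name>...) group captures one field, and groupdict() returns the captures keyed by group name.

```python
import re

# Two named groups separated by a literal underscore.
# [^_]+ means "one or more characters that are not an underscore".
m = re.fullmatch(r"(?P<variable>[^_]+)_(?P<scenario>rcp\d{2})", "pr_rcp45")

print(m.group("variable"))   # pr
print(m.groupdict())         # {'variable': 'pr', 'scenario': 'rcp45'}
```

Because fields never contain underscores, a negated character class such as [^_]+ is a safe default for any field that has no stricter format.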
import re

test_string = "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"

# re.VERBOSE ignores whitespace and "#" comments inside the pattern,
# so each named group (?P<name>...) can sit on its own line.
pattern = r"""
(?P<variable>[^_]+)_                 # anything except an underscore
(?P<domain>[a-zA-Z]{3}-\d{2})_       # 3 letters, hyphen, 2 digits
(?P<GCMsource>[A-Z0-9-]+)_
(?P<scenario>rcp\d{2})_              # "rcp" followed by 2 digits
(?P<member>r\d+i\d+p\d+)_            # rWiXpY
(?P<RCMsource>[a-zA-Z0-9-]+)_
(?P<RCMversion>[a-zA-Z0-9]+)_
(?P<frequency>[a-zA-Z0-9-]+)_
(?P<start>\d{8})-                    # YYYYMMDD
(?P<end>\d{8})
(?:_(?P<fid>[a-zA-Z0-9]+))?          # optional field, including its underscore
\.nc                                 # literal ".nc"
"""
re_object = re.compile(pattern, re.VERBOSE)
# fullmatch requires the whole file name to fit the pattern
search_result = re_object.fullmatch(test_string)
print(search_result.groupdict())
# result:
# {'variable': 'pr', 'domain': 'EUR-11', 'GCMsource': 'CNRM-CERFACS-CNRM-CM5', 'scenario': 'rcp45', 'member': 'r1i1p1', 'RCMsource': 'CLMcom-CCLM4-8-17', 'RCMversion': 'v1', 'frequency': 'day', 'start': '20060101', 'end': '20101231', 'fid': None}
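To show how the optional {fid} field and non-matching names might be handled, here is a rough sketch. The names FILENAME_RE and parse_filename are hypothetical helpers introduced only for this illustration, and the "_abc" suffix is a made-up fid value. The sketch assumes that unconstrained fields may contain anything except an underscore, as stated in the question.

```python
import re

# Hypothetical helper: the field layout from the question, compiled once and reused.
FILENAME_RE = re.compile(
    r"""
    (?P<variable>[^_]+)_
    (?P<domain>[a-zA-Z]{3}-\d{2})_
    (?P<GCMsource>[^_]+)_
    (?P<scenario>rcp\d{2})_
    (?P<member>r\d+i\d+p\d+)_
    (?P<RCMsource>[^_]+)_
    (?P<RCMversion>[^_]+)_
    (?P<frequency>[^_]+)_
    (?P<start>\d{8})-(?P<end>\d{8})
    (?:_(?P<fid>[^_.]+))?      # underscore and fid appear together or not at all
    \.nc
    """,
    re.VERBOSE,
)


def parse_filename(name):
    """Return a dict of fields, or None if the name does not fit the format."""
    m = FILENAME_RE.fullmatch(name)
    return m.groupdict() if m else None


base = "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231"
print(parse_filename(base + ".nc")["fid"])       # None
print(parse_filename(base + "_abc.nc")["fid"])   # abc
print(parse_filename("pr_EUR-11_rcp45.nc"))      # None
```

Wrapping the separator and the field in one non-capturing group, (?:_(...))?, keeps a stray trailing underscore from being accepted, and fullmatch rejects names with extra text before or after the expected format. Returning None for a mismatch makes it easier to tell similar filename formats apart, since each format can have its own pattern that either matches in full or not at all.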