Use Python format string in reverse for parsing
Question:
I’ve been using the following Python code to format an integer part ID as a formatted part number string:
pn = 'PN-{:0>9}'.format(id)
I would like to know if there is a way to use that same format string ('PN-{:0>9}') in reverse to extract the integer ID from the formatted part number. If that can’t be done, is there a way to use a single format string (or regex?) to create and parse?
Answers:
How about:
id = int(pn.split('-')[1])
This splits the part number at the dash, takes the second component and converts it to integer.
P.S. I’ve kept id as the variable name so that the connection to your question is clear. It is a good idea to rename that variable so that it doesn’t shadow the built-in id() function.
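As a minimal sketch of that split-based approach, the round trip could be wrapped in two small helpers. The names PN_FORMAT, format_pn, parse_pn and part_id are illustrative and not from the question, and the parser assumes non-negative IDs and a prefix that contains no dash other than the separator.

```python
PN_FORMAT = 'PN-{:0>9}'  # same format string as in the question


def format_pn(part_id):
    """Format an integer part ID as a part number string."""
    return PN_FORMAT.format(part_id)


def parse_pn(pn):
    """Recover the integer ID by splitting at the first dash."""
    _prefix, digits = pn.split('-', 1)
    return int(digits)  # int() accepts the zero-padded digits


pn = format_pn(123)
print(pn)            # PN-000000123
print(parse_pn(pn))  # 123
```

Note that this still keeps the knowledge of the layout in two places (the format string and the split), so it does not really reuse the format string for parsing.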
You might find the “Simulating scanf()” section of the Python re module documentation interesting.
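In that spirit, a hand-written regular expression for this particular part number layout might look like the sketch below. The names PN_PATTERN and parse_pn_regex are assumptions made for illustration, and the pattern is written by hand to mirror 'PN-{:0>9}' rather than derived from the format string.

```python
import re

# Hand-written counterpart of 'PN-{:0>9}': the literal prefix followed by
# at least nine digits (the format pads to a minimum width of nine).
PN_PATTERN = re.compile(r'PN-(\d{9,})')


def parse_pn_regex(pn):
    """Return the integer ID, or raise ValueError if pn is malformed."""
    match = PN_PATTERN.fullmatch(pn)
    if match is None:
        raise ValueError('not a part number: {!r}'.format(pn))
    return int(match.group(1))


print(parse_pn_regex('PN-000000123'))  # 123
```

Compared with splitting at the dash, this also validates the input, so a string such as 'PN-123' is rejected instead of being silently accepted.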
The parse module “is the opposite of format()”.
Example usage:
>>> import parse
>>> format_string = 'PN-{:0>9}'
>>> id = 123
>>> pn = format_string.format(id)
>>> pn
'PN-000000123'
>>> parsed = parse.parse(format_string, pn)
>>> parsed
<Result ('123',) {}>
>>> parsed[0]
'123'
Here’s a solution in case you don’t want to use the parse module. It converts format strings into regular expressions with named groups. It makes a few assumptions (described in the docstring) that were okay in my case, but may not be okay in yours.
import re


def match_format_string(format_str, s):
    """Match s against the given format string, return dict of matches.

    We assume all of the arguments in format string are named keyword arguments (e.g. no {} or
    {:0.2f}). We also assume that all chars are allowed in each keyword argument, so separators
    need to be present which aren't present in the keyword arguments (e.g. '{one}{two}' won't work
    reliably as a format string but '{one}-{two}' will if the hyphen isn't used in {one} or {two}).

    We raise if the format string does not match s.

    Example:
        fs = '{test}-{flight}-{go}'
        s = fs.format(test='first', flight='second', go='third')
        match_format_string(fs, s) -> {'test': 'first', 'flight': 'second', 'go': 'third'}
    """
    # First split on any keyword arguments, note that the names of keyword arguments will be in the
    # 1st, 3rd, ... positions in this list
    tokens = re.split(r'\{(.*?)\}', format_str)
    keywords = tokens[1::2]

    # Now replace keyword arguments with named groups matching them. We also escape between keyword
    # arguments so we support meta-characters there. Re-join tokens to form our regexp pattern
    tokens[1::2] = ['(?P<{}>.*)'.format(keyword) for keyword in keywords]
    tokens[0::2] = [re.escape(literal) for literal in tokens[0::2]]
    pattern = ''.join(tokens)

    # Use our pattern to match the given string, raise if it doesn't match
    matches = re.match(pattern, s)
    if not matches:
        raise ValueError("Format string did not match")

    # Return a dict with all of our keywords and their values
    return {x: matches.group(x) for x in keywords}
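The function above only handles named fields, so it cannot take 'PN-{:0>9}' from the question as-is. One possible variation, sketched here under my own assumptions, lets the standard library tokenize the format string with string.Formatter().parse() and turns unnamed fields into positional groups. The name format_to_regex is hypothetical, the format spec (such as 0>9) is ignored rather than interpreted, and a field name that appears twice would produce an invalid pattern.

```python
import re
from string import Formatter


def format_to_regex(format_str):
    """Build a compiled regex in which every replacement field is a group."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(format_str):
        parts.append(re.escape(literal))
        if field is None:
            continue  # trailing literal text with no replacement field
        if field.isidentifier():
            parts.append('(?P<{}>.*?)'.format(field))  # named field
        else:
            parts.append('(.*?)')  # positional field such as {} or {0}
    return re.compile(''.join(parts))


pattern = format_to_regex('PN-{:0>9}')
match = pattern.fullmatch('PN-000000123')
print(int(match.group(1)))  # 123
```

Because the padding is not interpreted, the captured text is still '000000123' and the caller converts it with int(). The same separator caveat applies as in the docstring above: adjacent fields without literal text between them cannot be split reliably.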
Use lucidity
import lucidity
template = lucidity.Template('model', '/jobs/{job}/assets/{asset_name}/model/{lod}/{asset_name}_{lod}_v{version}.{filetype}')
path = '/jobs/monty/assets/circus/model/high/circus_high_v001.abc'
data = template.parse(path)
print(data)
# Output
# {'job': 'monty',
# 'asset_name': 'circus',
# 'lod': 'high',
# 'version': '001',
# 'filetype': 'abc'}