Python: how to force one string to match the format of another
Question:
I have a few Python scripts I have written for the Assessor’s office where I work. Most of these ask for an input parcel ID number (this is then used to grab certain data through an ODBC connection). The users are not very consistent about how they enter parcel IDs.
So here is my problem. They enter a parcel ID in one of 3 ways:
1: ‘1005191000060’
2: ‘001005191000060’
3: ‘0010-05-19-100-006-0’
The third way is the correct way, so I need to make sure the input is fixed to always match that format. Of course, they would rather type in the ID one of the first two ways. The parcel numbers must always be 15 digits long (20 characters with dashes).
I currently have a working method on how I fix the parcel ID, but it is very ugly. I am wondering if anyone knows a better way (or a more “Pythonic” way). I have a function that usually gets imported to all these scripts. Here is what I have:
import re

def FormatPID(in_pid):
    # Target format, e.g. 0010-05-19-100-006-0
    pid_format = re.compile(r'\d{4}-\d{2}-\d{2}-\d{3}-\d{3}-\d{1}')
    pid = in_pid.zfill(15)
    if not pid_format.match(pid):
        fixed_pid = '-'.join([pid[:4], pid[4:6], pid[6:8], pid[8:11], pid[11:-1], pid[-1]])
        return fixed_pid
    else:
        return pid

if __name__ == '__main__':
    pid = '1005191000060'
##    pid = '001005191000060'
##    pid = '0010-05-19-100-006-0'
    # test
    t = FormatPID(pid)
    print(t)
This does work just fine, but I have been bothered by this ugly code for a while and I am thinking there has got to be a better way than slicing it. I am hoping there is a way I can “force” the input to be converted to a string that matches the “pid_format” variable. Any ideas? I couldn’t find anything to do this in the regular expressions module.
Answers:
Instead of manual slicing you can use itertools.islice:
import re
from itertools import islice

groups = (4, 2, 2, 3, 3, 1)

def FormatPID(in_pid):
    pid_format = re.compile(r'\d{4}-\d{2}-\d{2}-\d{3}-\d{3}-\d{1}')
    in_pid = in_pid.zfill(15)
    if not pid_format.match(in_pid):
        # One shared iterator: each islice call continues where the last one stopped.
        it = iter(in_pid)
        return '-'.join(''.join(islice(it, i)) for i in groups)
    return in_pid

print(FormatPID('1005191000060'))
print(FormatPID('001005191000060'))
print(FormatPID('0010-05-19-100-006-0'))
Output:
0010-05-19-100-006-0
0010-05-19-100-006-0
0010-05-19-100-006-0
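The key step in this answer is that every islice call draws from the same iterator, so each call resumes where the previous one stopped. A minimal sketch of just that step follows; the helper name chunk_by_sizes is hypothetical and not part of the answer above.

```python
from itertools import islice

def chunk_by_sizes(text, sizes):
    """Split text into consecutive pieces whose lengths are given by sizes."""
    it = iter(text)  # one shared iterator; each islice call advances it
    return [''.join(islice(it, n)) for n in sizes]

parts = chunk_by_sizes('001005191000060', (4, 2, 2, 3, 3, 1))
print(parts)            # ['0010', '05', '19', '100', '006', '0']
print('-'.join(parts))  # 0010-05-19-100-006-0
```

If the iterator were recreated inside the loop, every piece would start again from the first character, which is why `it` is built once before the join.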
I wouldn’t bother using regexes. You just want to get all the digits, ignoring hyphens, left-pad with 0s, then insert the hyphens in the right places, right? So:
def format_pid(pid):
    p = pid.replace('-', '')
    if not p.isdigit():
        raise ValueError('Invalid format: {}'.format(pid))
    p = p.zfill(15)
    # You can use your `join` call instead of the following if you prefer.
    # Or Ashwini's islice call.
    return '{}-{}-{}-{}-{}-{}'.format(p[:4], p[4:6], p[6:8], p[8:11], p[11:14], p[14:])
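The question also asks whether the target layout can drive the conversion. One possible way, sketched here under assumptions, is to keep the group sizes in a single tuple and build a capturing regex from it. The names GROUPS and format_pid_re are hypothetical, and this variant additionally rejects inputs with more than 15 digits, which the function above does not.

```python
import re

# Group sizes of the canonical layout: 0010-05-19-100-006-0
GROUPS = (4, 2, 2, 3, 3, 1)

# Builds (\d{4})(\d{2})(\d{2})(\d{3})(\d{3})(\d{1}) anchored at the end.
_DIGITS = re.compile(''.join(r'(\d{%d})' % n for n in GROUPS) + r'\Z')

def format_pid_re(pid):
    """Normalize a parcel ID to the dashed layout described by GROUPS."""
    digits = pid.replace('-', '').zfill(sum(GROUPS))
    m = _DIGITS.match(digits)
    if m is None:
        raise ValueError('Invalid parcel ID: {!r}'.format(pid))
    return '-'.join(m.groups())

for raw in ('1005191000060', '001005191000060', '0010-05-19-100-006-0'):
    print(format_pid_re(raw))  # 0010-05-19-100-006-0 each time
```

Like the slicing version, this ignores where the dashes were typed, so a misplaced dash is silently corrected rather than reported.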
All of these answers are a little overdone, IMHO.
rstr is a helper module for easily generating random strings of
various types. It could be useful for fuzz testing, generating dummy
data, or other applications.
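import rstr

# Generates a random string of 14 digits (dummy data, not a formatted parcel ID).
ASSESSOR_PARCEL = rstr.xeger(r'^\d{14}$')
print(ASSESSOR_PARCEL)
# Example output: 57203112454660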
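Random digit strings like the one above are mostly useful as test input for a formatter. A minimal sketch of that use with only the standard library follows. The helper names random_parcel_digits and format_pid_sketch are hypothetical, and the formatter is a restatement of the slicing approach from this thread rather than anyone's exact code.

```python
import random
import re

PID_FORMAT = re.compile(r'\d{4}-\d{2}-\d{2}-\d{3}-\d{3}-\d\Z')

def random_parcel_digits(rng, length):
    """Return a random string of decimal digits of the given length."""
    return ''.join(rng.choice('0123456789') for _ in range(length))

def format_pid_sketch(pid):
    """Strip dashes, left-pad to 15 digits, then re-insert the dashes."""
    p = pid.replace('-', '').zfill(15)
    return '-'.join([p[:4], p[4:6], p[6:8], p[8:11], p[11:14], p[14:]])

rng = random.Random(0)  # fixed seed so a failing case can be reproduced
for _ in range(1000):
    raw = random_parcel_digits(rng, rng.choice((13, 15)))
    formatted = format_pid_sketch(raw)
    # The result must have the dashed layout and keep every digit in order.
    assert PID_FORMAT.match(formatted), formatted
    assert formatted.replace('-', '') == raw.zfill(15)
```

The loop exercises both undashed input lengths from the question (13 and 15 digits) and checks that formatting never changes the digits themselves.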