python/regex: match letter only or letter followed by number
Question:
I want to split this string ‘AB4F2D’ in [‘A’, ‘B4’, ‘F2’, ‘D’].
Essentially, if character is a letter, return the letter, if character is a number return previous character plus present character (luckily there is no number >9 so there is never a X12).
I have tried several combinations but I am not able to find the correct one:
def get_elements(input_string):
patterns = [
r'[A-Z][A-Z0-9]',
r'[A-Z][A-Z0-9]|[A-Z]',
r'D|Dd',
r'[A-Z]|[A-Z][0-9]',
r'[A-Z]{1}|[A-Z0-9]{1,2}'
]
for p in patterns:
elements = re.findall(p, input_string)
print(elements)
results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks
Answers:
Dd?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your D|Dd
is Dd|D
. But just use Dd?
.
Use Extended Groups
There is special syntax for python regexes allowing you to match ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info in the docs. If you want to test around I recommend using a regex tester like this and make sure to select python syntax.
I want to split this string ‘AB4F2D’ in [‘A’, ‘B4’, ‘F2’, ‘D’].
Essentially, if character is a letter, return the letter, if character is a number return previous character plus present character (luckily there is no number >9 so there is never a X12).
I have tried several combinations but I am not able to find the correct one:
def get_elements(input_string):
patterns = [
r'[A-Z][A-Z0-9]',
r'[A-Z][A-Z0-9]|[A-Z]',
r'D|Dd',
r'[A-Z]|[A-Z][0-9]',
r'[A-Z]{1}|[A-Z0-9]{1,2}'
]
for p in patterns:
elements = re.findall(p, input_string)
print(elements)
results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks
Dd?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your D|Dd
is Dd|D
. But just use Dd?
.
Use Extended Groups
There is special syntax for python regexes allowing you to match ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info in the docs. If you want to test around I recommend using a regex tester like this and make sure to select python syntax.