regex formatting on a complex file name
Question:
I struggled to properly format regex to pull the first date in the following file name:
TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv
I ran the following and expected to extract the first date in each file name, formatted as 2022-03-04:
date = re.search('\b(\d{4}-\d{2}-\d{2})\.', filename)
I obtained the following error on the re.search line: AttributeError: 'NoneType' object has no attribute 'group'
The answer below is helpful and resolved the issue. It is a valuable resource for learning how to utilize a regex search.
Answers:
You need to check if the regex matched before you plunge ahead and try to extract the matched text.
for filename in filenames:
    match = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', filename)
    if not match:
        continue
    date = match.group(1)
    ...
Notice also the use of an r'...' raw string, and the use of group(1) to extract only the match from within the parenthesized expression.
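To illustrate both points, here is a small sketch. The sample string is made up for the demonstration and is not the file name from the question. In a plain string literal, '\b' is a single backspace character, while r'\b' keeps the backslash so the regex engine sees a word boundary.

```python
import re

# In a plain string literal, '\b' is one backspace character (0x08).
plain = '\b'
# In a raw string the backslash survives, so the regex engine sees \b.
raw = r'\b'
print(len(plain), len(raw))  # 1 2

# Hypothetical sample text, chosen so that the pattern does match.
sample = 'report 2022-03-04.csv'
match = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', sample)
if match:
    print(match.group(0))  # 2022-03-04.  (whole match, including the dot)
    print(match.group(1))  # 2022-03-04   (only the parenthesized group)
```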
There are a few problems with your regex.
First, the regex itself is incorrect:
\b          # Match a word boundary (non-word character followed by word character or vice versa)
(           # followed by a group which consists of
  \d{4}-    # 4 digits and '-', then
  \d{2}-    # 2 digits and '-', then
  \d{2}     # another 2 digits
)           # and eventually succeeded by
\.          # a dot
Since your filename (TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv) doesn't contain any such sequence, re.search() fails and returns None. Here is why:
2022-03-04 is not succeeded by a dot.
\b does not match, as both _ and 2 are considered word characters.
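Both reasons can be checked directly. The following is only a sketch using the file name from the question; the variable names are chosen for illustration.

```python
import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'

# The original pattern finds nothing, so re.search() returns None.
original = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', filename)
print(original)  # None

# Reason 1: the character right after the date is '-', not '.'.
date_only = re.search(r'\d{4}-\d{2}-\d{2}', filename)
next_char = filename[date_only.end()]
print(next_char)  # -

# Reason 2: '_' counts as a word character, so there is no word
# boundary between '_' and '2'.
boundary = re.search(r'\b2022', filename)
print(boundary)  # None
underscore_is_word_char = re.fullmatch(r'\w', '_') is not None
print(underscore_is_word_char)  # True
```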
That being said, the regex should be modified, like this:
(?<=_)  # Match something preceded by '_', which will not be included in our match,
\d{4}-  # 4 digits and '-', then
\d{2}-  # 2 digits and '-', then
\d{2}   # another 2 digits, then
\b      # a word boundary
Now, do you see those backslashes? Always remember that you need to escape them again in strings. This can be automated using raw strings:
r'(?<=_)\d{4}-\d{2}-\d{2}\b'
Try it:
import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'
match = re.search(r'(?<=_)\d{4}-\d{2}-\d{2}\b', filename).group(0)
print(match)  # 2022-03-04
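Chaining .group(0) directly onto re.search() raises the same AttributeError whenever a file name has no date. One way to combine the corrected pattern with the "check before you extract" advice is sketched below. The helper name extract_first_date and the second file name are hypothetical, introduced only for this example.

```python
import re

# Compiled once; the lookbehind and word boundary are the ones explained above.
DATE_RE = re.compile(r'(?<=_)\d{4}-\d{2}-\d{2}\b')

def extract_first_date(filename):
    """Return the first YYYY-MM-DD string that follows an underscore, or None."""
    match = DATE_RE.search(filename)
    if match is None:
        return None
    return match.group(0)

filenames = [
    'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv',
    'no_date_here.csv',  # made-up name with no date in it
]
for name in filenames:
    print(name, '->', extract_first_date(name))
```

Returning None instead of raising leaves it to the caller to decide whether a file without a date should be skipped or reported.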