regex formatting on a complex file name

Question:

I struggled to write a regex that pulls the first date from the following file name:

TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv

I ran the following, expecting to extract the first date in each file name, formatted as 2022-03-04:

date = re.search('\b(\d{4}-\d{2}-\d{2})\.', filename)

I obtained the following error on the re.search line: AttributeError: 'NoneType' object has no attribute 'group'

The answer below is helpful and resolved the issue. It is a valuable resource for learning how to utilize a regex search.

Asked By: S.Martinelli


Answers:

You need to check if the regex matched before you plunge ahead and try to extract the matched text.

for filename in filenames:
    match = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', filename)
    if not match:
        continue
    date = match.group(1)
    ...

Notice also the use of an r'...' raw string, and the use of group(1) to extract only the text matched by the parenthesized expression.
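To illustrate both points, here is a small sketch. The file name `report 2022-03-04.csv` is made up for the example (it is one the pattern does match), and the variable names are only illustrative. In an ordinary string literal Python turns `\b` into a backspace character before the regex engine sees it, while `r'\b'` reaches the engine as a word-boundary assertion.

```python
import re

plain = '\b'   # one character: backspace (U+0008)
raw = r'\b'    # two characters: a backslash and 'b'
print(len(plain), len(raw))  # 1 2

# Hypothetical file name in which the pattern does match
sample = 'report 2022-03-04.csv'
m = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', sample)

whole = m.group(0)  # entire match, including the trailing dot
date = m.group(1)   # only the parenthesized part
print(whole)  # 2022-03-04.
print(date)   # 2022-03-04
```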

Answered By: tripleee

There are a few problems with your regex.

First, the regex itself is incorrect:

\b       # Match a word boundary (non-word character followed by word character or vice versa)
(        # followed by a group which consists of
  \d{4}- # 4 digits and '-', then
  \d{2}- # 2 digits and '-', then
  \d{2}  # another 2 digits
)        # and finally followed by
\.       # a literal dot

Since your filename (TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv) doesn’t have any such group, re.search() fails and returns None. Here is why:

  • 2022-03-04 is followed by '-', not by a dot.
  • \b does not match, because both _ and 2 are considered word characters.
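Both reasons can be checked directly. The following is just a quick sketch using the file name from the question; the helper patterns are only there for the demonstration.

```python
import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'

# '_' is a word character, so there is no word boundary between '_' and '2'
boundary = re.search(r'_\b2', filename)
print(boundary)    # None

# The character right after the first date is '-', not '.'
after_date = re.search(r'\d{4}-\d{2}-\d{2}(.)', filename).group(1)
print(after_date)  # -

# Hence the pattern from the question finds nothing
original = re.search(r'\b(\d{4}-\d{2}-\d{2})\.', filename)
print(original)    # None
```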

That being said, the regex should be modified, like this:

(?<=_)   # Match something preceded by '_', which will not be included in our match,
\d{4}-   # 4 digits and '-', then
\d{2}-   # 2 digits and '-', then
\d{2}    # another 2 digits, then
\b       # a word boundary

Now, do you see those backslashes? Always remember that you need to escape them again in strings. This can be automated using raw strings:

r'(?<=_)\d{4}-\d{2}-\d{2}\b'

Try it:

import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'
match = re.search(r'(?<=_)\d{4}-\d{2}-\d{2}\b', filename).group(0)

print(match) # 2022-03-04
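
Calling .group(0) directly on the result still raises the same AttributeError for any file name that contains no date. A minimal sketch that adds the None check, assuming you want None back in that case; `first_date` and `DATE_RE` are hypothetical names, not part of the re module:

```python
import re

# Lookbehind pattern, compiled once so it can be reused for many file names
DATE_RE = re.compile(r'(?<=_)\d{4}-\d{2}-\d{2}\b')

def first_date(filename):
    """Return the first YYYY-MM-DD that follows an underscore, or None."""
    match = DATE_RE.search(filename)
    return match.group(0) if match else None

print(first_date('TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'))  # 2022-03-04
print(first_date('no_date_here.csv'))  # None
```
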
Answered By: InSync