Python regex split without empty string

Question:

I have the following file names that exhibit this pattern:

000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...

I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:

time_info = re.split('^[0-9]+_[LU]_|-|.txt$', f)

But this gives me two extra empty strings in the returned list:

time_info=['', '20111007T084734', '20111008T023142', '']

How do I get only the two timestamps? That is, I want:

time_info=['20111007T084734', '20111008T023142']
Asked By: tonga

||

Answers:

I’m no Python expert, but you could simply remove the empty strings from the list:

str_list = re.split('^[0-9]+_[LU]_|-|.txt$', f)
# On Python 3, filter() returns an iterator, so wrap it in list()
time_info = list(filter(None, str_list))
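
The empty strings appear because the pattern matches at the very start and the very end of the name, so re.split reports the (empty) text before the first match and after the last one. A minimal sketch of the same filtering idea with a list comprehension, assuming f holds one of the file names from the question (the escaped dot is my own tweak, not part of the original pattern):

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Same alternation as in the question, with the dot escaped so that
# it only matches a literal '.' (an assumption about the intent)
parts = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)

# Keep only the non-empty pieces
time_info = [p for p in parts if p]
print(time_info)
```

This behaves the same on Python 2 and Python 3, since a list comprehension always builds a list.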
Answered By: Elliot Bonneville

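If the timestamps are always after the second _ then you can use str.split after slicing off the extension. (str.strip(".txt") also happens to work on this name, but it strips the characters '.', 't' and 'x' from both ends rather than removing the suffix.)

>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs[:-len(".txt")].split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']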
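
If you would rather not hard-code the extension, one possible variation is os.path.splitext from the standard library. This is only a sketch, assuming every name has a single extension and the timestamps follow the second underscore; the variable names are made up for illustration:

```python
import os.path

name = "000014_L_20111007T084734-20111008T023142.txt"

# splitext separates the extension from the rest of the name
stem, ext = os.path.splitext(name)

# Split on at most two underscores, keep the tail, then split on '-'
time_info = stem.split("_", 2)[-1].split("-")
print(time_info)

# str.strip treats its argument as a set of characters, not a suffix
stripped = "text.txt".strip(".txt")
```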
Answered By: Ashwini Chaudhary

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']

or, somewhat more general:

>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
Answered By: Elazar

Don’t use re.split(); use the groups() method of regex Match objects instead.

>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')

You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
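
A short sketch of the named-group variant, in case it helps. The group names start and end are hypothetical, chosen only for this example, and the None check is there because re.search returns None when a name does not follow the pattern:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Named groups; 'start' and 'end' are illustrative names
pattern = re.compile(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.')

match = pattern.search(f)
if match is None:
    raise ValueError('unexpected file name: %r' % f)

# groupdict() maps each group name to the text it captured
time_info = match.groupdict()
print(time_info)
```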

Answered By: JAB

Since this came up on Google, and for completeness: try using re.findall as an alternative!

This does require a little rethinking, but it still returns a list of matches, like split does. That makes it a nice drop-in replacement in some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.

Yes, this is a bit of a “you’re asking the wrong question” answer, and it doesn’t use re.split(). It does solve the underlying issue, though: your list of matches suddenly has zero-length strings in it, and you don’t want that.
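
For the file names in the question, a re.findall version might look like the following. Both patterns are my own guesses at what fits those names (eight digits, a T, six digits), not something taken from the question, so adjust them to your data:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Describe what a timestamp looks like instead of what separates them
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)

# Variant with a lookbehind: digits-T-digits preceded by '_' or '-'
time_info_lb = re.findall(r'(?<=[_-])\d+T\d+', f)
print(time_info_lb)
```

Neither pattern can match the empty string, so no zero-length entries end up in the result.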

Answered By: PipperChip