Python regex split without empty string
Question:
I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamps? That is, I want:
time_info=['20111007T084734', '20111008T023142']
Answers:
I’m no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# In Python 3 filter() returns an iterator, so wrap it in list()
time_info = list(filter(None, str_list))
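For context, the empty strings appear because two of the alternatives in the pattern match at the very start and the very end of the file name, and re.split reports the (empty) text on the outer side of those matches. A minimal sketch of the same filtering idea with a list comprehension, using the file name from the question:

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"
pattern = r"^[0-9]+_[LU]_|-|\.txt$"

parts = re.split(pattern, f)
# The prefix alternative matches at index 0 and the suffix alternative
# matches at the end, so the first and last pieces are empty strings.
time_info = [p for p in parts if p]  # keep only non-empty pieces

print(parts)
print(time_info)
```

The list comprehension does the same job as filter(None, ...) and always yields a list, regardless of Python version.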
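If the timestamps are always after the second _ then you can use str.split and str.strip (strip(".txt") removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix, which is safe here only because what remains starts and ends with digits):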
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
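Because str.strip works on a set of characters, a variant that removes the extension explicitly may be safer for other file names. The sketch below uses os.path.splitext for that; the file name "report.txt" is a made-up example, included only to show the pitfall.

```python
import os.path

# Hypothetical name, used only to show what strip(".txt") can do:
# the trailing "t" of "report" is stripped along with the extension.
assert "report.txt".strip(".txt") == "repor"

name = "000014_L_20111007T084734-20111008T023142.txt"
stem, ext = os.path.splitext(name)  # splits off the ".txt" extension
# Drop everything up to the second underscore, then split on the hyphen.
time_info = stem.split("_", 2)[-1].split("-")

print(time_info)
```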
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
Don’t use re.split(); use the groups() method of regex Match/SRE_Match objects.
>>> import re
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
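A short sketch of the named-group variant follows. The group names start and end are illustrative choices (the answer itself suggests groupA and groupB), and the None check is an added precaution for names that do not fit the pattern.

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"
# "start" and "end" are illustrative group names, not from the original answer.
pattern = re.compile(r"[LU]_(?P<start>\w+)-(?P<end>\w+)\.")

match = pattern.search(f)
# search() returns None when nothing matches, so guard before groupdict().
time_info = match.groupdict() if match else {}

print(time_info)
```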
Since this came up on Google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a “you’re asking the wrong question” answer and doesn’t use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don’t want that.
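As a rough illustration of the re.findall approach, here are two possible patterns for the file names in the question. Both are assumptions about the timestamp format (eight digits, a T, six digits) rather than something stated in the thread.

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"

# Describe what a timestamp looks like instead of what separates them.
simple = re.findall(r"\d{8}T\d{6}", f)

# Same idea with a lookbehind and a lookahead: the surrounding
# delimiters must be present but are not part of the returned text.
with_lookarounds = re.findall(r"(?<=[_-])\d+T\d+(?=[-.])", f)

print(simple)
print(with_lookarounds)
```

Since findall only returns the text that matched, no empty strings need to be filtered out afterwards.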