Is it possible to use a regex to match different parts of one word in Python?
Question:
I did a little research and it seems to be called a non-consuming regex, but it is not working with re.sub(). What I want to achieve is to delete parts of a string that contains no spaces. The pattern of the original string is pubmed23n[0-9]*.xml.gz, and I want to keep only the [0-9]* part of the string.
Example:
Before:
pubmed23n0001.xml.gz
After:
0001
Right now I need two steps to achieve that: either substitute ^pubmed23n and .xml.gz separately, or substitute (^pubmed23n)|(.xml.gz) twice. I tried (?=pubmed23n)(?=.xml.gz) and (?<=pubmed23n)(?<=.xml.gz), but with no luck.
The strings sit in a list, and I want to delete some of them based on a number. My current approach is:
import re

def remove_item(keep_from: int):
    ptn1 = r'^pubmed23n'
    ptn2 = r'.xml.gz$'
    # some func to get and sort the list of file names with pattern `pubmed23n[0-9]*.xml.gz` and assign to `filename_list`
    # example of filename_list: ['pubmed23n0001.xml.gz','pubmed23n0002.xml.gz','pubmed23n0003.xml.gz']
    filename_cut_list = [re.sub(ptn1, '', i) for i in filename_list]
    filename_cut_list = [re.sub(ptn2, '', i) for i in filename_cut_list]
    filename_cut_list = [int(i) for i in filename_cut_list]
    if filename_cut_list.index(keep_from) != 0:
        del filename_list[:filename_cut_list.index(keep_from)+1]
Because the list could contain thousands of elements, I want to do this in a single pass. But if there is a simpler way to delete the items whose number is smaller than a given value, I would be more than glad to hear it!
Answers:
You’re replacing fixed strings, so you don’t need a regular expression. Just use str.replace().
filename_cut_list = [int(i.replace('pubmed23n', '').replace('.xml.gz', '')) for i in filename_list]
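The same str.replace() idea can be extended to the whole task in one pass over the list. This is only a sketch: the helper name drop_before is made up for illustration, and it assumes that "keep_from" means keeping the file with that number and everything after it.

```python
def drop_before(filename_list, keep_from):
    """Return the file names whose number is >= keep_from (hypothetical helper)."""
    kept = []
    for name in filename_list:
        # Strip the fixed prefix and suffix, then compare the remaining digits.
        number = int(name.replace('pubmed23n', '').replace('.xml.gz', ''))
        if number >= keep_from:
            kept.append(name)
    return kept


files = ['pubmed23n0001.xml.gz', 'pubmed23n0002.xml.gz', 'pubmed23n0003.xml.gz']
print(drop_before(files, 2))  # ['pubmed23n0002.xml.gz', 'pubmed23n0003.xml.gz']
```

Filtering by value rather than by index also means the list does not have to be sorted first.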
What @Barmar said is correct. But if you really want to use a regex, you can use the re.search() function to do this. re.search() takes two required arguments: the regex pattern and the string you want to apply it to.
So in this case:
filename_cut_list = [re.search(r'^pubmed23n([0-9]*)\.xml\.gz$', x).group(1) for x in filename_list]
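As for the lookaround attempts in the question: lookarounds are zero-width, so a pattern made only of them matches nothing that re.sub() could remove, and two lookaheads written back to back must both succeed at the same position, which cannot happen here. A rough sketch of two single-call alternatives, assuming the file names always follow the pubmed23n[0-9]*.xml.gz pattern:

```python
import re

name = 'pubmed23n0001.xml.gz'

# One re.sub() call: the alternation removes the prefix and the suffix,
# because re.sub() replaces every non-overlapping match.
stripped = re.sub(r'^pubmed23n|\.xml\.gz$', '', name)

# Lookarounds check the surrounding text without consuming it,
# so the match itself is only the digits.
digits = re.search(r'(?<=pubmed23n)[0-9]*(?=\.xml\.gz)', name).group(0)

print(stripped, digits)  # 0001 0001
```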
Hope this helps.
EDIT
The re.search() version will fail if the pattern is not found in a filename: re.search() returns None in that case, so calling .group(1) raises an AttributeError.
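One possible way to guard against that is to check the match object before using it. The sketch below makes a few assumptions that are not in the original: the helper name extract_numbers is made up, non-matching names are silently skipped, and the digits are matched with + instead of * so that int() never receives an empty string.

```python
import re

# Compile once; the dots are escaped so they only match literal '.' characters.
PATTERN = re.compile(r'pubmed23n([0-9]+)\.xml\.gz')


def extract_numbers(filename_list):
    """Return the numeric part of each matching name, skipping the rest."""
    numbers = []
    for name in filename_list:
        match = PATTERN.fullmatch(name)
        if match is None:
            continue  # not a pubmed23n file; skipping it is an assumption
        numbers.append(int(match.group(1)))
    return numbers


print(extract_numbers(['pubmed23n0001.xml.gz', 'README.txt', 'pubmed23n0003.xml.gz']))  # [1, 3]
```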