Is it possible to use a regex to match different parts of one word in Python?


I did a little research, and it seems this is called a non-consuming regex, but it does not work with re.sub(). What I want to achieve is to delete parts of a string that contains no spaces. The original strings follow the pattern pubmed23n[0-9]*.xml.gz, and I want to keep only the [0-9]* part.





Right now I need two steps to achieve that: substituting ^pubmed23n and then .xml.gz, or substituting (^pubmed23n)|(.xml.gz) twice. I tried (?=pubmed23n)(?=.xml.gz) and (?<=pubmed23n)(?<=.xml.gz), but with no luck.

The strings sit in a list, and I want to delete some of them based on a number. My current approach is:

def remove_item(keep_from: int):
    ptn1 = r'^pubmed23n'
    ptn2 = r'.xml.gz$'
    # some func to get and sort the list of file names with pattern `pubmed23n[0-9]*.xml.gz` and assign to `filename_list`
    # example of filename_list: ['pubmed23n0001.xml.gz','pubmed23n0002.xml.gz','pubmed23n0003.xml.gz']
    filename_cut_list = [re.sub(ptn1,'',i) for i in filename_list]
    filename_cut_list = [re.sub(ptn2,'',i) for i in filename_cut_list]
    filename_cut_list = [int(i) for i in filename_cut_list]
    if filename_cut_list.index(keep_from) != 0:
        del filename_list[:filename_cut_list.index(keep_from)+1]

Because the list could have thousands of elements, I want to do this in a single pass. But if there is a simpler way to delete the items whose number is smaller than a given one, I would be more than glad to hear it!

Asked By: jimmymcheung



You’re replacing fixed strings, so you don’t need a regular expression. Just use str.replace().

filename_cut_list = [int(i.replace('pubmed23n', '').replace('.xml.gz', '')) for i in filename_list]
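
Building on the fixed-string idea, the whole filtering step can probably be done without re at all. What follows is only a minimal sketch, assuming Python 3.9+ (for str.removeprefix() and str.removesuffix()) and assuming that "keep from N" means keeping every file whose number is at least N. The helper names file_number and keep_from_number are hypothetical and not from the question.

```python
PREFIX = "pubmed23n"
SUFFIX = ".xml.gz"


def file_number(filename: str) -> int:
    """Return the numeric part of a name like 'pubmed23n0002.xml.gz'."""
    # removeprefix/removesuffix strip only at the ends of the string
    # (Python 3.9+), unlike str.replace(), which replaces every occurrence.
    return int(filename.removeprefix(PREFIX).removesuffix(SUFFIX))


def keep_from_number(filename_list: list[str], keep_from: int) -> list[str]:
    """Assumed semantics: keep files numbered keep_from or higher."""
    return [name for name in filename_list if file_number(name) >= keep_from]


files = ["pubmed23n0001.xml.gz", "pubmed23n0002.xml.gz", "pubmed23n0003.xml.gz"]
print(keep_from_number(files, 2))
```

This builds a new list in a single pass instead of looking up an index and deleting a slice, so it does not depend on the list being sorted.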
Answered By: Barmar

What @Barmar said is correct. But if you really want to use a regex, you can use the re.search() function to do this. It takes two arguments: the regex pattern that you want and the string you want to apply the regex to.

So in this case:

filename_cut_list = [re.search(r'^pubmed23n([0-9]*)\.xml\.gz$', x).group(1) for x in filename_list]

Hope this helps.


Note that this will fail if the pattern is not found in a filename: the match is then None, and calling .group(1) on it raises an AttributeError.
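
One way to avoid that failure is to check the match object before calling .group(). The following is a minimal sketch, assuming that filenames which do not fit the pattern should simply be skipped (the question does not say). The helper name extract_numbers is hypothetical, and requiring at least one digit with [0-9]+ is also an assumption, made so that int() never receives an empty string. The last lines show that a single re.sub() call with an alternation can strip both fixed parts at once.

```python
import re

# Compiled once so the pattern is not looked up again for each filename.
# The dots are escaped so they match a literal '.' only.
PATTERN = re.compile(r"pubmed23n([0-9]+)\.xml\.gz")


def extract_numbers(filename_list):
    """Return the numeric part of each matching filename; skip the rest."""
    numbers = []
    for name in filename_list:
        match = PATTERN.fullmatch(name)  # None when the name does not fit
        if match is not None:
            numbers.append(int(match.group(1)))
    return numbers


files = ["pubmed23n0001.xml.gz", "README.txt", "pubmed23n0003.xml.gz"]
print(extract_numbers(files))  # [1, 3]

# re.sub() replaces every non-overlapping match, so one call with an
# alternation removes both the prefix and the suffix.
stripped = re.sub(r"^pubmed23n|\.xml\.gz$", "", "pubmed23n0002.xml.gz")
print(stripped)  # 0002
```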

Answered By: Rkolay