Extract substrings from a list of strings, where substrings are bounded by consistent characters

Question:

I have a list of lists of strings containing the taxonomies of different bacterial species. Each list has a consistent format:

['d__domain;p__phylum;c__class;o__order;f__family;g__genus;s__species', '…', '…']

I'm trying to pull the genus out of each string in each list so that I can find the unique genera. My idea was to write nested for loops that split each string by ';', use a list comprehension to find the element containing 'g__', then lstrip off the g__ and append the genus name to a new, complementary list. I attempted this in the code below:

finalList = []

for i in range(32586):
    
    outputList = []
    j = 0
    for j in taxonomyData.loc[i,'GTDB Taxonomy'][j]:
        
        ## Access Taxonomy column of Pandas dataframe and split by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')
        
        ## Use list comprehension to pull out genus by g__
        genus = [x for x in taxa if 'g__' in x]
        if genus == [] :
            genus = 'None'
            
        ## lstrip off g__
        else:
            genus = genus[0].lstrip('g__')
            
            ## Append genus to new list of genera
            outputList.append(genus)
    ## Append new list of genera to larger list    
    finalList.append(outputList)
    print(finalList)
    break
    
    print(genus)

I tested this for loop and it successfully pulled the genus out of the first string of the first list, but when I let the for loop run, it skipped to the next list, leaving all the other items in the first list. Any advice on how I can get this loop to iterate through all the strings in the first list before moving on to subsequent lists?

Solved

Final Code:

finalList = []

for i in range(32586):

    ## Access Taxonomy column of Pandas dataframe; rows with no taxonomy get 'None'
    if pd.isna(taxonomyData.loc[i,'GTDB Taxonomy']):
        genus_unique = ['None']
        finalList.append(genus_unique)
    else:
        ## Split the taxonomy string by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')

        ## Use a set comprehension to pull out the unique genera by g__
        genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
        genus_unique = list(genus_unique)

        ## Append new list of genera to larger list
        finalList.append(genus_unique)
print(finalList)
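
The same logic could also be wrapped in a small helper, which avoids hard-coding the row count. This is only a sketch under assumptions: extract_genera and the sample rows are hypothetical, the isinstance check stands in for pd.isna (a missing cell arrives as NaN or None, neither of which is a string), and the result is sorted so the order is predictable, which the final code does not do.

```python
def extract_genera(taxonomy):
    """Return the unique genus names found in one ';'-delimited taxonomy string."""
    # Missing cells (NaN or None) are not strings; use the same 'None' placeholder.
    if not isinstance(taxonomy, str):
        return ['None']
    # Keep only fields that start with 'g__' and slice off that 3-character prefix.
    return sorted({field[3:] for field in taxonomy.split(';') if field.startswith('g__')})

# Hypothetical sample rows standing in for taxonomyData['GTDB Taxonomy'].
rows = [
    'd__Bacteria;g__Escherichia;s__Escherichia coli',
    float('nan'),
    'd__Bacteria;g__Bacillus;s__Bacillus subtilis',
]

finalList = [extract_genera(row) for row in rows]
print(finalList)  # [['Escherichia'], ['None'], ['Bacillus']]
```

With the real dataframe, iterating over taxonomyData['GTDB Taxonomy'] in place of rows should visit every row without needing range(32586).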

Answers:

Here’s how you can get unique genus entries from a list with a single set comprehension:

taxa = ['d__abc', 'g__def', 'p__ghi', 'g__jkl', 'd__abc', 'g__def']
genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
print(genus_unique)

Result:

{'def', 'jkl'}

You can also convert it into a list afterwards with list(genus_unique) if you need that.
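
A side note on why the slice x[3:] is safer than the lstrip('g__') used in the question: str.lstrip treats its argument as a set of characters to remove, not as a literal prefix. The sketch below uses a hypothetical lowercase value to make the effect visible; capitalized genus names would usually not trigger it.

```python
label = 'g__gordonia'  # hypothetical field, lowercase on purpose

# lstrip removes every leading 'g' or '_' character, including the name's own 'g'.
stripped = label.lstrip('g__')

# Slicing after a startswith check removes exactly the 3-character prefix.
sliced = label[3:] if label.startswith('g__') else label

print(stripped)  # ordonia
print(sliced)    # gordonia
```

On Python 3.9 or later, label.removeprefix('g__') gives the same result as the slice.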

Answered By: Robert Haas