Create a new variable instance each time I split a string in Python

Question:

I have a string in a variable x that includes ">" symbols. I would like to create a new variable each time the string is split at the ">" symbol.

The string I have in the variable x is as such (imported from a simple .txt file):

>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA

The expected output is:

print(var_1)

>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA

print(var_2)

>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG

print(var_3)

>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA

To achieve this I am using a simple for loop:

count = 3
for v in range(0, count+1):
    globals()[f"var_{v}"] = x.split('>')
print(var_3)

This way I successfully get a new variable for each count (count is equal to the number of ">" symbols).

However the output I am currently getting is:

print(var_1)
        
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
            
print(var_2)

['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
            
print(var_3)
        
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']

How can I troubleshoot the for loop in order to achieve the expected output?

Asked By: d.cio

||

Answers:

I would use re.findall here:

import re

inp = """>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"""

vars = re.findall(r'>[^>]+', inp)
print(vars)

# ['>AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n',
#  '>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n',
#  '>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']

Note that re.findall returns all matches inside a single neat list, which can then be iterated or accessed later as needed.
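As a rough sketch of what accessing that list later could look like: the names records and first are only illustrative, and the input is shortened to the first two records from the question. Each match is stripped of its trailing newline and then read by index or in a loop.

```python
import re

# First two records from the question, joined with real newlines.
inp = (
    ">AF1785813\n"
    "GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n"
    ">AF1785815\n"
    "GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n"
)

# One list entry per ">" block; strip() drops the trailing newline.
records = [match.strip() for match in re.findall(r'>[^>]+', inp)]

# Index into the list instead of creating var_1, var_2, ...
first = records[0]

# Or walk over all records, numbering them from 1.
for number, record in enumerate(records, start=1):
    print(f"record {number}:")
    print(record)
```

Keeping the records in one list means the number of ">" symbols does not have to be known in advance.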

Answered By: Tim Biegeleisen

Try to iterate the split result:

for i, token in enumerate(x.split('>')):
    # do not include empty string
    if token:
        globals()[f"var_{i}"] = token

# then deal with the vars
print(var_1)
print(var_2)
..
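A possible variation on the loop above, assuming a plain dictionary is acceptable in place of separate global variables. The dictionary name records and its keys are hypothetical, the input is shortened to the first two records from the question, and the ">" that split removes is added back so each entry looks like the expected output.

```python
# First two records from the question.
x = """>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
"""

records = {}

# split('>') yields an empty first piece; filter it out before numbering.
tokens = [token for token in x.split('>') if token.strip()]

for i, token in enumerate(tokens, start=1):
    # Restore the ">" removed by split and drop the trailing newline.
    records[f"var_{i}"] = ">" + token.strip()

print(records["var_1"])
print(records["var_2"])
```

A dictionary keeps the same var_1, var_2 naming but avoids writing into globals(), so the entries can be listed, counted, or looped over.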
Answered By: dimnnv

Use a regular expression that matches the > character, the rest of that line, and then everything that follows up to the next > character or the end of the string.

[^\n]*: This matches zero or more characters that are not newline characters.

[^>]*: This matches zero or more characters that are not the > character.
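To see what each character class contributes, here is a small sketch using the first two records from the question. The names headers and blocks are only illustrative.

```python
import re

# First two records from the question.
x = (
    ">AF1785813\n"
    "GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n"
    ">AF1785815\n"
    "GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG"
)

# ">" followed by [^\n]* stops at the end of the header line.
headers = re.findall(r">[^\n]*", x)
print(headers)  # ['>AF1785813', '>AF1785815']

# Adding \n[^>]* continues past the newline up to the next ">" or the end.
blocks = re.findall(r">[^\n]*\n[^>]*", x)
for block in blocks:
    print(block.strip())
```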

import re

x = ">AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"

substrings = re.findall(r">[^\n]*\n[^>]*", x)

for i, substring in enumerate(substrings, start = 1):
    globals()[f"var_{i}"] = substring

output:

>>> print(var_1)
>>> print(var_2)
>>> print(var_3)

>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA

>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG

>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
Answered By: JayPeerachai