Create a new variable instance each time I split a string in Python
Question:
I have a string in a variable x that includes ">" symbols. I would like to create a new variable for each piece when the string is split at the ">" symbol.
The string in x looks like this (imported from a simple .txt file):
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
The expected output is:
print(var_1)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
print(var_2)
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
print(var_3)
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
To achieve this I am using a simple for loop:
count = 3
for v in range(0, count+1):
    globals()[f"var_{v}"] = x.split('>')
print(var_3)
This way I do get a new variable for each count (count is equal to the number of ">" symbols).
However, the output I am currently getting is:
print(var_1)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_2)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_3)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
How can I troubleshoot the for loop in order to achieve the expected output?
Answers:
I would use re.findall here:
import re
inp = """>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"""
vars = re.findall(r'>[^>]+', inp)
print(vars)
# ['>AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n',
#  '>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n',
#  '>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
Note that re.findall returns all matches in a single list, which can then be iterated over or indexed as needed.
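As a minimal sketch of working with that list directly instead of creating numbered variables (the name records and the shortened sequences are illustrative, not from the question):

```python
import re

# Shortened, made-up sequences in the same ">header / sequence" layout.
inp = """>AF1
GTGT
>AF2
ACGT
>AF3
TTTT"""

# Each match starts at a '>' and runs up to, but not including, the next '>'.
# strip() drops the trailing newline that all but the last match carry.
records = [match.strip() for match in re.findall(r'>[^>]+', inp)]

# Index or iterate as needed; records[0] plays the role of var_1.
for i, record in enumerate(records, start=1):
    print(f"record {i}:")
    print(record)
```

Keeping the pieces in one list avoids having to know in advance how many ">" symbols the file contains.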
Try to iterate the split result:
for i, token in enumerate(x.split('>')):
    # do not include empty string
    if token:
        globals()[f"var_{i}"] = token
# then deal with the vars
print(var_1)
print(var_2)
..
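Note that the pieces returned by split no longer carry their leading ">". A minimal sketch, assuming you want the ">" back and are happy to use a dictionary rather than module-level names (pieces and the shortened sequences are hypothetical):

```python
# Shortened, made-up input in the same layout as the question's string.
x = ">AF1\nGTGT\n>AF2\nACGT\n>AF3\nTTTT"

pieces = {}

# The first token is the empty text before the first '>', so filter it out.
tokens = [token for token in x.split('>') if token.strip()]

for i, token in enumerate(tokens, start=1):
    # Restore the delimiter that split() removed and drop the trailing newline.
    pieces[f"var_{i}"] = '>' + token.rstrip('\n')

print(pieces["var_2"])
```

Looking a value up with pieces["var_2"] reads much like a variable name, but the keys can be listed and counted, which globals() makes awkward.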
Use a regular expression that matches the > character, the rest of that line, and everything after it up to the next > character or the end of the string.
[^\n]*: This matches zero or more characters that are not newline characters.
[^>]*: This matches zero or more characters that are not the > character.
import re
x = ">AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"
substrings = re.findall(r">[^\n]*\n[^>]*", x)
for i, substring in enumerate(substrings, start=1):
    globals()[f"var_{i}"] = substring
output:
>>> print(var_1)
>>> print(var_2)
>>> print(var_3)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
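If the header and the sequence are needed separately, the same pattern can be given capture groups. This is only a sketch under the assumption that every record has exactly one header line; sequences and the shortened input are made up for illustration:

```python
import re

# Made-up input; the first record's sequence is wrapped over two lines.
x = ">AF1\nGTGT\nACGT\n>AF2\nTTTT\n"

# With two capture groups, re.findall returns (header, body) tuples.
pairs = re.findall(r">([^\n]*)\n([^>]*)", x)

# Join wrapped sequence lines by removing all whitespace inside each body.
sequences = {header: "".join(body.split()) for header, body in pairs}

print(sequences["AF1"])
```

Keying the dictionary by header means each sequence can be looked up by its identifier rather than by its position in the file.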