How can I extract a portion of text from all lines of a file?
Question:
I have these sequences:
0,<|endoftext|>ERRDLLRFKH:GAGCGCCGCGACCTGTTACGATTTAAACAC<|endoftext|>
1,<|endoftext|>RRDLLRFKHG:CGCCGCGACCTGTTACGATTTAAACACGGC<|endoftext|>
2,<|endoftext|>RDLLRFKHGD:CGCGACCTGTTACGATTTAAACACGGCGAC<|endoftext|>
3,<|endoftext|>DLLRFKHGDS:GACCTGTTACGATTTAAACACGGCGACAGT<|endoftext|>
And I’d like to get only the amino acid sequences, like this:
ERRDLLRFKH:
RRDLLRFKHG:
RDLLRFKHGD:
DLLRFKHGDS:
I have written this script so far:
with open("example_val.txt") as f:
    for line in f:
        if line.startswith(""):
            line = line[:-1]
        print(line.split(":", 1))
However, I only get the original sequences back. Please give me some advice.
Answers:
Regex solution:
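import re
with open("example_val.txt") as f:
    # Keep the list of matches instead of discarding it
    sequences = re.findall(r"(?<=>)[a-zA-Z]*:", f.read())
print("\n".join(sequences))  # One sequence per line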
Regex Explanation:
(?<=>) is a positive lookbehind which requires the > character immediately before our match
[a-zA-Z]*: matches zero or more of the characters in a-z and A-Z, followed by the colon at the end
Test in Regex101 : regex101.com/r/qVGCYF/1
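To try the pattern without the file, here is a minimal sketch that runs it on two of the sample lines from the question, held in a string. The names sample and matches are only for illustration.

```python
import re

# Two sample lines from the question, held in memory instead of a file
sample = (
    "0,<|endoftext|>ERRDLLRFKH:GAGCGCCGCGACCTGTTACGATTTAAACAC<|endoftext|>\n"
    "1,<|endoftext|>RRDLLRFKHG:CGCCGCGACCTGTTACGATTTAAACACGGC<|endoftext|>\n"
)

# The lookbehind requires ">" just before the letters; the colon is part of the match
matches = re.findall(r"(?<=>)[a-zA-Z]*:", sample)
print(matches)  # ['ERRDLLRFKH:', 'RRDLLRFKHG:']
```

The > that closes each line is followed by a newline rather than letters and a colon, so it produces no match.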
First, remember that storing something (e.g. in a list) is not the same as printing it. If you need to use your amino acid sequences later, store them all in a list as you parse your file. If you just want to display them and do nothing else, it’s fine to print.
You have a bunch of ways to do this:
- Use a regular expression with a lookbehind, like johann’s answer
- Use a CSV reader to isolate just the second column of your comma-separated text file, and then slice the string, since you know the value you want starts at index 13 and ends at index 23 (the colon)
import csv
sequences = []  # Create an empty list to contain all sequences
with open("example_val.txt") as f:
    reader = csv.reader(f)
    for row in reader:
        element = row[1]  # Get the second element in the row
        seq = element[13:24]  # Slice the element (indices 13 to 23 inclusive)
        sequences.append(seq)  # Append to the list
        print(seq)  # Or print the current sequence
- Find the index of <|endoftext|> in the string. Relative to this index i, you know that your sequence starts at the index i + len('<|endoftext|>'), and ends at i + len('<|endoftext|>') + 10
sequences = []  # Create an empty list to contain all sequences
with open("example_val.txt") as f:
    for line in f:
        i = line.find('<|endoftext|>')  # Index of the first marker
        seq_start = i + len('<|endoftext|>')
        seq_end = seq_start + 10  # Index of the colon
        seq = line[seq_start:seq_end+1]  # Slice the line, colon included
        sequences.append(seq)  # Append to the list
        print(seq)  # Or print the current sequence
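If the sequences are not always ten residues long, one possible variation on the same idea is to search for the colon instead of adding a fixed offset. This is only a sketch: the helper name extract_sequence and the constant MARKER are hypothetical, and it assumes every sequence is followed by a colon as in the sample data.

```python
MARKER = "<|endoftext|>"

def extract_sequence(line):
    """Return the text between the first marker and the next colon, colon included."""
    i = line.find(MARKER)
    if i == -1:
        return None  # Marker not found on this line
    seq_start = i + len(MARKER)
    colon = line.find(":", seq_start)
    if colon == -1:
        return None  # No colon after the marker
    return line[seq_start:colon + 1]

print(extract_sequence("3,<|endoftext|>DLLRFKHGDS:GACCTGTTACGATTTAAACACGGCGACAGT<|endoftext|>"))
# DLLRFKHGDS:
```

Returning None for lines without the marker or colon lets the caller skip malformed lines instead of appending a wrong slice.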