Selecting all lines/strings that fall between patterns in a text file
Question:
Given a text file that looks like this when loaded:
>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp
How can I extract all lines that fall between the lines that contain ‘>’, including the last lines, which are not followed by a closing ‘>’?
For example, the result should look like this
result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']
I realize that what I did won’t work because it’s looking for text between each newline and ‘>’. Running this just gives me empty strings.
import re

def findtext(inputtextfile, start, end):
    try:
        pattern = rf'{start}(.*?){end}'
        return re.findall(pattern, inputtextfile)
    except ValueError:
        return -1

result = findtext(inputtextfile, "\n", ">")
Answers:
Maybe try splitting on the rows that start with >. That way you get back a list of the data in between, and you can join each piece after removing the \n characters.
s = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp"""
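import re

def findtext(inputtextfile, start, end):
    try:
        # Split on every header row (from start up to and including end),
        # drop empty pieces, then remove the remaining newlines from each piece.
        parts = re.split(f'{start}.*{end}', inputtextfile)
        return [x.replace('\n', '') for x in filter(None, parts)]
    except ValueError:
        return -1

Trying it with your provided case:
findtext(s, '>', '\n')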
Output
['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
'SSSSSSSSSSS',
'pppppppppppppppppppppppppppppppppppppppppp']
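As a side note on why the attempt in the question returned empty strings, here is a small sketch on a made-up miniature input (the `sample` string and variable names are mine, not from the question). It suggests the cause is that `.` does not cross newlines by default, and that even with `re.DOTALL` the last block would be dropped because no ‘>’ follows it.

```python
import re

sample = ">h1\nAAA\nAAA\n>h2\nBBB"  # made-up miniature of the input

# "." does not match a newline by default, so the only match is a newline
# that is immediately followed by ">", and its captured group is empty.
without_dotall = re.findall(r"\n(.*?)>", sample)
print(without_dotall)  # ['']

# With re.DOTALL the first block is captured, but the last block is still
# missed because no ">" follows it.
with_dotall = re.findall(r"\n(.*?)>", sample, flags=re.DOTALL)
print(with_dotall)  # ['AAA\nAAA\n']
```

That is why splitting on the header rows tends to be simpler here than matching between two delimiters.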
One option is to use re.split on the lines that start with > and then remove all whitespace characters from the parts.
text = (">rice1 1ALBRGHAER\n"
        "NNNNNNNNNNNNNNNNNNNNN\n"
        "NNNNNNNNNNNNNNNNNNNNN\n"
        ">peanuts2 2LAEKaq\n"
        "SSSSSSSSSSS\n"
        ">OIL3 3hkasUGSV\n"
        "ppppppppppppppppppppp\n"
        "ppppppppppppppppppppp")
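import re

def findtext(inputtextfile):
    # With re.M, ^ matches at the start of every line, so this matches each header line.
    pattern = r"^>.*"
    try:
        # Split on the header lines, skip empty parts, and strip all whitespace (including newlines).
        return [re.sub(r"\s+", "", s) for s in re.split(pattern, inputtextfile, flags=re.M) if s]
    except ValueError:
        return -1

print(findtext(text))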
Output (formatted a bit)
[
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
'SSSSSSSSSSS',
'pppppppppppppppppppppppppppppppppppppppppp'
]
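If the file is large, or you would rather avoid a regex, the same grouping can be sketched as a plain loop over the lines. This is only an illustration under a few assumptions: the name `collect_blocks` and its `marker` parameter are hypothetical, any lines before the first header are ignored, and headers with no data lines are skipped.

```python
def collect_blocks(lines, marker=">"):
    """Join the lines that follow each header line into one string per header."""
    blocks = []
    current = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(marker):
            # A new header closes the previous block, if it holds any data.
            if current:
                blocks.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    # The last block has no closing header, so flush it after the loop.
    if current:
        blocks.append("".join(current))
    return blocks


demo = ">rice1 1ALBRGHAER\nNNN\nNNN\n>peanuts2 2LAEKaq\nSSS\n>OIL3 3hkasUGSV\nppp\nppp"
print(collect_blocks(demo.splitlines()))  # ['NNNNNN', 'SSS', 'pppppp']
```

Because it accepts any iterable of lines, an open file object could be passed in directly instead of a string that was read into memory first.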