Regular expression matching a multiline block of text
Question:
I’m having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is (\n is a newline):
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I’d like to capture two things:
- the some Varying TEXT part
- all lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I’ve tried a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
…and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can’t seem to catch the 4-5 lines of uppercase text.
I’d like match.group(1) to be some Varying Text and group(2) to be line1+line2+line3+etc. until the empty line is encountered.
If anyone’s curious, it’s supposed to be a sequence of amino acids that make up a protein.
Answers:
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
- The first character (^) means “starting at the beginning of a line”. Be aware that it does not match the newline itself (same for $: it means “just before a newline”, but it does not match the newline itself).
- Then (.+?)\n\n means “match as few characters as possible (any character except a newline is allowed) until you reach two newlines”. The result (without the newlines) is put in the first group.
- [A-Z]+\n means “match as many upper case letters as possible until you reach a newline”. This defines what I will call a textline.
- ((?:textline)+) means “match one or more textlines but do not put each line in a group”. Instead, put all the textlines in one group.
- You could add a final \n in the regular expression if you want to enforce a double newline at the end.
- Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
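As a rough sketch of that last substitution, the snippet below builds the pattern with every \n swapped for (?:\n|\r\n?) and wraps the loop in a function. The names NL and parse_records and the sample string are made up for illustration. One caveat: in MULTILINE mode Python's ^ only treats \n as a line boundary, so text that uses bare \r line endings would likely still need normalizing first.

```python
import re

# NL is a hypothetical helper name: one line break in any of the three styles.
NL = r"(?:\n|\r\n?)"

# Same shape as ^(.+?)\n\n((?:[A-Z]+\n)+), with each \n replaced by NL.
rx_sequence = re.compile(
    r"^(.+?)" + NL + NL + r"((?:[A-Z]+" + NL + r")+)",
    re.MULTILINE,
)


def parse_records(text):
    """Return a list of (title, sequence) tuples with line breaks removed."""
    records = []
    for match in rx_sequence.finditer(text):
        title, sequence = match.groups()
        # \s+ strips whichever newline characters were captured in group 2.
        records.append((title.strip(), re.sub(r"\s+", "", sequence)))
    return records


# Made-up sample using Windows-style \r\n line endings.
sample = "Title one\r\n\r\nAAAB\r\nCCCD\r\n\r\nTitle two\r\n\r\nEEEF\r\n"
print(parse_records(sample))
```

The same function should give identical results if the sample uses plain \n line endings instead.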
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you’re expecting the ^ and $ anchors to match linefeeds, but they don’t. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren’t certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don’t want to use the DOTALL modifier here; you’re relying on the fact that the dot matches everything except newlines.
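To see the linefeed-only pattern in action, here is a small sketch run against made-up text in the layout the question describes (title, blank line, uppercase lines, blank line). The sample string and variable names are assumptions for illustration. Note that group 2 begins with a newline, because each sequence line is matched together with the newline that precedes it.

```python
import re

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Hypothetical sample: two records separated by a blank line.
text = "first title\n\nAAAA\nBBBB\n\nsecond title\n\nCCCC\nDDDD\n"

records = []
for m in pattern.finditer(text):
    title = m.group(1)
    # Group 2 looks like "\nAAAA\nBBBB"; drop the newlines to join the lines.
    sequence = m.group(2).replace("\n", "")
    records.append((title, sequence))

print(records)
# [('first title', 'AAAABBBB'), ('second title', 'CCCCDDDD')]
```

The repetition stops at the blank line because .+ needs at least one non-newline character, which is exactly why DOTALL would break it.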
My preference.
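lineIter = iter(aFile)
# Skip ahead to the header line, which starts with ">"
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
# The line right after the header must be blank
assert len(next(lineIter).strip()) == 0
acids = []
# Collect sequence lines until the next blank line
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)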
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.
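Since the question mentions a few hundred repeated blocks, the same line-by-line idea could be wrapped in a generator. This is only a sketch: iter_records is a made-up name, and it assumes each header starts with ">" and each record is terminated by a blank line (or the end of the input).

```python
def iter_records(lines):
    """Yield (title, sequence) pairs from an iterable of text lines."""
    line_iter = iter(lines)
    for line in line_iter:
        if not line.startswith(">"):
            continue  # ignore anything that is not a header
        title = line[1:].strip()
        acids = []
        # Keep pulling from the same iterator so the outer loop resumes
        # right after this record.
        for line in line_iter:
            stripped = line.strip()
            if not stripped:
                if acids:
                    break  # blank line after the sequence ends the record
                continue  # blank line between header and sequence
            acids.append(stripped)
        yield title, "".join(acids)


# Made-up sample lines; a file object opened in text mode would work the same way.
sample_lines = ["> a\n", "\n", "AAA\n", "BBB\n", "\n", "> b\n", "\n", "CCC\n"]
print(list(iter_records(sample_lines)))
# [('a', 'AAABBB'), ('b', 'CCC')]
```

Because it is a generator, it never needs to hold more than one record in memory at a time.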
If each file only has one sequence of amino acids, I wouldn’t use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest
    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
The following is a regular expression matching a multiline block of text:
import re
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', input)
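A quick way to check what each group of this pattern captures is to run it on a tiny made-up string. The sample text below is an assumption, and the variable is called sample rather than input to avoid shadowing the built-in. The pattern expects some text after startText on the same line, at least one further line, and no blank lines inside the block.

```python
import re

# Hypothetical input: startText opens the block, endText closes it on a later line.
sample = "startText first\nmiddle line\nlast endText"

# The raw string keeps \n intact for the regex engine.
result = re.findall(r"(startText)(.+)((?:\n.+)+)(endText)", sample)
print(result)
# [('startText', ' first', '\nmiddle line\nlast ', 'endText')]
```

With several groups in the pattern, findall returns one tuple per match, one element per group.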
It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:
"(?m)^A complete line$"
This is handy in unit tests, for example with assertRaisesRegex. That way, you don’t need to import re or compile your regex before calling the assert.
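A minimal sketch of that unit-test usage follows. The function parse_header, its error message, and the test class are all hypothetical, invented only to give assertRaisesRegex a multiline message to search.

```python
import unittest


def parse_header(line):
    """Hypothetical function under test: always fails with a multiline message."""
    raise ValueError("Could not parse input\nA complete line\nmore detail")


class InlineFlagTest(unittest.TestCase):
    def test_message_contains_full_line(self):
        # (?m) switches on MULTILINE inside the pattern itself, so ^ and $
        # anchor at line boundaries within the exception message.
        with self.assertRaisesRegex(ValueError, "(?m)^A complete line$"):
            parse_header("bad")
```

Without the (?m) prefix, ^ and $ would only anchor at the start and end of the whole message, and this assertion would fail.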