Regular expression matching a multiline block of text

Question:

I’m having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is (\n is a newline)

some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).

I’d like to capture two things:

  • the some Varying TEXT part
  • all lines of uppercase text that come two lines below it in one
    capture (I can strip out the newline characters later).

I’ve tried a few approaches:

re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines

…and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can’t seem to catch the 4-5 lines of uppercase text.
I’d like match.group(1) to be some Varying Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.

If anyone’s curious, it’s supposed to be a sequence of amino acids that make up a protein.

Asked By: Jan

Answers:

find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

1 = some_varying_text

2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: this can be shortened to matches = regex.findall(text)

for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
Answered By: Jason Coon

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means “starting at the beginning of a line”. Be aware that it does not match the newline itself (same for $: it means “just before a newline”, but it does not match the newline itself).
  • Then (.+?)\n\n means “match as few characters as possible (all characters are allowed) until you reach two newlines”. The result (without the newlines) is put in the first group.
  • [A-Z]+\n means “match as many upper case letters as possible until you reach a newline”. This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
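
As a rough illustration of the last bullet, here is a minimal sketch of the newline-tolerant variant. The helper name NL and the sample text are made up for this example, and only Windows-style (\r\n) line endings are exercised here:

```python
import re

# Every \n of the original pattern is replaced by this newline alternative.
NL = r"(?:\n|\r\n?)"
rx_sequence = re.compile(
    r"^(.+?)" + NL + NL + r"((?:[A-Z]+" + NL + r")+)",
    re.MULTILINE,
)

# Hypothetical sample using \r\n line endings.
sample = (
    "Some varying text1\r\n\r\nAAABBB\r\nCCCDDD\r\n\r\n"
    "Some varying text 2\r\n\r\nEEEFFF\r\n"
)

# Strip the title and remove all whitespace (including \r and \n)
# from the captured sequence block.
records = [
    (title.strip(), re.sub(r"\s+", "", sequence))
    for title, sequence in rx_sequence.findall(sample)
]
print(records)
```

The same pattern still matches text that uses plain \n line endings, since \n is the first alternative in NL.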
Answered By: MiniQuark

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you’re expecting the ^ and $ anchors to match linefeeds, but they don’t. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren’t certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don’t want to use the DOTALL modifier here; you’re relying on the fact that the dot matches everything except newlines.
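
To make the two points above concrete, here is a small sketch (the sample string is invented for the demonstration). It shows that ^ and $ are zero-width in multiline mode, and what happens to the dot once DOTALL is switched on:

```python
import re

text = "first line\nsecond line\n"

# ^ and $ only mark positions: the matched strings contain no newline.
lines = re.findall(r"^.+$", text, re.MULTILINE)

# With DOTALL the dot also matches newlines, so the greedy .+ swallows
# everything up to the end of the string in a single match.
greedy = re.findall(r"^.+$", text, re.MULTILINE | re.DOTALL)

print(lines)
print(greedy)
```

Without DOTALL each line comes back separately; with it, the whole text comes back as one match, newlines included.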

Answered By: Alan Moore

My preference.

lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
         someVaryingText= line
         break
assert len( next(lineIter).strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )

At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.

I find this less frustrating (and more flexible) than multiline regexes.
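
If the file holds many records, the same line-by-line idea can be wrapped in a generator. This is only a sketch: iter_records and the sample data are hypothetical names, and it assumes every record starts with a header line beginning with ">", as in the snippet above. Unlike that snippet, it treats blank lines as separators to skip rather than as terminators.

```python
def iter_records(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, acids = None, []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(acids)
            header, acids = stripped[1:].strip(), []
        elif stripped and header is not None:
            # Non-blank line after a header: part of the sequence.
            acids.append(stripped)
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header, "".join(acids)


# Hypothetical sample; an open file object can be passed instead of a list.
sample_lines = "> name1\n\nAAA\nBBB\n\n> name2\n\nCCC\n".splitlines(True)
records = list(iter_records(sample_lines))
print(records)
```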

Answered By: S.Lott

If each file only has one sequence of amino acids, I wouldn’t use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
Answered By: MiniQuark

The following is a regular expression matching a multiline block of text:

import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
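
A short sketch of how that pattern behaves, using a made-up input string (the markers startText and endText are taken literally here). Note that the pattern requires at least one newline between the two markers:

```python
import re

# Hypothetical input: the block starts with "startText" and ends with "endText".
sample = "startText first\nmiddle\nlast endText"

result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', sample)

# Each match is a tuple of the four groups:
# start marker, rest of the first line, the following lines, end marker.
for start, first_line, other_lines, end in result:
    print(repr(first_line), repr(other_lines))
```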
Answered By: Punnerud

It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:

"(?m)^A complete line$".

This is handy in unit tests, for example with assertRaisesRegex: you don’t need to import re or compile your regex before calling the assert.
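
A minimal sketch of that use case. The parse function and its error message are invented for the example; assertRaisesRegex searches the pattern in the string form of the raised exception:

```python
import unittest


def parse(text):
    # Hypothetical function that always fails with a multi-line message.
    raise ValueError("Parsing failed\nline 2: unexpected token\nabort")


class InlineFlagTest(unittest.TestCase):
    def test_message_contains_complete_line(self):
        # (?m) turns on MULTILINE, so ^ and $ anchor to a single line
        # of the message instead of the whole string.
        with self.assertRaisesRegex(ValueError, "(?m)^line 2: unexpected token$"):
            parse("anything")
```

Such a test can be run with python -m unittest. Without the (?m) prefix the assertion would fail, because ^ and $ would then only match at the start and end of the whole message.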

Answered By: Eric Duminil