Regular expression matching a multiline block of text
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline (as well as the start of the string) and $ matches the position immediately preceding a newline (as well as the end of the string).
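If you want to see the zero-width behaviour for yourself, here is a small sketch. The sample string and the variable names are made up for illustration:

```python
import re

text = "first\nsecond\nthird"  # made-up sample, linefeeds only

# Without MULTILINE, ^ and $ anchor only at the ends of the whole string,
# so no single line can satisfy ^\w+$ here.
no_flag = re.findall(r"^\w+$", text)

# With MULTILINE, they also anchor at every line boundary.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# The anchors consume nothing: $ cannot "step over" the linefeed,
# so the \n has to be written out explicitly.
without_newline = re.search(r"first$second", text, re.MULTILINE)
with_newline = re.search(r"first$\nsecond", text, re.MULTILINE)

print(no_flag)          # []
print(per_line)         # ['first', 'second', 'third']
print(without_newline)  # None
print(with_newline.group(0) == "first\nsecond")  # True
```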
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
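Here is a quick check of that pattern against Windows-style line endings. The sample text and names are invented for the example. One thing to watch: in Python the dot excludes only \n, so a captured line can end with a stray \r. Stripping it afterwards is one simple way to deal with that, not the only one.

```python
import re

rx = re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

# Invented sample: title, blank line, two data lines, all ending in \r\n.
sample = "Title one\r\n\r\nAAAA\r\nBBBB\r\n"

match = rx.search(sample)
title, body = match.groups()

print(repr(title))            # 'Title one\r'  (the dot swallowed the carriage return)

clean_title = title.strip()   # drop the stray \r
clean_lines = body.split()    # split on any whitespace, dropping \r and \n

print(clean_title)            # Title one
print(clean_lines)            # ['AAAA', 'BBBB']
```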
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
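To illustrate why DOTALL gets in the way, this sketch runs the linefeed-only pattern with and without the flag. The two-block sample is made up for the comparison:

```python
import re

pattern = r"^(.+)\n((?:\n.+)+)"

# Made-up sample: two blocks, each a title, a blank line, then data lines.
sample = "Title A\n\nAAAA\nBBBB\n\nTitle B\n\nCCCC\n"

# Without DOTALL the dot stops at each linefeed, so each block matches separately.
plain = [m.groups() for m in re.finditer(pattern, sample, re.MULTILINE)]

# With DOTALL the dot runs across linefeeds, and the first group
# swallows most of the text, including the second title.
with_dotall = [m.groups() for m in re.finditer(pattern, sample, re.MULTILINE | re.DOTALL)]

print(plain)
# [('Title A', '\nAAAA\nBBBB'), ('Title B', '\nCCCC')]
print(with_dotall)
# [('Title A\n\nAAAA\nBBBB\n\nTitle B', '\nCCCC\n')]
```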
This will work:
>>> import re
>>> rx_sequence = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)
>>> rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines
>>> text = """Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
- The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
- Then (.+?)\n\n means "match as few characters as possible (any character except a newline is allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
- [A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
- ((?:textline)+) means "match one or more textlines, but do not put each line in a group". Instead, all the textlines are put in one group.
- You could add a final \n to the regular expression if you want to enforce a double newline at the end.
- Also, if you are not sure what type of newline you will get (\n or \r or \r\n), then just fix the regular expression by replacing every occurrence of \n with (?:\n|\r\n?).
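The pieces described above can also be assembled step by step, which makes the newline substitution easy to apply. This is only a sketch: NEWLINE, TEXTLINE, and parse_blocks are names I made up for illustration, and the sample text is shortened.

```python
import re

NEWLINE = r"(?:\n|\r\n?)"        # stands in for every \n in the original pattern
TEXTLINE = r"[A-Z]+" + NEWLINE   # one line of upper-case letters plus its newline

# ^(.+?) NEWLINE NEWLINE ((?:textline)+)
rx_sequence = re.compile(
    r"^(.+?)" + NEWLINE + NEWLINE + r"((?:" + TEXTLINE + r")+)",
    re.MULTILINE,
)

def parse_blocks(text):
    """Return a list of (title, sequence) pairs found in text."""
    blocks = []
    for match in rx_sequence.finditer(text):
        title, sequence = match.groups()
        # Join the textlines by removing all whitespace, including \r and \n.
        blocks.append((title.strip(), re.sub(r"\s+", "", sequence)))
    return blocks

# Shortened sample, once with \n and once with \r\n line endings.
unix_text = "Some varying text1\n\nAAABBB\nCCCDDD\n\nSome varying text 2\n\nEEEFFF\n"
windows_text = unix_text.replace("\n", "\r\n")

print(parse_blocks(unix_text))
# [('Some varying text1', 'AAABBBCCCDDD'), ('Some varying text 2', 'EEEFFF')]
print(parse_blocks(windows_text) == parse_blocks(unix_text))  # True
```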