Hello, fellow monks!
I have been writing a parser for some Protein Data Bank files, for a bioinformatics project. I have no problem extracting the sequences I need, but I am stumped by the titles. Here's the problem:
The files start out in this format:
HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
TITLE 2 7.5
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ACUTOLYSIN A;
...
The lines beginning with TITLE are the ones I'm interested in grabbing. There's one small caveat: on every line after the first, the line number is prepended to the title fragment. So in this example, the actual title is "Acutolysin A from snake venom of agkistrodon acutus at pH 7.5".
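To make the caveat concrete, here is a minimal sketch of joining the fragments. It assumes whitespace-separated fields exactly as in the sample above, and that every TITLE line after the first carries a bare integer right after the keyword. The name `join_title` is hypothetical, and a real PDB file is column-oriented, so a column-based approach may turn out to be safer.

```perl
use strict;
use warnings;

# Hypothetical helper: join a block of TITLE lines into one title string.
# Assumption: fields are whitespace-separated, as in the sample above, and
# each line after the first starts with its line number.
sub join_title {
    my @lines = @_;
    my @parts;
    for my $line (@lines) {
        my $text = $line;
        $text =~ s/^TITLE\s+//;           # drop the record name
        $text =~ s/^\d+\s+// if @parts;   # drop the number on 2nd and later lines
        $text =~ s/\s+$//;                # trim trailing whitespace
        push @parts, $text;
    }
    return join ' ', @parts;
}

my @title_lines = (
    'TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH',
    'TITLE 2 7.5',
);

print join_title(@title_lines), "\n";
# prints: ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH 7.5
```

The number is only stripped from the second line onward, so a title whose first line happens to begin with a digit is left alone.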
So far so dull. But later in the file, sometimes much later, there may be lines that also begin with TITLE. We want to ignore those.
Assuming the following constraints:
- We treat the file as an array (no slurping into a scalar)
- There is no way to distinguish the later TITLE elements by pattern matching.
Can anyone think of an elegant way to grab the first block of 1+ contiguous TITLE lines, and stop?
I know how to do this with regular expressions on a scalar, and how to do it in a very inelegant way by setting flags in a loop, but I suspect there is greater wisdom out there and can't wait to learn.
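To pin down the behaviour being asked for, here is one possible shape, offered as a minimal sketch rather than a definitive answer. It walks the array once, collects lines while they start with TITLE, and leaves the loop with `last` at the first non-TITLE line after the block has begun. The name `first_title_block` and the final line of the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run of contiguous TITLE lines.
# The collected block doubles as the "have we started yet?" state,
# so no separate flag variable is needed.
sub first_title_block {
    my @block;
    for my $line (@_) {
        if ($line =~ /^TITLE\b/) {
            push @block, $line;   # still inside the first TITLE block
        }
        elsif (@block) {
            last;                 # block has ended; ignore the rest of the file
        }
    }
    return @block;
}

my @file = (
    'HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW',
    'TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH',
    'TITLE 2 7.5',
    'COMPND MOL_ID: 1;',
    'COMPND 2 MOLECULE: ACUTOLYSIN A;',
    'TITLE A LATER LINE THAT SHOULD BE IGNORED',   # invented for this example
);

my @titles = first_title_block(@file);
print "$_\n" for @titles;
```

Because the loop exits as soon as the block ends, later TITLE lines are never examined, however far down the file they sit.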
Special bonus to anyone who can tell me what an agkistrodon acutus is, and how deadly its bite is.