Re: regular expression questions (from someone without experience)

Well, the miRBase data format is spooky, but what I'd do first is clean up this file by removing the lines that have no value for the analysis problem at hand. You can keep the original file intact and generate the cleaned-up file(s) from it; each can hold its own subset of the original and address its own subproblem, and together they add up to the overall analytical goal. (N.B. You haven't mentioned what you intend to do with the file sections you wanted captured.)
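To make the clean-up idea concrete, here is a minimal sketch. The rule in is_wanted() is purely my assumption (drop blank lines and lines made up only of '*' and whitespace), since you haven't said which lines matter to you; the sub names are made up for illustration as well. The original file is only ever opened for reading.

```perl
use strict;
use warnings;

# Hypothetical rule: keep a line only if it has content and is not a
# marker line consisting solely of '*' and whitespace.
sub is_wanted {
    my ($line) = @_;
    return 0 if $line =~ /^\s*$/;       # blank line
    return 0 if $line =~ /^[\s*]+$/;    # only stars and whitespace
    return 1;
}

# Copy the wanted lines from $in_path to $out_path and return how many
# lines were kept. The input file is never modified.
sub clean_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    my $kept = 0;
    while (my $line = <$in>) {
        next unless is_wanted($line);
        print {$out} $line;
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $kept;
}

# Runs only when two file names are given on the command line.
clean_file(@ARGV) if @ARGV == 2;
```

Swap the body of is_wanted() for whatever defines "interesting" in your case; one such filter per subproblem gives you one cleaned file per subproblem.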

I have to disagree with Moritz's reliance on the '*' to separate the records (this arose from the OP's description), because the '*'s here have a different meaning altogether and aren't record separators at all. They mark the positions where two or more lines of letters are identical at the character level; this is known as Sequence Alignment. So if the sequences in a record share no identical positions, no '*' appears and two records can be inadvertently fused, and if a line of '*'s appears mid-record, a record can be split in two without anyone noticing. On a related note, the '-' is used to represent alignment gaps.

     gap
      |
      v
TTCCAG-CCAGCTTTGTGACT-CTA
TTCCAGCCCAGCTTTATGACT-GTA
TTCCAGCCCAGCTTCTTCGCT-CTG
****** ******       *  * 
    ^
    |
 identity
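To show that the '*' line is derived from the sequences rather than being a delimiter, here is a rough sketch that rebuilds such a line from aligned rows. identity_line() is a made-up name, and I'm assuming a column gets a '*' only when every row holds the same non-gap character. Run on just the three rows above, it may mark more columns than a line excerpted from a larger alignment would.

```perl
use strict;
use warnings;

# Build the marker line for a set of equal-length aligned rows.
# Assumption: '*' only where all rows share the same non-gap character,
# a space everywhere else.
sub identity_line {
    my @rows  = @_;
    my $width = length $rows[0];
    my $marks = '';
    for my $col (0 .. $width - 1) {
        my %seen;
        $seen{ substr($_, $col, 1) } = 1 for @rows;   # distinct chars in this column
        my ($only) = keys %seen;
        $marks .= (scalar(keys %seen) == 1 && $only ne '-') ? '*' : ' ';
    }
    return $marks;
}

my @rows = qw(
    TTCCAG-CCAGCTTTGTGACT-CTA
    TTCCAGCCCAGCTTTATGACT-GTA
    TTCCAGCCCAGCTTCTTCGCT-CTG
);
print "$_\n" for @rows;
print identity_line(@rows), "\n";
```

Rows with no identical column at all produce a line with no '*' in it, which is exactly the case where splitting records on '*' goes wrong.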

Back to the topic: refining the file by purging the unwanted lines will probably allow you to use one of the BioPerl modules to tackle the entire problem without writing much code. It would also give us a clear definition of the problem, so that we can offer relevant assistance.

You may also want to read Perl and Bioinformatics.

Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
