in reply to regular expression questions (from someone without experience)
I have to disagree with Moritz's reliance on the '*' to separate the records (which arose from the OP's description). The '*'s here mean something else altogether and are not record separators at all: they mark the positions where two lines of letters, or more for that matter, are identical at the character level. This is known as sequence alignment. If the sequences are not identical at any position, no '*' appears, so two records can be inadvertently fused; and if a '*' turns up mid-record, a single record can be split in two without anyone noticing. On a related note, the '-' represents alignment gaps.
Back to topic: refining the file by purging the unwanted lines would probably let you use one of the BioPerl modules to tackle the entire problem without writing much code after all. It would also give us a clear definition of the problem, so that we can offer relevant help.

          gap
           |
           v
    TTCCAG-CCAGCTTTGTGACT-CTA
    TTCCAGCCCAGCTTTATGACT-GTA
    TTCCAGCCCAGCTTCTTCGCT-CTG
    ****** ****** * *
      ^
      |
    identity
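To make the point concrete, here is a minimal sketch in plain Perl of how such an identity line can be derived from the aligned sequences themselves, which is why it cannot serve as a record separator. The sub name identity_line is my own invention for illustration, not a BioPerl function, and I am assuming the simple rule that a column gets a '*' only when every sequence has the same non-gap character there. Real alignment programs may apply their own rules.

```perl
use strict;
use warnings;

# Hypothetical helper: build the identity line for aligned sequences.
# Assumption: a column is marked '*' only if all sequences share the
# same character there and that character is not the gap symbol '-'.
sub identity_line {
    my @seqs = @_;
    return '' unless @seqs;

    my $len = length $seqs[0];
    for my $s (@seqs) {
        die "Aligned sequences must have equal length\n"
            if length($s) != $len;
    }

    my $line = '';
    for my $i (0 .. $len - 1) {
        # Collect the distinct characters found in this column.
        my %seen;
        $seen{ substr($_, $i, 1) }++ for @seqs;
        my @residues = keys %seen;
        $line .= (@residues == 1 && $residues[0] ne '-') ? '*' : ' ';
    }
    return $line;
}

# The three sequences from the example above.
my @aligned = qw(
    TTCCAG-CCAGCTTTGTGACT-CTA
    TTCCAGCCCAGCTTTATGACT-GTA
    TTCCAGCCCAGCTTCTTCGCT-CTG
);

print "$_\n" for @aligned;
print identity_line(@aligned), "\n";
```

The marker line depends entirely on the content of the sequences above it, so its presence, absence, or position tells you nothing about where a record begins or ends.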
You may want to read Perl and Bioinformatics in addition.