Re^4: need help with a regex

You don't see those sequences because they are in the file I'm parsing, not in what I posted; otherwise I'd have to post the entire 3000-line file. I'm pretty much a novice: I've been working with Perl for about 3 months, but I have to write this code myself, and I'm not editing somebody else's script to do it. It's a bit more complicated than the code you gave in your response. If a line matches the header pattern, store it. If, on the next line iteration, there's a match to the residues, print out the header and the match. If there are additional matches in the sequence, as the output shows, print the name of the sequence only once but still print the other hits. I can easily pull out all the sequences or all the headers from the file, but I want to store the header and use it only if the sequence(s) are present on the subsequent lines, then print them all at once. My issue is not with the regular expressions themselves but with deciding which kind of loop is best suited to the task. Thank you, and sorry about my posting etiquette.

Replies are listed 'Best First'.
Re^5: need help with a regex
by kennethk (Abbot) on Oct 22, 2010 at 21:28 UTC

Everyone learns by making mistakes. Demonstrate you can learn from those, and you will be well-regarded on this forum.

If you post input and desired output (good), make sure they correspond. In this case, only including the first section of output would have been appropriate, so the two match up.

I believe this works closer to your spec; it clears the contents of $header after the first print, so it will only appear once.

#!/usr/bin/perl
use strict;
use warnings;

local $/; # Slurp
my $content = <DATA>;

my ($header) = $content =~ /^(>.*?\n)/m;
while ($content =~ /^[\w]*?([VMFWLCA]{8,})[\w]*?$/mg) {
    my $sequence = $1;
    print $header, "contains $sequence at position ",
        pos($content) - length($sequence), "\n";
    $header = "";
}
__DATA__
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M

MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
SDRKGGSYSQAASSDSAQGSDMSLTACKVVAVLMLCLAVIFLC

outputs:

>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M
contains AVVAAVMW at position 420
contains VVAVLMLCLAV at position 461
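
If your real file holds many records, the same print-the-header-once idea can also be written as a plain line-by-line loop. The following is only a sketch under my own assumptions: report_hits is a made-up helper name, the sample records are invented, and a run that continues across a line break would not be detected. It takes the start of each hit from $-[1] (the offset of the first capture group within the current line), since pos() points at the end of the whole match rather than the end of the captured run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: walk the lines of a FASTA-like file and return the
# report lines. The header is remembered when it is seen and emitted only
# in front of the first hit of its record.
sub report_hits {
    my @lines = @_;
    my ($header, $printed, @out) = ('', 0);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^>/) {                    # new record: store header, reset flag
            ($header, $printed) = ($line, 0);
            next;
        }
        while ($line =~ /([VMFWLCA]{8,})/g) {   # every run on this line
            my ($run, $start) = ($1, $-[1]);    # $-[1] = start offset of group 1
            push @out, $header unless $printed++;
            push @out, "contains $run at offset $start";
        }
    }
    return @out;
}

# Invented sample data, for illustration only.
print "$_\n" for report_hits(
    ">seq1 | example",
    "RRRAVVAAVMWRRKS",
    "KVVAVLMLCLAVIFLC",
    ">seq2 | no hits",
    "RRRKKKSSS",
);
```

A record without any hit never has its header printed, which I think is what you described. To run it over a file, pass the lines read from a filehandle instead of the literal list.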

Note that I've changed your + to * so that your regular expression can also match entries at the start and end of lines, not just in the middle. I assume this was an oversight on your part; sorry if this assumption is incorrect.
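
To illustrate the difference between the two quantifiers, here is a small sketch. I am assuming your original pattern used [\w]+? on both sides of the capture; first_run is a made-up helper and the three test lines are invented. With +, at least one word character is required on each side, so a run that starts or ends the line cannot match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed original form (with +) versus the changed form (with *).
my $plus = qr/^[\w]+?([VMFWLCA]{8,})[\w]+?$/;
my $star = qr/^[\w]*?([VMFWLCA]{8,})[\w]*?$/;

# Hypothetical helper: return the captured run, or undef if the line fails.
sub first_run {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

# Run at the start of the line, at the end, and in the middle.
for my $line ('AVVAAVMWRRKS', 'RRKSAVVAAVMW', 'RRAVVAAVMWRR') {
    my $with_plus = first_run($plus, $line) // 'no match';
    my $with_star = first_run($star, $line) // 'no match';
    print "$line  +: $with_plus  *: $with_star\n";
}
```

Only the third line matches with +, while all three match with *.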
