aquinom has asked for the wisdom of the Perl Monks concerning the following question:

Thanks everyone for your help. This is the finished code in case you're interested.
#!/usr/bin/perl
use strict;
use warnings;

my $infile = $ARGV[0];
unless (open (INFILE, '<', $infile)){
    print STDERR "Can't open $infile $!\n";
    die;
}

my ($header, $count, $match) = ('', 0, 0);
print "\n\n";

#This algorithm will work only when the sequence is on one line
while (<INFILE>) {
    chomp;
    if ($_ =~ />(.*)/) {    #a header line
        $header = $1;
        $count++;           #keep running total of sequence number
    }
    else {                  #not a header
        my $i = 0;
        while ( $_ =~ /([VILMFWCA]{8,})/g) {
            my $domain = $1;
            my $len = length $domain;
            $len--;
            if ($i == 0){
                print "Hydrophobic stretch found in: $header\n";
                $match++;
                $i++;
            }
            print "$domain\n";
            print "The match was at position: ", pos() - $len, "\n";
        }
        if ($i > 0){
            print "\n\n";
        }
    }
}
close INFILE;
print "Hydrophobic region(s) found in $match sequences out of $count sequences\n";
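
The code above assumes each sequence sits on a single line. Here is a minimal sketch of one way to lift that restriction, assuming FASTA-style input where a record's sequence may be wrapped over several lines: join the lines of each record into one string, then scan that string. The sub names `read_records` and `find_stretches` and the demo record are hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA-style text from a filehandle and return
# a list of [header, sequence] pairs, joining wrapped sequence lines.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {            # header line starts a new record
            push @records, [$1, ''];
            next;
        }
        $line =~ s/\s+//g;                  # drop stray whitespace in sequence lines
        next if $line eq '' or !@records;   # skip blanks and text before any header
        $records[-1][1] .= $line;           # append to the current record
    }
    return @records;
}

# Hypothetical helper: return [stretch, 1-based start] pairs for every run
# of 8 or more residues from the set used in the script above.
sub find_stretches {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /([VILMFWCA]{8,})/g) {
        # pos() is the offset just past the match, so subtract the length
        # and add 1 to get a 1-based start position.
        push @hits, [$1, pos($seq) - length($1) + 1];
    }
    return @hits;
}

# Demo on a made-up record whose stretch spans a line break (not real data).
my $sample = ">demo header\nKKKKAVVA\nAVMWRRKS\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
for my $rec (read_records($fh)) {
    my ($header, $seq) = @$rec;
    for my $hit (find_stretches($seq)) {
        print "$header: $hit->[0] at position $hit->[1]\n";
    }
}
close $fh;
```

With the demo record, the stretch AVVAAVMW is only found because the two sequence lines are joined first; scanning line by line would see two runs of four and report nothing.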

Replies are listed 'Best First'.
Re: need help with a regex
by kennethk (Abbot) on Oct 22, 2010 at 18:59 UTC
    Rather than just the textual description you included above, it is generally much better to include a small sample of your actual input wrapped in code tags. Note that your post has been misformatted as a result. See Writeup Formatting Tips. Also, please provide sample output, as I have a great deal of trouble following your spec. See How do I post a question effectively?.

    I note that your posted code has several basic syntactic issues. Are you learning Perl from a book, or are you attempting to modify someone else's script? Do you have any background in programming in general? These issues include using a capitalized AND (vs. and; Perl is case-sensitive), modifiers on your regular expressions that are inappropriate for what you are trying to do, and a wholly incorrect block structure. Telling us where you are will let us better guide your development as a programmer and point you to more useful resources.

    I am loath to leave this node without a concrete bit of suggested code, but I am at a loss for how to even modify the posted code to "work".

      Sorry about the messy code I wrote I know it wasn't actually legit I was just in a hurry and trying to get the basic idea across. Here's a snip from what I'm parsing:

      >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M

      MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
      QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
      TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
      ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
      RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
      RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
      SDRKGGSYSQAASSDSAQGSDMSLTACKV

      and the output should look like:
      Hydrophobic stretch found in: P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
      AVVAAVMW
      The match was at position: 325
      Hydrophobic stretch found in:
      A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
      VAVLMLCLAVIFLC
      The match was at position: 170
      LLALVAIFF
      The match was at position: 493
      IWICWFAALAA
      The match was at position: 705
      LALALAFA
      The match was at position: 970
      Hydrophobic region(s) found in 2 sequences out of 15 sequences

        Like I said, where are the <code> tags around your input and output? Embedding text in HTML is notorious for changing formatting. A regular expression I write will very likely fail because the character sequence displayed on the screen will differ from what you have in your file. I also note your "desired output" is significantly different from what you specified in the original post. For example, the word "Hydrophobic" appears nowhere in the OP and the word "contains" appears nowhere in the new spec.

        Sorry about the messy code I wrote I know it wasn't actually legit
        Writing pseudocode is considered good practice when you don't know a language. That means explaining clearly what you want an algorithm to do, not just posting gibberish from the target language.

        I was just in a hurry and trying to get the basic idea across
        Which you did not do, nor have you done so effectively yet. Perhaps the more verbose How To Ask Questions The Smart Way may provide clear guidance on how to effectively construct questions on internet forums.

        You still did not answer my questions on your own experience level. I will assume you are an extreme novice with access to a working script crafted by another. I can give you aid on this particular problem, but if you expect to get anywhere in the long run, you will need to learn some very basic coding concepts you apparently lack.

        In examining your desired output, I note that several of your character sequences do not appear in your text block, e.g. "VAVLMLCLAVIFLC", "LLALVAIFF", ... I note that "AVVAAVMW" is cited at "position: 325". This makes me suspect that the original file you are parsing does not contain the white space you are posting, or that the input is modified before filtering.

        I have modified your originally posted code to do something like what you request, though the numbers are wrong.

        #!/usr/bin/perl
        use strict;
        use warnings;

        local $/; # Slurp
        my $content = <DATA>;
        my ($header) = $content =~ /^(>.*?)$/m;

        while ($content =~ /^[\w]+?([VMFWLCA]{8,})[\w]+?$/mg) {
            my $sequence = $1;
            print $header, "contains $sequence at position ", pos($content) - length($sequence), "\n";
        }

        __DATA__
        >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M
        MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
        QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
        TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
        ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
        RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
        RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
        SDRKGGSYSQAASSDSAQGSDMSLTACKV

        outputs

        >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;Mcontains AVVAAVMW at position 420

        I leave modifying it to get what you expect as an exercise for you. You will likely want to read the documentation at perlsyn, perlre, perlretut, pos and length.
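
        As far as can be told from the code above, the number is off for two reasons: the trailing `[\w]+?$` consumes the rest of the line, so `pos($content)` points at the end of the line rather than the end of the captured stretch, and the offset is counted from the start of the slurped text, header included. Below is a minimal sketch of one alternative, reading the capture group's start offset from Perl's built-in `@-` array. It uses a made-up string rather than the data above, and the sub name `stretch_positions` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: find runs of 8+ residues in $text, starting the scan
# at offset $base, and report 1-based positions relative to $base.
sub stretch_positions {
    my ($text, $base) = @_;
    my @found;
    pos($text) = $base;                 # begin matching after the header line
    while ($text =~ /([VMFWLCA]{8,})/g) {
        # $-[1] is the 0-based offset of capture group 1 within $text.
        push @found, [$1, $-[1] - $base + 1];
    }
    return @found;
}

# Made-up input: a header line followed by one sequence line.
my $content = ">demo\nKKKKAVVAAVMWRRKS\n";

# Offset where the sequence begins (just past the header's newline).
my $seq_start = index($content, "\n") + 1;

for my $hit (stretch_positions($content, $seq_start)) {
    print "$hit->[0] at position $hit->[1]\n";
}
```

        This still counts any newlines inside a multi-line sequence, so for wrapped input the line breaks would have to be removed before the positions mean anything.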

Re: need help with a regex
by halfcountplus (Hermit) on Oct 22, 2010 at 19:12 UTC

    I think you will get much better help if you are more specific and include an actual example of what you want to parse -- since you have admitted your regexp doesn't work, it is hard to deduce what you are trying to do.

    Eg, [VMFWLCA]{,8} is invalid (there must be something before the comma), but presuming you mean "at least once, and a maximum of 8 times" (which would be {1,8}), then $1 could be any of these:

    VVVVV
    FMFMWWWL
    C

    Is that what you are trying to match?

      {,8} is perfectly valid it means a maximum of 8 times, but I meant to write {8,} anyways which means at least 8 times.
        {,8} is perfectly valid it means a maximum of 8 times

        {,8} is a perfectly valid string which in a regular expression matches the character '{' followed by the character ',' followed by the character '8' followed by the character '}'. Perhaps you are thinking of the quantifier {0,8}?
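
        A small sketch of how the quantifiers under discussion behave, using made-up test strings. It deliberately sticks to `{8,}`, `{1,8}` and `{0,8}`, whose meaning is the same in every Perl 5 release. How a bare `{,8}` is treated depends on the Perl version: it was a literal string when this thread was written, while Perl 5.34 and later accept it as a quantifier. The sub name `show` is hypothetical.

```perl
use strict;
use warnings;

my $short = 'KKVVVVVKK';        # contains a run of 5
my $long  = 'KKFMFMWWWLLLKK';   # contains a run of 10

# Hypothetical helper: try a pattern with one capture group against a
# string, print what happened, and return the same description.
sub show {
    my ($label, $string, $re) = @_;
    my $result = $string =~ $re ? "matched '$1'" : 'no match';
    print "$label on $string: $result\n";
    return $result;
}

show('{8,}',  $short, qr/([VMFWLCA]{8,})/);    # no match: the run is too short
show('{8,}',  $long,  qr/([VMFWLCA]{8,})/);    # matched 'FMFMWWWLLL'
show('{1,8}', $short, qr/([VMFWLCA]{1,8})/);   # matched 'VVVVV'
show('{1,8}', $long,  qr/([VMFWLCA]{1,8})/);   # matched 'FMFMWWWL'
show('{0,8}', $short, qr/([VMFWLCA]{0,8})/);   # matched '' (zero repeats at offset 0)
```

        The last line shows why `{0,8}` is rarely what is wanted here: it succeeds immediately with an empty match before ever reaching the run.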

Re: need help with a regex
by umasuresh (Hermit) on Oct 22, 2010 at 19:04 UTC