cyphedude has asked for the wisdom of the Perl Monks concerning the following question:

We all know (or should) that the g option in regular expressions doesn't actually give you "all occurrences" of a substring in a string. For example:

perl -wle '$_="ababa";(@matches) = /aba/g; print @matches;'

prints "aba" only once, even though there are TWO (overlapping) occurrences of "aba" in the original string.
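
To see why, it can help to look at where the regex engine resumes after each match. This is only a minimal sketch using the built-in pos() and the @- array; the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = "ababa";
my @found;

# In scalar context, each successful /g match leaves pos($text)
# just past the end of that match; the next search starts there.
while ($text =~ /aba/g) {
    # $-[0] is the offset where the current match began
    push @found, "start=$-[0] end=" . pos($text);
}

print "$_\n" for @found;
```

The only match runs from offset 0 to offset 3, so the next search starts at offset 3 and never sees the occurrence that begins at offset 2.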

I'm faced with a similar situation. I'm trying to scan a string of DNA for ALL occurrences of /..g.{18}c/. At any given position of a random sequence, there is about a 6% chance that the pattern matches (1/4 for the g times 1/4 for the c, i.e. 1/16, assuming all four bases are equally likely). However, when used like the following:

my @string_var = $DNA_sequence =~ /..g.{18}c/ig;

then we only get about half of the predicted number of matches. This is because, given the pattern's length and composition, there is a very good chance that two matches overlap (as in the "ababa" example), and /g resumes searching only after the end of the previous match.

So here's the question: How do I get around this so that I can match the missing substrings? Can it be done with a regexp, or substr, or is there a module that can help?

Replies are listed 'Best First'.
Re: regexp g option: finds all occurrences? ha!!
by davido (Cardinal) on Oct 12, 2003 at 21:26 UTC
    Didn't this just come up yesterday? Ah yes, here it is: multiple matches with regexp... and it was two days ago, not yesterday. ;)

    There are some great solutions contained in that thread. I particularly liked the one that used pos as an lvalue. I wish the answer I provided in that thread had been so slick. You may even see the person who started the thread in class on Monday. Be sure to thank him! lol.
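
For readers who don't follow the link, here is a rough sketch of the general idea of assigning to pos(). It is not the code from the linked thread, and the string and pattern are stand-ins chosen for illustration:

```perl
use strict;
use warnings;

my $seq = "ababa";   # stand-in input, not real DNA data
my @matches;

while ($seq =~ /(aba)/g) {
    push @matches, $1;
    # pos() is assignable: rewind so the next search starts one
    # character after the START of this match, not after its end.
    pos($seq) = $-[0] + 1;
}

print "@matches\n";
```

With the rewind in place, both overlapping occurrences of "aba" are collected.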


    Dave


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
      Amazingly, one of the solutions worked perfectly.

      my $regexp = "..g.{18}c";
      my @antisense_oligos;
      $_ = $input_seq;
      do {
          push @antisense_oligos, $1 if (m/^($regexp)/)
      } while ( s/^.// );
      $, = "\n";
      print @antisense_oligos;


      When I get the chance, I'll try to put pos() to work.

      Thank you!!!
        Another approach, which you may find quicker depending on how large your input string is, is to use substr to nibble N chars at a time and then move to the next position to grab the next group of chars. This should be fairly speedy, with little of the overhead of the approach above.
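
One possible reading of this suggestion, sketched under the assumption that the pattern is the fixed-length /..g.{18}c/i from the question (22 characters) and that the sequence contains no newlines. The helper name overlapping_windows and the demo string are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: slide a 22-character window over the sequence
# one position at a time and keep every window that has 'g' at
# offset 2 and 'c' as its last character (case-insensitive).
sub overlapping_windows {
    my ($seq) = @_;
    my $len = 22;    # length of ..g.{18}c: 2 + 1 + 18 + 1
    my @hits;
    for my $i (0 .. length($seq) - $len) {
        my $window = substr($seq, $i, $len);
        push @hits, $window
            if lc(substr($window, 2, 1)) eq 'g'
            && lc(substr($window, -1)) eq 'c';
    }
    return @hits;
}

# Made-up 23-character sequence containing two overlapping hits
my $demo = "aagg" . ("a" x 17) . "cc";
my @hits = overlapping_windows($demo);
print scalar(@hits), "\n";
```

Because the window always advances by one character, overlapping hits are never skipped, and the original string is never modified.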


        -Waswas
Re: regexp g option: finds all occurrences? ha!!
by Abigail-II (Bishop) on Oct 12, 2003 at 23:26 UTC
    I would use positive lookahead with capturing:

    $ perl -wle '$_ = "abaca"; @a = /(?=(a.a))/g; print "@a"'
    aba aca
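
Applied to the pattern from the original question, the same lookahead trick might look like the sketch below. The test sequence is invented for illustration; real input would come from the poster's $DNA_sequence:

```perl
use strict;
use warnings;

# Made-up 23-character sequence with two overlapping candidates
my $dna = "TTGG" . ("A" x 17) . "CC";

# Plain /g: consumes each match, so overlapping ones are skipped
my @plain = $dna =~ /..g.{18}c/ig;

# Lookahead with a capture: each match is zero-width, so the engine
# tries every starting position and $1 holds the 22 characters seen
my @overlap = $dna =~ /(?=(..g.{18}c))/ig;

print scalar(@plain), " vs ", scalar(@overlap), "\n";
```

On this input the plain match finds one occurrence and the lookahead version finds two.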

    Abigail