regexp g option: finds all occurrences? ha!!

cyphedude has asked for the wisdom of the Perl Monks concerning the following question:

We all know (or should) that the g option in regular expressions doesn't actually give you "all occurrences" of a substring in a string: it skips any occurrence that overlaps an earlier match. For example:

perl -wle '$_="ababa";(@matches) = /aba/g; print @matches;'

prints "aba" only once, even though there are TWO occurrences of "aba" in the original string.
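
To see why, it helps to watch where the regex engine resumes after each match. The following is only a minimal sketch (the variable names are illustrative and not from the original one-liner):

```perl
use strict;
use warnings;

my $str = "ababa";
my @starts;

# In a while loop, /g finds one match per iteration and remembers
# where it stopped; pos($str) reports that offset.
while ($str =~ /aba/g) {
    push @starts, $-[0];   # $-[0] is the start offset of the last match
    print "matched at $-[0], next search starts at ", pos($str), "\n";
}
# prints: matched at 0, next search starts at 3
```

The first match covers offsets 0 through 2, so the next search starts at offset 3, which is already past offset 2, where the second "aba" begins.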

I'm faced with a similar situation. I'm trying to scan a string of DNA for ALL occurrences of /..g.{18}c/. For any given (random) sequence, there is about a 6% chance of matching the above substring. However, when used like the following:

my @string_var = $DNA_sequence =~ /..g.{18}c/ig;

then we only get about half of the predicted number of matches. This is entirely because, given the substring's length and composition, there is a very good chance that two matching patterns overlap (as in the "ababa" example).

So here's the question: How do I get around this so that I can match the missing substrings? Can it be done with a regexp, or substr, or is there a module that can help?

Re: regexp g option: finds all occurrences? ha!! by davido (Cardinal) on Oct 12, 2003 at 21:26 UTC
Didn't this just come up yesterday? Ah yes, here it is: multiple matches with regexp... and it was two days ago, not yesterday. ;) There are some great solutions contained in that thread. I particularly liked the one that used pos as an lvalueable function. I wish the answer I provided in that thread had been so slick. You may even see the person who started the thread in class on Monday. Be sure to thank him! lol.

Dave

"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
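
For readers without the other thread at hand, here is a rough sketch of what assigning to pos() might look like. This is not the code from that thread; the helper name overlapping_matches is hypothetical, and the sketch assumes the pattern never matches the empty string:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every match of $re in $seq,
# including matches that overlap an earlier one.
sub overlapping_matches {
    my ($seq, $re) = @_;
    my @found;
    while ($seq =~ /($re)/g) {
        push @found, $1;
        # pos() is assignable: rewind so the next search starts
        # one character after the START of this match, not its end.
        pos($seq) = $-[0] + 1;
    }
    return @found;
}

my @m = overlapping_matches("ababa", qr/aba/);
print "@m\n";   # prints: aba aba
```

The same helper should work with the pattern from the question, e.g. overlapping_matches($DNA_sequence, qr/..g.{18}c/i).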
Re: Re: regexp g option: finds all occurrences? ha!! by cyphedude (Initiate) on Oct 12, 2003 at 22:15 UTC
Amazingly, one of the solutions worked perfectly.

`my $regexp = "..g.{18}c"; my @antisense_oligos; $_ = $input_seq; do{ push @antisense_oligos, $1 if (m/^($regexp)/) } while ( s/^.// ); $,="\n"; print @antisense_oligos;`

When I get the chance, I'll try to put pos() to work. Thank you!!!
Re: Re: Re: regexp g option: finds all occurrences? ha!! by waswas-fng (Curate) on Oct 12, 2003 at 23:17 UTC
Depending on how large your input string is, you may find it quicker to use substr to nibble N chars at a time, then advance the position to grab the next group of chars. This should be fairly speedy, with little of the overhead of the above.

-Waswas
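
One possible reading of this suggestion is a sliding window: take a fixed-length substr at each offset and test it against the anchored pattern. The sketch below assumes that reading; the input string is made up for illustration, and the window length of 22 comes from counting the characters in ..g.{18}c (2 + 1 + 18 + 1):

```perl
use strict;
use warnings;

my $seq = "gc" x 15;   # made-up test input, not real sequence data
my $len = 22;          # length of any match of ..g.{18}c
my @hits;

# Slide a 22-character window over the string one offset at a time.
for my $i (0 .. length($seq) - $len) {
    my $window = substr($seq, $i, $len);
    push @hits, [$i, $window] if $window =~ /\A..g.{18}c\z/i;
}

print "$_->[0]: $_->[1]\n" for @hits;
```

Unlike the s/^.// loop, this leaves the original string untouched and records the offset of each hit.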
Re: regexp g option: finds all occurrences? ha!! by Abigail-II (Bishop) on Oct 12, 2003 at 23:26 UTC
I would use positive lookahead with capturing:

`$ perl -wle '$_ = "abaca"; @a = /(?=(a.a))/g; print "@a"'`

which prints `aba aca`.

Abigail
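
Applied to the pattern from the original question, the lookahead approach might look like the sketch below. The input string and the names $dna and @oligos are assumptions for illustration; $-[1] is Perl's built-in start offset of the first capture group:

```perl
use strict;
use warnings;

my $dna = "gc" x 15;   # made-up test input, not real sequence data
my @oligos;

# The lookahead itself consumes no characters, so the engine advances
# only one position per match and overlapping hits are not skipped.
while ($dna =~ /(?=(..g.{18}c))/ig) {
    push @oligos, { start => $-[1], seq => $1 };
}

print "$_->{start}: $_->{seq}\n" for @oligos;
```

In list context, my @all = $dna =~ /(?=(..g.{18}c))/ig; returns just the captured substrings, which is the shape of the assignment in the original question.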