Re: count backrefenence regex

in reply to count backrefenence regex

One-liners are great, but not for readability. When you want people to be able to help you easily, smashing the code into a one-liner is not the best way to present a Short, Self-Contained, Correct Example. Also, Use strict and warnings.

if i 1st assign $+{repeat} to a variable, i get the substring

The Variables related to regular expressions are overwritten by each successful regex operation and are dynamically scoped. If you want to use them later, you should generally store them in other variables, and only use them if the match succeeded in the first place. (A related recent thread on the variables' scoping: Re: why is $1 cleared at end of an inline sub?)
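Here's a minimal sketch of that advice. The sample string is made up and not taken from your code; the point is only that the capture is copied inside the success branch, so a later match can't clobber it.

```perl
use strict;
use warnings;

my $s = 'AGCTAGCT';    # made-up sample string
my $repeat;

# Copy the named capture right away, and only if the match succeeded.
if ( $s =~ m{ (?<repeat>\w{3,}) \w* \g{repeat} }x ) {
    $repeat = $+{repeat};
}

# A later successful match overwrites %+, but not our copy.
my $other = 'xyz';
if ( $other =~ m{ (?<repeat>y) }x ) {
    # in here, $+{repeat} belongs to this second match
}

print defined $repeat ? "saved: $repeat\n" : "no match\n";    # saved: AGCT
```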

adding //g back gets back to the non-functional state

As per Regexp Quote Like Operators:

In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match. The position after the last match can be read or set using the pos() function; see "pos" in perlfunc. A failed match normally resets the search position to the beginning of the string, but you can avoid that by adding the /c modifier (for example, m//gc). Modifying the target string also resets the search position.

The thing to note here is that pos is per string. You're matching against $s in the while's condition, but then also matching against $s again inside the while's body, each operation using and affecting $s's pos.
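To illustrate, here's a small sketch with a made-up string (not your data). The first loop never advances, because the failing inner match resets pos($s) each time; the second loop runs the inner match against a copy, which is just one possible way around it.

```perl
use strict;
use warnings;

my $s = 'aXbXcX';    # made-up sample string

# Broken: the inner match shares pos($s) with the outer m//g.
my @broken;
while ( $s =~ /X/g ) {
    push @broken, pos($s);
    last if @broken >= 5;          # safety net, otherwise this never ends
    my $found = $s =~ /Z/g;        # fails, which resets pos($s)
}

pos($s) = 0;                       # start over for the second attempt

# One way around it: run the inner match against a copy.
my $copy = $s;
my @fixed;
while ( $s =~ /X/g ) {
    push @fixed, pos($s);
    my $found = $copy =~ /Z/g;     # only touches pos($copy)
}

print "broken: @broken\n";    # broken: 2 2 2 2 2
print "fixed:  @fixed\n";     # fixed:  2 4 6
```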

As far as I can tell, what your current algorithm is trying to do is count the occurrences of repeated substrings immediately, each time you find one. This seems quite inefficient.

You've got a few other issues in your code. First, your first two examples produce "Useless use of hash element in void context" because you just have $+{repeat}; all on its own. Second, setting $x=int(1000*rand()) and then using $x as an index into @a is going to pick a ton of nonexistent array elements. Also, random strings are not usually a good idea for testing during development, since tests should be repeatable.
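If you do want random input at some point, here's a sketch of how I'd generate it. The contents of @a and the seed value are my assumptions, since I don't know what your array holds. rand @a stays within the array's bounds, and a fixed srand seed should give the same string on every run with the same perl.

```perl
use strict;
use warnings;

my @a = qw(A C G T);    # assumption: stand-in for your real @a

srand(42);              # arbitrary fixed seed, so runs are repeatable

# rand @a returns a number in [0, scalar @a); the array index truncates it.
my $s = join '', map { $a[ rand @a ] } 1 .. 20;

print "$s\n";
```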

Another issue that I see is that your current regex will consume not only the match (?<repeat>\w{3,}), but also all characters between that match and the repetition (\w*), as well as the repetition itself (\g{repeat}), so none of those later characters will be checked for repetitions of their own. This can be solved with zero-width Lookaround Assertions; however, you haven't specified what should happen if the sequences overlap and so on. That's why test cases are important.
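A quick sketch of the difference, on a made-up string where TTT repeats between the two copies of AGC. I'm using a numbered group and \g1 here only to keep it short; the consuming pattern swallows the whole string on its first match, while the lookahead version gets to look at the middle as well.

```perl
use strict;
use warnings;

my $s = 'AGCTTTGTTTCAGC';    # made-up sample string

# Consuming: the match eats everything up to and including the repetition.
my @consumed = $s =~ m{ (\w{3,}) \w* \g1 }xg;

# Lookahead: only the captured sequence is consumed, the rest is rescanned.
my @peeked = $s =~ m{ (\w{3,}) (?= \w* \g1 ) }xg;

print "consumed: @consumed\n";    # consumed: AGC
print "peeked:   @peeked\n";      # peeked:   AGC TTT
```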

Anyway, here's a starting point for what I think you might want. Note how I'm simply using a hash to count occurrences.

use warnings;
use strict;
use Test::More;

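sub count_reps {
    my $data = shift;
    my %seqs;    # sequence => how often it was found with a later repetition
    # The lookahead only checks for a later repetition without consuming it,
    # so the next //g iteration resumes right after the captured sequence.
    while ( $data =~ m{ (?<repeat>\w{3,}) (?= \w* \g{repeat} ) }xg ) {
        $seqs{ $+{repeat} }++;
    }
    return \%seqs;
}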


is_deeply count_reps('AGCAGC'),
    { AGC => 1 };

is_deeply count_reps('AATGCAATCGCAGCAGCA'),
    { AAT => 1, GCA => 3 };

is_deeply count_reps('AGCTACCCAGCTAGGGAGCTA'),
    { AGCTA => 2 };

done_testing;

Minor edits for clarity.
