rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Good day monkers!!

I'm trying to find runs of consecutive, repeated letters that are greater than 5 in length.

$sequ{$k} holds a string of uppercase letters.

With the regex that I used below, I expect my @match array to contain all the letters that are repeated more than 4 times.

However, this regex works perfectly well with some inputs, while with others it doesn't show me all the repeated letters, and in some cases it shows more repeated letters than the original contains.

Could someone please tell me what went wrong with this regex, or suggest an alternative way of finding duplicates?

if (@match=$sequ{$k}=~m/([A-Z])\g1{5,}/g)

Replies are listed 'Best First'.
Re: Backreference regex help
by Athanasius (Archbishop) on Nov 01, 2014 at 06:51 UTC

    Hello rocketperl,

    Your regex is pretty close to being correct. Just two tweaks needed:

    1. You need to count the capture ([A-Z]) in the total. So, to get 5 or more, you need m/([A-Z])\g1{4,}/.

    2. To see the repeated letters, you need another capture:

      #! perl
      use strict;
      use warnings;
      use constant MIN_MATCHES => 5;

      my %sequ = (
          1 => 'abcXXXXXef',
          2 => 'abcYYYYYYef',
          3 => 'abZZZZef',
      );
      my $match_at_least = MIN_MATCHES - 1;

      for my $k (1 .. 3) {
          if (my @matches = $sequ{$k} =~ m/(([A-Z])\g2{$match_at_least,})/) {
              print "Match: $matches[1] in $matches[0] \n";
          }
      }

      Output:

      16:47 >perl 1069_SoPW.pl
      Match: X in XXXXX
      Match: Y in YYYYYY

      16:47 >

      Note that capture groups are numbered according to the position of the left parenthesis, so the whole sequence of repeated characters is \g1 and the first character is \g2. I have also removed the /g modifier to keep things simple.
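The same nested-capture pattern can be kept together with /g to collect every run in a string. What follows is only a sketch: the sample string, the @runs array, and the hash keys are hypothetical and are not taken from the original post. It relies on m//g in scalar context resuming after the previous match, and on $-[1] holding the start offset of capture group 1.

```perl
use strict;
use warnings;

# Hypothetical sample input; not from the original post.
my $seq = 'ABBBBBCDDDDDDDEFFFFG';

my @runs;
# In scalar context, each /g match continues where the previous one ended.
while ($seq =~ m/(([A-Z])\g2{4,})/g) {
    my ($run, $letter) = ($1, $2);   # $1 = whole run, $2 = the repeated letter
    push @runs, { letter => $letter, run => $run, start => $-[1] };
}

# Report each run of 5 or more identical letters.
for my $r (@runs) {
    printf "%s x %d at offset %d\n", $r->{letter}, length($r->{run}), $r->{start};
}
```

With this input the loop reports "B x 5 at offset 1" and "D x 7 at offset 7". The run of four F characters is skipped because it falls below the minimum of five.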

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

      I can't thank you enough!!! Just what I wanted. You have saved me a lot of time. Thanks
Re: Backreference regex help
by Loops (Curate) on Nov 01, 2014 at 06:56 UTC

    Hi there,

    It would be most helpful if you gave examples of input that doesn't work. It's hard to know what you're seeing otherwise. That said, the match you're performing actually tests for a letter that occurs 6 or more times in a row. The initial group ([A-Z]) counts as one character, and the {5,} quantifier then requires five or more repeats of it, for a total of at least six.

    Try:

    if (@match = $sequ{$k} =~ m/([A-Z])\1{4,}/g)
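To make the counting concrete, here is a small sketch comparing the two quantifiers. The sample string and variable names are hypothetical. It also shows what a list-context match with /g returns: one element per capture group per match, so with a single group @match holds only single letters, and with two groups it holds run/letter pairs.

```perl
use strict;
use warnings;

# Hypothetical sample input: a run of five A's and a run of six B's.
my $seq = 'QAAAAAQBBBBBBQ';

# One capture group with /g in list context: one captured letter per match.
my @six_plus  = $seq =~ m/([A-Z])\1{5,}/g;   # 1 + 5 = runs of 6 or more
my @five_plus = $seq =~ m/([A-Z])\1{4,}/g;   # 1 + 4 = runs of 5 or more

# Two capture groups with /g: each match contributes (run, letter).
my @pairs = $seq =~ m/(([A-Z])\2{4,})/g;

print "6+: @six_plus\n";
print "5+: @five_plus\n";
print "pairs: @pairs\n";
```

This prints "6+: B", "5+: A B", and "pairs: AAAAA A BBBBBB B". The length of @pairs is twice the number of matches, which is worth keeping in mind when counting results.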