stylerr has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

Consider this text:

my $text = ' $sub24835->($sub24839->($sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24842->($sub24843->("0"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3")),$sub24849->($sub24850->("1"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3"))))) ';

I need to find all duplicated function calls, something like this:

1. $sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))

2. $sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))

Both 1. and 2. occur more than once in the text.

I tried to use this regex:

my @ttt = $text =~ /(\$sub\d+->\(.*\))?.*?\1/gs;

But it is not correct.

The main problem is that I want to extract THE WHOLE function call expression:

$sub111->(...) (it should contain BOTH opening AND closing parentheses).

See extracted examples above.

Thanks in advance.

Re: Help!!! How to find duplicates?
by jettero (Monsignor) on Dec 04, 2009 at 20:30 UTC
    Traditionally, regular expressions can't do this kind of balanced matching. Perl 5.10 and later have special support for it (recursive patterns), and even older perls have special (?{ code }) matchers that can do the counting for you. What you really want is a parser.

    Your best bet is probably Text::Balanced rather than a regex. Although, to be honest, I've done it both ways and prefer using (?{ code }) to Text::Balanced, which I find difficult to operate.

    -Paul
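A minimal sketch of the recursive-pattern route mentioned above, assuming Perl 5.10 or later and only core features. The sample text is a shortened stand-in for the one in the question, and the names `$call`, `%count` and `@dups` are illustrative, not from the thread. It also assumes the double-quoted strings contain no escaped quotes. Quoted strings are matched as single units so that a parenthesis inside `"( a1"` does not upset the nesting count.

```perl
use strict;
use warnings;
use 5.010;

# Shortened sample in the same shape as the question's text (an assumption
# made to keep the example readable).
my $text = '$sub1->($sub2->($sub3->("( a1"),$sub4->(" a1 ) ")),'
         . '$sub5->("0"),$sub2->($sub3->("( a1"),$sub4->(" a1 ) ")))';

# One call: $subNNN->( ... ), where ... is any mix of quoted strings,
# nested parenthesised groups and other characters.
my $call = qr{
    \$sub\d+ ->
    (?<parens>
        \(
        (?: "[^"]*"        # a double-quoted string, parentheses inside ignored
          | (?&parens)     # a nested, balanced group (recursion)
          | [^()"]         # any other single character
        )*
        \)
    )
}x;

# The lookahead lets matches overlap, so calls nested inside other calls
# are counted as well.
my %count;
while ($text =~ /(?=($call))/g) {
    $count{$1}++;
}

# Report every call seen more than once, longest first.
my @dups = sort { length($b) <=> length($a) || $a cmp $b }
           grep { $count{$_} > 1 } keys %count;
say "$count{$_} x $_" for @dups;
```

Because nested calls are counted too, the inner calls of a duplicated expression show up as duplicates themselves. If only the outermost duplicates are wanted, one option is to drop any entry that is a substring of a longer duplicate.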

      Thanks for your comments.

      Here is a solution:

      use Regexp::Common qw /balanced/;

      my $ttt = '$sub24835->($sub24839->( $sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")), $sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24842->($sub24843->("0"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3")),$sub24849->($sub24850->("1"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3")))))';

      my @ttt = $ttt =~ /(\$sub\d+->$RE{balanced}{-parens=>'()'}).*?\1/sg;
      print scalar @ttt, "\n";
      print join("\n", @ttt);
Re: Help!!! How to find duplicates?
by jacques (Priest) on Dec 05, 2009 at 00:52 UTC
    The problem here is in separating your function calls. I came up with a solution that split your text on commas but then I saw that certain function calls contain commas and that wouldn't work.

    Try separating each function call with something other than a comma, like a special character. Then you can split that string on that character and grep out the duplicates.

    Better yet, forgo putting the calls in a string and put them in an array and then grep that.
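A minimal sketch of the array-and-grep idea above, assuming the call expressions are already available one per array element. The `@calls` list is hypothetical sample data built from the expressions in the question, and `%seen` and `@dups` are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical list of call expressions, already separated one per element.
my @calls = (
    '$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))',
    '$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))',
    '$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))',
    '$sub24843->("0")',
    '$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))',
);

# Keep an element only at its second appearance, so each duplicated
# call is reported exactly once, however many times it repeats.
my %seen;
my @dups = grep { ++$seen{$_} == 2 } @calls;

print "$_\n" for @dups;
```

The hash counts exact string matches, so two calls that differ only in whitespace would be treated as different expressions.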

Re: Help!!! How to find duplicates?
by djp (Hermit) on Dec 05, 2009 at 10:20 UTC
    Regexp::Common will match up the parentheses for you via Regexp::Common::balanced.