Help!!! How to find duplicates?

stylerr has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

Consider this text:

my $text = '
$sub24835->($sub24839->($sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24842->($sub24843->("0"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3")),$sub24849->($sub24850->("1"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("| a3"),$sub24859->("a3 |")),$sub24858->("a3")))))
';

I need to find all duplicated function calls, something like this:

1. $sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))

2. $sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))

Both 1. and 2. occur more than once in the text.

I tried to use this regex:

my @ttt = $ttt =~ /(\$sub\d+->\(.*\))?.*?\1/gs;

But it is not correct.

The main problem is that I want to extract THE WHOLE function call expression:

$sub111->(...) (it should contain BOTH opening AND closing parentheses).

See extracted examples above.

Thanks in advance.


Replies are listed 'Best First'.
Re: Help!!! How to find duplicates? by jettero (Monsignor) on Dec 04, 2009 at 20:30 UTC
Traditionally, regular expressions can't do this kind of balanced matching. Perl 5.10 and later have special support for it (recursive patterns), and even older perls have (?{ code }) blocks that can do the counting for you. What you really want is a parser. Your best bet is probably Text::Balanced rather than a regex. Although, to be honest, I've done it both ways and prefer using (?{ code }) to Text::Balanced, which I find difficult to operate. -Paul
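For what it's worth, the Perl 5.10 support mentioned above is recursive patterns. Here is a minimal sketch of that approach, not a full parser. The sample string is a shortened, made-up one ($sub1 through $sub5 are hypothetical names), and the pattern assumes that string literals are double-quoted and never contain an escaped double quote. Quoted strings are skipped as units, because the data has unbalanced parentheses inside them.

```perl
use strict;
use warnings;
use 5.010;    # recursion and possessive quantifiers need Perl 5.10 or later

# Shortened, hypothetical sample in the same shape as the original text.
my $text = '$sub1->($sub2->($sub3->("( a1"),$sub4->(" a1 ) ")),'
         . '$sub5->("x"),$sub2->($sub3->("( a1"),$sub4->(" a1 ) ")))';

my $call = qr/
    (                     # group 1: the whole call
      \$sub\d+->
      (                   # group 2: one balanced parenthesised group
        \(
          (?: "[^"]*"     # a double-quoted string, skipped as a unit
            | [^()"]++    # anything that is not a paren or a quote
            | (?2)        # a nested group, matched by recursion
          )*
        \)
      )
    )
/x;

my %seen;
# The lookahead consumes nothing, so the scan restarts one character
# later each time and nested calls are counted as well as outer ones.
while ($text =~ /(?=$call)/g) {
    $seen{$1}++;
}

my @dups = sort { $a cmp $b } grep { $seen{$_} > 1 } keys %seen;
print "$_\n" for @dups;
```

Note that this reports every call that occurs more than once, including inner ones such as $sub3->("( a1"). If only the outermost duplicates are wanted, the list would need a further filtering step.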
Re^2: Help!!! How to find duplicates? by stylerr (Initiate) on Dec 07, 2009 at 16:27 UTC
Thanks for your comments. Here is the solution:

use Regexp::Common qw /balanced/;

my $ttt = '$sub24835->($sub24839->( $sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")), $sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24842->($sub24843->("0"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) ")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("\| a3"),$sub24859->("a3 \|")),$sub24858->("a3")),$sub24849->($sub24850->("1"),$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )")),$sub24832->($sub24855->("1"),$sub24856->($sub24857->("\| a3"),$sub24859->("a3 \|")),$sub24858->("a3")))))';

my @ttt = $ttt =~ /(\$sub\d+->$RE{balanced}{-parens=>'()'}).*?\1/sg;

print scalar @ttt, "\n";
print join("\n", @ttt);
Re: Help!!! How to find duplicates? by jacques (Priest) on Dec 05, 2009 at 00:52 UTC
The problem here is in separating your function calls. I came up with a solution that split your text on commas but then I saw that certain function calls contain commas and that wouldn't work. Try separating each function call with something other than a comma, like a special character. Then you can split that string on that character and grep out the duplicates. Better yet, forgo putting the calls in a string and put them in an array and then grep that.
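A minimal sketch of the array idea, assuming the calls can be produced one per element in the first place. The @calls list below is hypothetical, built from calls that appear in the original text. A counting hash plus grep then picks out the duplicates.

```perl
use strict;
use warnings;

# Hypothetical input: each call kept as its own array element
# instead of being embedded in one long string.
my @calls = (
    '$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))',
    '$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))',
    '$sub24828->($sub24840->("( a1"),$sub24841->(" a1 ) "))',
    '$sub24843->("0")',
    '$sub24830->($sub24853->("( a2 "),$sub24854->(" a2 )"))',
);

# Count how often each call occurs.
my %count;
$count{$_}++ for @calls;

# Keep each duplicated call once, in order of first appearance.
my %reported;
my @dups = grep { $count{$_} > 1 && !$reported{$_}++ } @calls;

print "$_\n" for @dups;
```

This only compares whole elements, so nested calls inside an element are not inspected.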
Re: Help!!! How to find duplicates? by djp (Hermit) on Dec 05, 2009 at 10:20 UTC
Regexp::Common will match up the parentheses for you via Regexp::Common::balanced.