Thanks for your reply. That was illuminating!

This is what I wound up with (passes all tests):

use Test::More;

my @test_data = (
    [
        'set 1',
        'SALMWN DE EGENNHSEN TON BOOZ EK THS RAXAB BOOZ DE EGENNHSEN TON WBHD EK THS ROUQ WBHD DE EGENNHSEN TON IESSAI',
        'SALMWN DE EGENNHSEN TON BOES EK THS RAXAB BOES DE EGENNHSEN TON IWBHD EK THS ROUQ IWBHD DE EGENNHSEN TON IESSAI',
        [
            'SALMWN DE EGENNHSEN TON ',
            'DE EGENNHSEN TON IESSAI ',
            'EK THS RAXAB ',
            'DE EGENNHSEN TON ',
            'EK THS ROUQ '
        ]
    ],
    [
        'set 2',
        'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM',
        'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM',
        [
            'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM '
        ]
    ],
    [
        'set 3',
        'PASAI OUN AI GENEAI APO ABRAAM EWS DABID GENEAI DEKATESSARES KAI APO DABID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES',
        'PASAI OUN AI GENEAI APO ABRAAM EWS DAUID GENEAI DEKATESSARES KAI APO DAUID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES',
        [
            'EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES ',
            'PASAI OUN AI GENEAI APO ABRAAM EWS ',
            'GENEAI DEKATESSARES KAI APO '
        ]
    ],
);

plan 'tests' => scalar @test_data;

foreach my $test (@test_data) {
    my $name   = $test->[0];
    my @input  = @{$test}[ 1, 2 ];
    my $wanted = $test->[3];
    my @result = all_new(@input);
    is_deeply( \@result, $wanted, $name );
}

sub all_new {
    my ( $str1, $str2 ) = @_;
    my @s1         = split( /\s+/, $str1 );
    my @s2         = split( /\s+/, $str2 );
    my @matrix     = ();
    my %substrings = ();
    my $id         = 0;

    for ( my $i = 0 ; $i <= $#s2 ; $i++ ) {
        for ( my $j = 0 ; $j <= $#s1 ; $j++ ) {
            if ( "$s1[$j]" eq "$s2[$i]" ) {
                if ( $i == 0 || $j == 0 ) {
                    $matrix[$i][$j] = 1;
                }
                else {
                    $matrix[$i][$j] = $matrix[ $i - 1 ][ $j - 1 ] + 1;
                    if ( $i == $#s2 || $j == $#s1 ) {
                        $substrings{$id}[0] = $j - $matrix[$i][$j] + 1;
                        $substrings{$id}[1] = $j;
                        $substrings{$id}[2] = $i - $matrix[$i][$j] + 1;
                        $substrings{$id}[3] = $i;
                        $id++;
                    }
                }
            }
            else {
                $matrix[$i][$j] = 0;
                if ( $i != 0 && $j != 0 && $matrix[ $i - 1 ][ $j - 1 ] != 0 ) {
                    $substrings{$id}[0] = $j - $matrix[ $i - 1 ][ $j - 1 ];
                    $substrings{$id}[1] = $j - 1;
                    $substrings{$id}[2] = $i - $matrix[ $i - 1 ][ $j - 1 ];
                    $substrings{$id}[3] = $i - 1;
                    $id++;
                }
            }
        }
    }

    my @substr_mat = ();
    my %map1       = ();
    my %map2       = ();

    foreach my $str (
        sort {
            ( $substrings{$b}[1] - $substrings{$b}[0] )
              <=> ( $substrings{$a}[1] - $substrings{$a}[0] )
              || $substrings{$a}[0] <=> $substrings{$b}[0]
        } keys %substrings
      )
    {
        my $substr_tmp1 = '';
        my $substr_tmp2 = '';
        foreach my $i ( $substrings{$str}[0] .. $substrings{$str}[1] ) {
            if ( !$map1{$i}++ ) {
                $substr_tmp1 .= "$s1[$i] ";
            }
        }
        next if !$substr_tmp1;
        foreach my $i ( $substrings{$str}[2] .. $substrings{$str}[3] ) {
            if ( !$map2{$i}++ ) {
                $substr_tmp2 .= "$s2[$i] ";
            }
        }
        next if !$substr_tmp2;
        push @substr_mat,
          ( length $substr_tmp1 <= length $substr_tmp2 )
          ? {
            str  => $substr_tmp1,
            wc   => ( $substrings{$str}[1] - $substrings{$str}[0] ),
            site => $substrings{$str}[0]
          }
          : {
            str  => $substr_tmp2,
            wc   => ( $substrings{$str}[3] - $substrings{$str}[2] ),
            site => $substrings{$str}[0]
          };
    }

    return map { $_->{str} }
      sort { $b->{wc} <=> $a->{wc} || $a->{site} <=> $b->{site} } @substr_mat;
}

This hasn't changed very much. In @substr_mat, instead of strings, I put hash refs. Each hash ref has in it the string, the word count, and the site where the string was found. The site is measured in words, so if you have "BLAHBLAH FOO" and "BLAH BAR", "FOO" and "BAR" are considered to be at the same "site".
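To make the "site" idea concrete, here is a minimal sketch. It is not part of the code above, and word_site is a hypothetical helper invented for illustration. It splits a string on whitespace, the same way all_new does, and reports the 0-based index of a word:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original code): return the 0-based
# word index of $word in $text, or -1 if it is not present as a word.
sub word_site {
    my ( $text, $word ) = @_;
    my @words = split /\s+/, $text;
    for my $i ( 0 .. $#words ) {
        return $i if $words[$i] eq $word;
    }
    return -1;
}

print word_site( 'BLAHBLAH FOO', 'FOO' ), "\n";    # prints 1
print word_site( 'BLAH BAR',     'BAR' ), "\n";    # prints 1
```

Both calls report index 1, even though "FOO" starts at a later character offset than "BAR".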

That data structure looks like this:

$VAR1 = [
    {
        'site' => 0,
        'str'  => 'SALMWN DE EGENNHSEN TON ',
        'wc'   => 3
    },
    {
        'site' => 17,
        'str'  => 'DE EGENNHSEN TON IESSAI ',
        'wc'   => 3
    },
    {
        'site' => 5,
        'str'  => 'EK THS RAXAB ',
        'wc'   => 2
    },
    {
        'site' => 9,
        'str'  => 'DE EGENNHSEN TON ',
        'wc'   => 2
    },
    {
        'site' => 13,
        'str'  => 'EK THS ROUQ ',
        'wc'   => 2
    }
];

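(I just now noticed my word count is off by one: it's computed as the last word index minus the first, so a four-word string gets a wc of 3. This isn't a problem for us, because every entry is off by the same amount and it will still sort correctly.)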

So, before returning, I sort by the word count, then the site, and finally pass it through a map to turn it into simple strings.
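Pulled out on its own, that last step looks roughly like this. It's a standalone sketch that uses the records from the dump above, shuffled on purpose so the sort has something to do; @ordered is just a name I picked for the result:

```perl
use strict;
use warnings;

# Records copied from the dump above, in a deliberately scrambled order.
my @substr_mat = (
    { str => 'EK THS ROUQ ',             wc => 2, site => 13 },
    { str => 'DE EGENNHSEN TON IESSAI ', wc => 3, site => 17 },
    { str => 'DE EGENNHSEN TON ',        wc => 2, site => 9 },
    { str => 'SALMWN DE EGENNHSEN TON ', wc => 3, site => 0 },
    { str => 'EK THS RAXAB ',            wc => 2, site => 5 },
);

# Longest first (wc descending); ties broken by earliest site (ascending).
# The map then throws away everything except the string itself.
my @ordered =
  map  { $_->{str} }
  sort { $b->{wc} <=> $a->{wc} || $a->{site} <=> $b->{site} } @substr_mat;

print "'$_'\n" for @ordered;
```

The || in the sort block only falls through to the site comparison when the two word counts are equal, because <=> returns 0 in that case.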

I get the impression that the sort at the top of the foreach is supposed to do this work, but I think it's getting confused by the stuff going on in the body of the loop.