weird subroutine behavior

flaviusm has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a problem with one subroutine and I would appreciate your help finding the problem and the solution for it.

the program is below:


use strict;
use warnings;

# set 1
my $st1 = 'SALMWN DE EGENNHSEN TON BOOZ EK THS RAXAB BOOZ DE EGENNHSEN
+ TON WBHD EK THS ROUQ WBHD DE EGENNHSEN TON IESSAI';
my $st2 = 'SALMWN DE EGENNHSEN TON BOES EK THS RAXAB BOES DE EGENNHSEN
+ TON IWBHD EK THS ROUQ IWBHD DE EGENNHSEN TON IESSAI';

# set 2
my $st3 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR
+ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM';
my $st4 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR
+ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM';

# section 1
my @test1 = all($st3, $st4);
foreach my $t1 (@test1){
    print "$t1\n";
}

# section 2
print "\n\n\n";
my @test2 = all($st1, $st2);
foreach my $t2 (@test2){
    print "$t2\n";
}

sub all {

    my ($str1, $str2) = @_;
    my @s1 = split(/\s+/, $str1);
    my @s2 = split(/\s+/, $str2);    
    my @matrix = ();
    my %substrings = ();
    my $id = 0;
    
    for(my $i=0; $i <= $#s2; $i++){
            for(my $j=0; $j <= $#s1; $j++){
                    if( "$s1[$j]" eq "$s2[$i]"){
                            if( $i == 0 || $j == 0){
                                    $matrix[$i][$j] = 1;
                            } else {
                                    $matrix[$i][$j] = $matrix[$i-1][$j
+-1] + 1;
                                    if( $i == $#s2 || $j == $#s1 ){
                                        $substrings{$id}[0]    = $j-$m
+atrix[$i][$j]+1;
                                        $substrings{$id}[1]    = $j;
                                        $substrings{$id}[2]    = $i-$m
+atrix[$i][$j]+1;
                                        $substrings{$id}[3]    = $i;
                                        $id++;
                                    }
                            }
                    } else {
                            $matrix[$i][$j] = 0;
                             if( $i != 0 && $j != 0 && $matrix[$i-1][$
+j-1] != 0){
                                       $substrings{$id}[0] = $j-$matri
+x[$i-1][$j-1];
                                       $substrings{$id}[1] = $j-1;
                                       $substrings{$id}[2] = $i-$matri
+x[$i-1][$j-1];
                                       $substrings{$id}[3] = $i-1;
                                       $id++;
                             }
                    }
            }
    }    
    my @substr_mat = ();
    my $substr_tmp1 = ();
    my $substr_tmp2 = ();
    my $start = ();
    my $end = ();    
    my %map1 = ();
    my %map2 = ();
    my $m = 0;
    
    foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=>
+ ($substrings{$a}[1]-$substrings{$a}[0]) || $substrings{$a}[0] <=> $s
+ubstrings{$b}[0]} keys %substrings){            
        $substr_tmp1 = '';
        $substr_tmp2 = '';
        
        $start = -1; $end = -1;
        for( my $i = $substrings{$str}[0]; $i <= $substrings{$str}[1];
+ $i++){
            if( ! exists $map1{$i}){
                   $map1{$i} = 1;
                $substr_tmp1 .= "$s1[$i] ";
                if($start == -1){
                    $start = $end = $i;
                } else {
                    $end = $i;
                }
            }
           }
        next if $start == -1;
        
        $start = -1; $end = -1;
        for( my $i = $substrings{$str}[2]; $i <= $substrings{$str}[3];
+ $i++){
            if( ! exists $map2{$i}){
                   $map2{$i} = 1;
                $substr_tmp2 .= "$s2[$i] ";
                if($start == -1){
                    $start = $end = $i;
                } else {
                    $end = $i;
                }
            }
           }
        next if $start == -1;
        
        if( length($substr_tmp1) <= length($substr_tmp2) ){
            $substr_mat[$m++] = $substr_tmp1;
        } else {
                $substr_mat[$m++] = $substr_tmp2;
        }        
    }
    return @substr_mat;    
}
[download]

Desired functionality:

- The subroutine "all" should work "properly" on any set of data input, independent of the previous data input

Problem description:

If I comment out "set 1" and "section 1", the output of the "section 2" will contain 5 strings (I consider this the proper behavior of the subroutine "all") , but if I remove the comments for "set 1" and "section 1", the output of "section 2" will be changed by removing one of the 5 strings.

So, it seems to me that there is an interference between running "set 1" and "section 1" and running "set 2" and "section 2". I would like to find and remove this interference.

NOTE: If I change "$st3" and "$st4" with some ordinary strings, this interference will disappear.

Please let me know if I can provide any other details

Thank you.

Comment on weird subroutine behavior Download Code

Replies are listed 'Best First'.
Re: weird subroutine behavior by kyle (Abbot) on Apr 02, 2008 at 14:47 UTC
I'm not sure this is a solution, but it's at least part of the problem. See here: `foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> + ($substrings{$a}[1]-$substrings{$a}[0]) \|\| $substrings{$a}[0] <=> $s +ubstrings{$b}[0]} keys %substrings){` [download] The conditions you give to sort do not differentiate every element. If I go to the end of that block and add "`\|\| warn 'identical item'`", this will tell me if there's some pair in the list that the block can't distinguish. Running that, I see that `sort` encounters such pairs several times. I don't fully understand everything you're doing here, but I would guess that the elements that the condition sees as identical aren't really identical. So what I think is happening is that on different runs, that `sort` is sorting differently. I'm not sure why running it once or twice makes a difference, but I suspect it's just `perl`'s internals. For example, maybe keys is giving you the same things in different order, and `sort` preserves that order. Anyway, if I add "`\|\| $b cmp $a`" in there, it seems to act the way you want. Since you're sorting `keys`, and they're unique, you can be pretty sure this will offer consistent behavior. The new full line is: `foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> + ($substrings{$a}[1]-$substrings{$a}[0]) \|\| $substrings{$a}[0] <=> $s +ubstrings{$b}[0] \|\| $b cmp $a } keys %substrings){` [download]	[reply] [d/l] [select]
Re^2: weird subroutine behavior by flaviusm (Acolyte) on Apr 02, 2008 at 16:21 UTC
Thanks a lot. If I change the "sort" to the proposed version, the subroutine works as expected Since, all of you were thinking about what I was trying to accomplish with this "sort", this is what I tried to do: having: $substrings[0] $substrings[1] $substrings[2] $substrings[3] I wanted to sort by two keys: 1. Descending - by the difference of $substrings[1] and $substrings[0] 2. Ascending - by $substrings[0] Please let me know if you can lead me to a better and safer version of "sort".	[reply]
Re: weird subroutine behavior by BrowserUk (Patriarch) on Apr 02, 2008 at 14:48 UTC
I get identical results from your code regardless of whether I comment out one or the other sets/sections?: c:\test>junk9 ## Both sets uncommented SALMWN DE EGENNHSEN TON DE EGENNHSEN TON IESSAI EK THS RAXAB DE EGENNHSEN TON EK THS ROUQ --------- IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENN +HSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM c:\test>junk9 ## Set one/section one +commented --------- IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENN +HSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM c:\test>junk9 ## Set two/section two +commented SALMWN DE EGENNHSEN TON DE EGENNHSEN TON IESSAI EK THS RAXAB DE EGENNHSEN TON EK THS ROUQ --------- [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re: weird subroutine behavior by derby (Abbot) on Apr 02, 2008 at 15:24 UTC
I cannot even begin to fully fathom what your attempting but it appears your algorithm has some type of odd dependency on the order of the keys. If I change, ~~`foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> ($s +ubstrings{$a}[1]-$substrings{$a}[0]) \|\| $substrings{$a}[0] < => $substrings{$b}[0]} keys %substrings){`~~ ~~[download]~~ to `foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> ($s +ubstrings{$a}[1]-$substrings{$a}[0]) \|\| $substrings{$a}[0] < => $substrings{$b}[0]} sort keys %substrings){` [download] Then the runs are identical. -derby Update: Yikes! scratch all of that except, I cannot even begin to fully fathom. There is something wrong with the sort (as pointed out by kyle).	[reply] [d/l] [select]
Re^2: weird subroutine behavior by flaviusm (Acolyte) on Apr 02, 2008 at 16:34 UTC
I added the extra "sort" in front of "keys" but there was not change in behavior: the subroutine didn't work as expected. I don't know if there is something related with my perl version (v5.8.7 built for cygwin-thread-multi-64int) but I will try it on a different computer, too.	[reply]
Re: weird subroutine behavior by thundergnat (Deacon) on Apr 02, 2008 at 15:50 UTC
Your logic is hard to follow, and I'm not certain exactly what you are trying to do with this. It looks like you are trying to find all of the common sub sequences of words (sorted by length?). You may be better off using a module to help with that. Algorithm::Diff seems a likely candidate. Something like this perhaps. use strict; use warnings; use Algorithm::Diff qw/LCS LCSidx/; # set 1 my $st1 = 'SALMWN DE EGENNHSEN TON BOOZ EK THS RAXAB BOOZ DE EGENNHSEN + TON WBHD EK THS ROUQ WBHD DE EGENNHSEN TON IESSAI'; my $st2 = 'SALMWN DE EGENNHSEN TON BOES EK THS RAXAB BOES DE EGENNHSEN + TON IWBHD EK THS ROUQ IWBHD DE EGENNHSEN TON IESSAI'; # set 2 my $st3 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR +ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM'; my $st4 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR +ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM'; print join "\n", all ($st3, $st4); print "\n\n\n"; print join "\n", all ($st1, $st2); sub all{ my @seq1 = split / +/, $_[0]; my @seq2 = split / +/, $_[1]; my $idx = LCSidx( \@seq1, \@seq2 ); my $index = 0; my $result; for (@$idx){ if ($_ == $index){ $result .= $seq1[$_].' '; $index++; }else{ $result .= "\n"; $index = $_ + 1; } } chop $result; return split / ?\n+/, $result; #in order seen #or #return sort {length $b <=> length $a} split / ?\n+/, $result; #so +rted by length } [download] Or, more succinctly... `sub all{ my @seq1 = split / +/, $_[0]; my @seq2 = split / +/, $_[1]; my $idx = LCSidx( \@seq1, \@seq2 ); my @result = (' ') x @seq1; @result[@$idx] = @seq1[@$idx]; return split / {2,}/, join ' ', @result; #in order seen #or return sort {length $b <=> length $a} split / {2,}/, join ' ', @re +sult; #sorted by length }` [download]	[reply] [d/l] [select]
Re^2: weird subroutine behavior by flaviusm (Acolyte) on Apr 02, 2008 at 16:10 UTC
I have tried to use this module along with others (e.g. lcss_xs ...) but since my problem required multiple customizations, I ended up writing the code myself. Thank you for the idea.	[reply]
Re: weird subroutine behavior by jdporter (Paladin) on Apr 03, 2008 at 03:07 UTC
It is truly a weird and difficult problem. Lacking any deep expertise in this area, I'm inclined to suppose that you're hitting a strange perl bug related to pad handling. Anyway... ... output of the "section 2" will contain 5 strings (I consider this the proper behavior of the subroutine "all") Are you absolutely sure? Let me make a suggestion. I notice that you're not really using `%substrings` in a way suited to a hash. You add values sequentially, and then iterate over the (sorted) values. You never use the keys at all. Try replacing that hash with an array, pushing the values (those 4-tuples) onto it, and sorting that array of values. When I modified your program in this way, I got consistent behavior — but it was consistently the way other than what you said you consider the correct way. A word spoken in Mind will reach its own level, in the objective world, by its own weight	[reply] [d/l]
Re^2: weird subroutine behavior by flaviusm (Acolyte) on Apr 03, 2008 at 22:03 UTC
Good news. I replaced the hash with an array and the results were at least consistent. I analyzed the output and I found a flaw in my algorithm. I fixed the algorithm and the program works fine now. However, the weird behavior remains, ... maybe as a meditation. Thanks a lot.	[reply]
Re^3: weird subroutine behavior by jdporter (Paladin) on Apr 03, 2008 at 23:51 UTC
Excellent! Would you mind posting your updated code? However, the weird behavior remains, I assume you mean it remains in the original code, but not in your new version, right?	[reply]
Re^4: weird subroutine behavior by flaviusm (Acolyte) on Apr 04, 2008 at 14:54 UTC
Re^2: weird subroutine behavior by flaviusm (Acolyte) on Apr 03, 2008 at 16:16 UTC
I will do hash->array change tonight and I will let you know how it goes. Thanks.	[reply]
reopenedRe: weird subroutine behavior by flaviusm (Acolyte) on Apr 02, 2008 at 23:54 UTC
I ran the solution proposed by kyle against my corpus of documents and I found another exception. If you run only "section 1" with the below data input, the solution proposed by kyle will introduce the improper behavior of the subroutine, while the original version will work ok. `my $st5 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DABID GENEAI DEKATESSARE +S KAI APO DABID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI + APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES'; my $st6 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DAUID GENEAI DEKATESSARE +S KAI APO DAUID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI + APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';` [download] Note: add and remove the "\|\| $b cmp $a" from the end of "sort" to see the different behavior. I really appreciate your time and efforts to help me find a solution for this weird and difficult problem.	[reply] [d/l]
Re: reopenedRe: weird subroutine behavior by kyle (Abbot) on Apr 03, 2008 at 15:37 UTC
I looked at this some more last night, but I never came up with a solution. I did come up with a test framework to make it easier to try things, so I'll post that. Read more... (3 kB) I also refactored a little. Read more... (4 kB) At this point, I suspect that whatever algorithm you're using just isn't doing what you expect it to do. Since I still can't tell what it's supposed to be doing, I can't be sure. I very much suggest practicing naming your variables. `%substrings` never contains any strings—keys or values. `@matrix` is not very descriptive. I never did figure out what `%map1` and `%map2` were really supposed to do. I wish I could offer more help here. Good luck with your problem.	[reply] [d/l] [select]
Re^2: reopenedRe: weird subroutine behavior by flaviusm (Acolyte) on Apr 03, 2008 at 16:07 UTC
Thanks a lot for your help. I will try to explain below what I am trying to do in the subroutine "all". Purpose: - to find ALL common word groups and return them in the reverse order of the word count, if two groups have the same word count, the group showing up earlier in the string it is preferred. Algorithm: - I am using the dynamic algorithm for LCSS modified to return all substrings but remove the strings that overlap (the returned strings (common) + the new strings (not selected) = original sentence) - the matrix keeps track of occurrences - you can find here http://en.wikipedia.org/wiki/Longest_common_substring_problem an explanation of the dynamic algorithm. - while I am completing the matrix of occurrences I record in the hash %substrings the start and end index of the detected common substrings. This will allow me later on to eliminate the overlapping substrings. - so, the first two values of substrings are for occurrences in the first string, the last two for occurrences in the second string. - now, it comes the foreach statement: * map1 and map2 are mask arrays for the substrings * if I got one substring from indexes 4-10 and then I get another substring between 2-5 (overlapping on the longer one), then I have to adjust the last one to 2-3. If it is going to become blank, I simply reject it. * reason for "sort": I have to start from substrings with the biggest word count in order to make sure I do not consider a substring which is contained in a bigger string. I am not sure I was very clear, but please let me know what I can detail more and explain better.	[reply]
Re^3: reopenedRe: weird subroutine behavior by kyle (Abbot) on Apr 03, 2008 at 19:24 UTC