Re: weird subroutine behavior
by kyle (Abbot) on Apr 02, 2008 at 14:47 UTC
|
I'm not sure this is a solution, but it's at least part of the problem. See here:
foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=>
+ ($substrings{$a}[1]-$substrings{$a}[0]) || $substrings{$a}[0] <=> $s
+ubstrings{$b}[0]} keys %substrings){
The conditions you give to sort do not differentiate every element. If I go to the end of that block and add "|| warn 'identical item'", this will tell me if there's some pair in the list that the block can't distinguish. Running that, I see that sort encounters such pairs several times. I don't fully understand everything you're doing here, but I would guess that the elements that the condition sees as identical aren't really identical.
So what I think is happening is that on different runs, that sort is sorting differently. I'm not sure why running it once or twice makes a difference, but I suspect it's just perl's internals. For example, maybe keys is giving you the same things in different order, and sort preserves that order.
Anyway, if I add "|| $b cmp $a" in there, it seems to act the way you want. Since you're sorting keys, and they're unique, you can be pretty sure this will offer consistent behavior. The new full line is:
foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=>
+ ($substrings{$a}[1]-$substrings{$a}[0]) || $substrings{$a}[0] <=> $s
+ubstrings{$b}[0] || $b cmp $a } keys %substrings){
| [reply] [d/l] [select] |
|
|
Thanks a lot. If I change the "sort" to the proposed version, the subroutine works as expected
Since, all of you were thinking about what I was trying to accomplish with this "sort", this is what I tried to do:
having:
$substrings[0]
$substrings[1]
$substrings[2]
$substrings[3]
I wanted to sort by two keys:
1. Descending - by the difference of $substrings[1] and $substrings[0]
2. Ascending - by $substrings[0]
Please let me know if you can lead me to a better and safer version of "sort".
| [reply] |
Re: weird subroutine behavior
by BrowserUk (Patriarch) on Apr 02, 2008 at 14:48 UTC
|
I get identical results from your code regardless of whether I comment out one or the other sets/sections?:
c:\test>junk9 ## Both sets uncommented
SALMWN DE EGENNHSEN TON
DE EGENNHSEN TON IESSAI
EK THS RAXAB
DE EGENNHSEN TON
EK THS ROUQ
---------
IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENN
+HSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM
c:\test>junk9 ## Set one/section one
+commented
---------
IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FARES DE EGENN
+HSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM
c:\test>junk9 ## Set two/section two
+commented
SALMWN DE EGENNHSEN TON
DE EGENNHSEN TON IESSAI
EK THS RAXAB
DE EGENNHSEN TON
EK THS ROUQ
---------
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: weird subroutine behavior
by derby (Abbot) on Apr 02, 2008 at 15:24 UTC
|
I cannot even begin to fully fathom what your attempting but it appears your algorithm has some type of odd dependency on the order of the keys. If I change,
foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> ($s
+ubstrings{$a}[1]-$substrings{$a}[0]) || $substrings{$a}[0] <
=> $substrings{$b}[0]} keys %substrings){
to
foreach my $str (sort {($substrings{$b}[1]-$substrings{$b}[0]) <=> ($s
+ubstrings{$a}[1]-$substrings{$a}[0]) || $substrings{$a}[0] <
=> $substrings{$b}[0]} sort keys %substrings){
Then the runs are identical.
Update: Yikes! scratch all of that except, I cannot even begin to fully fathom. There is something wrong with the sort (as pointed out by kyle).
| [reply] [d/l] [select] |
|
|
I added the extra "sort" in front of "keys" but there was not change in behavior: the subroutine didn't work as expected.
I don't know if there is something related with my perl version (v5.8.7 built for cygwin-thread-multi-64int) but I will try it on a different computer, too.
| [reply] |
Re: weird subroutine behavior
by thundergnat (Deacon) on Apr 02, 2008 at 15:50 UTC
|
Your logic is hard to follow, and I'm not certain exactly what you are trying to do with this. It looks like you are trying to find all of the common sub sequences of words (sorted by length?). You may be better off using a module to help with that. Algorithm::Diff seems a likely candidate.
Something like this perhaps.
use strict;
use warnings;
use Algorithm::Diff qw/LCS LCSidx/;
# set 1
my $st1 = 'SALMWN DE EGENNHSEN TON BOOZ EK THS RAXAB BOOZ DE EGENNHSEN
+ TON WBHD EK THS ROUQ WBHD DE EGENNHSEN TON IESSAI';
my $st2 = 'SALMWN DE EGENNHSEN TON BOES EK THS RAXAB BOES DE EGENNHSEN
+ TON IWBHD EK THS ROUQ IWBHD DE EGENNHSEN TON IESSAI';
# set 2
my $st3 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR
+ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM';
my $st4 = 'IOUDAS DE EGENNHSEN TON FARES KAI TON ZARA EK THS QAMAR FAR
+ES DE EGENNHSEN TON ESRWM ESRWM DE EGENNHSEN TON ARAM';
print join "\n", all ($st3, $st4);
print "\n\n\n";
print join "\n", all ($st1, $st2);
sub all{
my @seq1 = split / +/, $_[0];
my @seq2 = split / +/, $_[1];
my $idx = LCSidx( \@seq1, \@seq2 );
my $index = 0;
my $result;
for (@$idx){
if ($_ == $index){
$result .= $seq1[$_].' ';
$index++;
}else{
$result .= "\n";
$index = $_ + 1;
}
}
chop $result;
return split / ?\n+/, $result; #in order seen
#or
#return sort {length $b <=> length $a} split / ?\n+/, $result; #so
+rted by length
}
Or, more succinctly...
sub all{
my @seq1 = split / +/, $_[0];
my @seq2 = split / +/, $_[1];
my $idx = LCSidx( \@seq1, \@seq2 );
my @result = (' ') x @seq1;
@result[@$idx] = @seq1[@$idx];
return split / {2,}/, join ' ', @result; #in order seen
#or
return sort {length $b <=> length $a} split / {2,}/, join ' ', @re
+sult; #sorted by length
}
| [reply] [d/l] [select] |
|
|
| [reply] |
Re: weird subroutine behavior
by jdporter (Paladin) on Apr 03, 2008 at 03:07 UTC
|
It is truly a weird and difficult problem. Lacking any deep expertise in this area,
I'm inclined to suppose that you're hitting a strange perl bug related to pad handling.
Anyway...
... output of the "section 2" will contain
5 strings (I consider this the proper behavior of the subroutine "all")
Are you absolutely sure?
Let me make a suggestion. I notice that you're not really using %substrings in
a way suited to a hash. You add values sequentially, and then iterate over the (sorted) values.
You never use the keys at all.
Try replacing that hash with an array, pushing the values (those 4-tuples) onto it,
and sorting that array of values.
When I modified your program in this way, I got consistent behavior — but it
was consistently the way other than what you said you consider the correct way.
A word spoken in Mind will reach its own level, in the objective world, by its own wei ght
| [reply] [d/l] |
|
|
Good news. I replaced the hash with an array and the results were at least consistent. I analyzed the output and I found a flaw in my algorithm. I fixed the algorithm and the program works fine now.
However, the weird behavior remains, ... maybe as a meditation.
Thanks a lot.
| [reply] |
|
|
Excellent! Would you mind posting your updated code?
However, the weird behavior remains,
I assume you mean it remains in the original code, but not in your new version, right?
| [reply] |
|
|
|
|
I will do hash->array change tonight and I will let you know how it goes. Thanks.
| [reply] |
**reopened**Re: weird subroutine behavior
by flaviusm (Acolyte) on Apr 02, 2008 at 23:54 UTC
|
I ran the solution proposed by kyle against my corpus of documents and I found another exception.
If you run only "section 1" with the below data input, the solution proposed by kyle will introduce the improper behavior of the subroutine, while the original version will work ok.
my $st5 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DABID GENEAI DEKATESSARE
+S KAI APO DABID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI
+ APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';
my $st6 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DAUID GENEAI DEKATESSARE
+S KAI APO DAUID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI
+ APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';
Note: add and remove the "|| $b cmp $a" from the end of "sort" to see the different behavior.
I really appreciate your time and efforts to help me find a solution for this weird and difficult problem.
| [reply] [d/l] |
|
|
I looked at this some more last night, but I never came up with a solution. I did come up with a test framework to make it easier to try things, so I'll post that.
I also refactored a little.
At this point, I suspect that whatever algorithm you're using just isn't doing what you expect it to do. Since I still can't tell what it's supposed to be doing, I can't be sure.
I very much suggest practicing naming your variables. %substrings never contains any strings—keys or values. @matrix is not very descriptive. I never did figure out what %map1 and %map2 were really supposed to do.
I wish I could offer more help here. Good luck with your problem.
| [reply] [d/l] [select] |
|
|
Thanks a lot for your help. I will try to explain below what I am trying to do in the subroutine "all".
Purpose:
- to find ALL common word groups and return them in the reverse order of the word count, if two groups have the same word count, the group showing up earlier in the string it is preferred.
Algorithm:
- I am using the dynamic algorithm for LCSS modified to return all substrings but remove the strings that overlap (the returned strings (common) + the new strings (not selected) = original sentence)
- the matrix keeps track of occurrences - you can find here http://en.wikipedia.org/wiki/Longest_common_substring_problem an explanation of the dynamic algorithm.
- while I am completing the matrix of occurrences I record in the hash %substrings the start and end index of the detected common substrings. This will allow me later on to eliminate the overlapping substrings.
- so, the first two values of substrings are for occurrences in the first string, the last two for occurrences in the second string.
- now, it comes the foreach statement:
* map1 and map2 are mask arrays for the substrings
* if I got one substring from indexes 4-10 and then I get another substring between 2-5 (overlapping on the longer one), then I have to adjust the last one to 2-3. If it is going to become blank, I simply reject it.
* reason for "sort": I have to start from substrings with the biggest word count in order to make sure I do not consider a substring which is contained in a bigger string.
I am not sure I was very clear, but please let me know what I can detail more and explain better.
| [reply] |
|
|