**reopened**Re: weird subroutine behavior

I ran the solution proposed by kyle against my corpus of documents and I found another exception.

If you run only "section 1" with the below data input, the solution proposed by kyle will introduce the improper behavior of the subroutine, while the original version will work ok.

my $st5 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DABID GENEAI DEKATESSARE
+S KAI APO DABID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI
+ APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';
my $st6 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DAUID GENEAI DEKATESSARE
+S KAI APO DAUID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI
+ APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';
[download]

Note: add and remove the "|| $b cmp $a" from the end of "sort" to see the different behavior.

I really appreciate your time and efforts to help me find a solution for this weird and difficult problem.

Comment on reopenedRe: weird subroutine behavior Download Code

Replies are listed 'Best First'.
Re: reopenedRe: weird subroutine behavior by kyle (Abbot) on Apr 03, 2008 at 15:37 UTC
I looked at this some more last night, but I never came up with a solution. I did come up with a test framework to make it easier to try things, so I'll post that. Read more... (3 kB) I also refactored a little. Read more... (4 kB) At this point, I suspect that whatever algorithm you're using just isn't doing what you expect it to do. Since I still can't tell what it's supposed to be doing, I can't be sure. I very much suggest practicing naming your variables. `%substrings` never contains any strings—keys or values. `@matrix` is not very descriptive. I never did figure out what `%map1` and `%map2` were really supposed to do. I wish I could offer more help here. Good luck with your problem.	[reply] [d/l] [select]
Re^2: reopenedRe: weird subroutine behavior by flaviusm (Acolyte) on Apr 03, 2008 at 16:07 UTC
Thanks a lot for your help. I will try to explain below what I am trying to do in the subroutine "all". Purpose: - to find ALL common word groups and return them in the reverse order of the word count, if two groups have the same word count, the group showing up earlier in the string it is preferred. Algorithm: - I am using the dynamic algorithm for LCSS modified to return all substrings but remove the strings that overlap (the returned strings (common) + the new strings (not selected) = original sentence) - the matrix keeps track of occurrences - you can find here http://en.wikipedia.org/wiki/Longest_common_substring_problem an explanation of the dynamic algorithm. - while I am completing the matrix of occurrences I record in the hash %substrings the start and end index of the detected common substrings. This will allow me later on to eliminate the overlapping substrings. - so, the first two values of substrings are for occurrences in the first string, the last two for occurrences in the second string. - now, it comes the foreach statement: * map1 and map2 are mask arrays for the substrings * if I got one substring from indexes 4-10 and then I get another substring between 2-5 (overlapping on the longer one), then I have to adjust the last one to 2-3. If it is going to become blank, I simply reject it. * reason for "sort": I have to start from substrings with the biggest word count in order to make sure I do not consider a substring which is contained in a bigger string. I am not sure I was very clear, but please let me know what I can detail more and explain better.	[reply]
Re^3: reopenedRe: weird subroutine behavior by kyle (Abbot) on Apr 03, 2008 at 19:24 UTC
Thanks for your reply. That was illuminating! This is what I wound up with (passes all tests): Read more... (6 kB) This hasn't changed very much. In `@substr_mat`, instead of strings, I put hash refs. Each hash ref has in it the string, the word count, and the site where the string was found. The site is measured in words, so if you have "BLAHBLAH FOO" and "BLAH BAR", "FOO" and "BAR" are considered to be at the same "site". That data structure looks like this: Read more... (1016 Bytes) (I just now noticed my word count is off by one. This isn't a problem for us because it will still sort correctly.) So, before returning, I sort by the word count, then the site, and finally pass it through a map to turn it into simple strings. I get the impression that the `sort` at the top of the `foreach` is supposed to do this work, but I think it's getting confused by the stuff going on in the body of the loop.	[reply] [d/l] [select]