in reply to weird subroutine behavior

I ran the solution proposed by kyle against my corpus of documents and I found another exception.

If you run only "section 1" with the below data input, the solution proposed by kyle will introduce the improper behavior of the subroutine, while the original version will work ok.

my $st5 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DABID GENEAI DEKATESSARE +S KAI APO DABID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI + APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES'; my $st6 = 'PASAI OUN AI GENEAI APO ABRAAM EWS DAUID GENEAI DEKATESSARE +S KAI APO DAUID EWS THS METOIKESIAS BABULWNOS GENEAI DEKATESSARES KAI + APO THS METOIKESIAS BABULWNOS EWS TOU XRISTOU GENEAI DEKATESSARES';

Note: add and remove the "|| $b cmp $a" from the end of "sort" to see the different behavior.

I really appreciate your time and efforts to help me find a solution for this weird and difficult problem.

Replies are listed 'Best First'.
Re: **reopened**Re: weird subroutine behavior
by kyle (Abbot) on Apr 03, 2008 at 15:37 UTC

    I looked at this some more last night, but I never came up with a solution. I did come up with a test framework to make it easier to try things, so I'll post that.

    I also refactored a little.

    At this point, I suspect that whatever algorithm you're using just isn't doing what you expect it to do. Since I still can't tell what it's supposed to be doing, I can't be sure.

    I very much suggest practicing naming your variables. %substrings never contains any strings—keys or values. @matrix is not very descriptive. I never did figure out what %map1 and %map2 were really supposed to do.

    I wish I could offer more help here. Good luck with your problem.

      Thanks a lot for your help. I will try to explain below what I am trying to do in the subroutine "all".

      Purpose:
      - to find ALL common word groups and return them in the reverse order of the word count, if two groups have the same word count, the group showing up earlier in the string it is preferred.

      Algorithm:
      - I am using the dynamic algorithm for LCSS modified to return all substrings but remove the strings that overlap (the returned strings (common) + the new strings (not selected) = original sentence)
      - the matrix keeps track of occurrences - you can find here http://en.wikipedia.org/wiki/Longest_common_substring_problem an explanation of the dynamic algorithm.
      - while I am completing the matrix of occurrences I record in the hash %substrings the start and end index of the detected common substrings. This will allow me later on to eliminate the overlapping substrings.
      - so, the first two values of substrings are for occurrences in the first string, the last two for occurrences in the second string.
      - now, it comes the foreach statement:
      * map1 and map2 are mask arrays for the substrings
      * if I got one substring from indexes 4-10 and then I get another substring between 2-5 (overlapping on the longer one), then I have to adjust the last one to 2-3. If it is going to become blank, I simply reject it.
      * reason for "sort": I have to start from substrings with the biggest word count in order to make sure I do not consider a substring which is contained in a bigger string.

      I am not sure I was very clear, but please let me know what I can detail more and explain better.

        Thanks for your reply. That was illuminating!

        This is what I wound up with (passes all tests):

        This hasn't changed very much. In @substr_mat, instead of strings, I put hash refs. Each hash ref has in it the string, the word count, and the site where the string was found. The site is measured in words, so if you have "BLAHBLAH FOO" and "BLAH BAR", "FOO" and "BAR" are considered to be at the same "site".

        That data structure looks like this:

        (I just now noticed my word count is off by one. This isn't a problem for us because it will still sort correctly.)

        So, before returning, I sort by the word count, then the site, and finally pass it through a map to turn it into simple strings.

        I get the impression that the sort at the top of the foreach is supposed to do this work, but I think it's getting confused by the stuff going on in the body of the loop.