in reply to Re: **reopened**Re: weird subroutine behavior
in thread weird subroutine behavior

Thanks a lot for your help. I will try to explain below what I am trying to do in the subroutine "all".

Purpose:
- to find ALL common word groups and return them in the reverse order of the word count, if two groups have the same word count, the group showing up earlier in the string it is preferred.

Algorithm:
- I am using the dynamic algorithm for LCSS modified to return all substrings but remove the strings that overlap (the returned strings (common) + the new strings (not selected) = original sentence)
- the matrix keeps track of occurrences - you can find here http://en.wikipedia.org/wiki/Longest_common_substring_problem an explanation of the dynamic algorithm.
- while I am completing the matrix of occurrences I record in the hash %substrings the start and end index of the detected common substrings. This will allow me later on to eliminate the overlapping substrings.
- so, the first two values of substrings are for occurrences in the first string, the last two for occurrences in the second string.
- now, it comes the foreach statement:
* map1 and map2 are mask arrays for the substrings
* if I got one substring from indexes 4-10 and then I get another substring between 2-5 (overlapping on the longer one), then I have to adjust the last one to 2-3. If it is going to become blank, I simply reject it.
* reason for "sort": I have to start from substrings with the biggest word count in order to make sure I do not consider a substring which is contained in a bigger string.

I am not sure I was very clear, but please let me know what I can detail more and explain better.

  • Comment on Re^2: **reopened**Re: weird subroutine behavior

Replies are listed 'Best First'.
Re^3: **reopened**Re: weird subroutine behavior
by kyle (Abbot) on Apr 03, 2008 at 19:24 UTC

    Thanks for your reply. That was illuminating!

    This is what I wound up with (passes all tests):

    This hasn't changed very much. In @substr_mat, instead of strings, I put hash refs. Each hash ref has in it the string, the word count, and the site where the string was found. The site is measured in words, so if you have "BLAHBLAH FOO" and "BLAH BAR", "FOO" and "BAR" are considered to be at the same "site".

    That data structure looks like this:

    (I just now noticed my word count is off by one. This isn't a problem for us because it will still sort correctly.)

    So, before returning, I sort by the word count, then the site, and finally pass it through a map to turn it into simple strings.

    I get the impression that the sort at the top of the foreach is supposed to do this work, but I think it's getting confused by the stuff going on in the body of the loop.