in reply to Re: Re: Optimizing a string processing sub
in thread Optimizing a string processing sub

You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5.

Sorry for being vague about the requirements... They really are vague ! Maybe examples can help:

aabcc, abbbc -> 3 (once a, once b and once c, that's it)
stress, super -> 3 (once s, e and r)
abcde, caebd -> 5 (permutation)
abcde, caebdxxy -> 5 (doesn't change things)
  • Comment on Re: Re: Re: Optimizing a string processing sub

Replies are listed 'Best First'.
Re: Re: Re: Re: Optimizing a string processing sub
by MarkM (Curate) on Jan 09, 2003 at 06:20 UTC

    I put some more thought into the exact trouble I have with dragonchild's second solution. It works in all cases where each unique character in word1 shows up fewer times or an equal numbers of times as they do in word2 (or vice versa). The reason is simple. It calculates the number of characters in word1 that are in word2, and then word2 in word1, and returns the lower of the two values. It does not consider the possibility that some characters will show up fewer times in word1, and some characters will show up fewer times in word2. (Example: "aabccc" and "abbbbc" yield 6 with this solution - very unexpected)

    Rather than merely complaining, I will offer a solution:

    sub score { my($word1, $word2) = @_; return $words_are_equal_score if $word1 eq $word2; my $score = 0; for ($word1 =~ /(.)/g) { $score++ if $word2 =~ s/\Q$_\E//; } $score; }

    For the border case (the example) described above, this solution scores "aabccc" and "abbbbc" with the value 3.

    The trick is that the "$word2 =~ s/\Q$_\E//" ensures that characters in $word2 will not be considered twice, because they are removed as they are counted. I will continue to think of a method of accelerating this function. I wanted to get one, fast, correct answer in this thread out first. :-)

      Nicely done.

      I think bart wins first place though and, to be fair, dragonchild's was the first that matched the example code in the original node.

      Excellent point about dragonchild's second solution, by the way. Although, it's still valid for some interpretation of the specs. :-)

      -sauoq
      "My two cents aren't worth a dime.";
      
      Very good catch, MarkM!. I didn't really thoroughly test my solution, leaving that as an exercise for the reader.

      As for my solution, although this wasn't in the original specs, I also wanted to solve the problem without using some variant of split (which /(.)/g is). However, I'm blanking on it.

      Although this may have been posted in the mainline discussion, I'll post this, as it Benchmark's faster than yours.

      sub score { my ($x, $y) = @_; return -1 if $x eq $y; my %x; $x{$_}++ for $x =~ /./g; my $score = 0; $x{$_} && $score++ for $y =~/./g; $score; }
      For 1 million iterations, this benches at 74.56sec. Yours benches at 122.51sec. :-)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.