in reply to Measuring Substrings Problem
Here's a fairly simply approach:
sub score{ my($str, $array) = @_; # ensure that longer string comes before its prefix my @substring = sort { $b cmp $a } @$array; my $re = join '|', map "(?=($_))", @substring; my($count, $next) = (0, 0); while ($str =~ /$re/g) { # $-[0] is the position at which we matched # @- describes the matched captures, so $#- is the actual capture +matched my($start, $which) = ($-[0], $#- - 1); my $end = $start + length($substring[$which]); $next = $start if $next < $start; next if $end < $next; $count += $end - $next; $next = $end; } return $count; }
This assumes that you may want substrings of varying lengths within the array, and that you may even have one substring in a set that is an exact prefix of another - if you don't need to allow for one or both of those possibilities, the code could be simplified a bit further.
It does assume however that the substrings are simple strings to match directly rather than regexps in their own right: it would otherwise need a different approach to discovering the length of each match.
The idea is to construct a regexp that will match any of the strings at any position by turning each into a lookahead; and to make each substring a capture so that we know which matched, and can therefore work out the length of the match.
The rest of the code remembers what positions have already been catered for to avoid double-counting.
Hugo
|
|---|