diggler has asked for the wisdom of the Perl Monks concerning the following question:

say you have an input string $str='George Bush is a psycho' and you want to test it against a list of substrings, @sub = ('George', 'George Bush'). You could test each substring in turn, but if the list is large, this wouldn't be efficient. Wouldn't a modified version of Tree::Ternary_XS solve this problem? I've seen Text::Scan, but it doesn't seem that it would work well here. has anybody seen something like this, or is there a better way to solve the problem?
  • Comment on efficient method of matching a string against a list of substrings?

Replies are listed 'Best First'.
Re: efficient method of matching a string against a list of substrings?
by Roy Johnson (Monsignor) on May 05, 2005 at 03:18 UTC
    Create a hash of arrays where the keys are the first two chars of your candidate substrings and the values are the substrings that begin with those chars. e.g.,
    $hash{'Ge'} = ['George', 'George Bush'];
    Then you can run through your hash, checking for the presence of the keys in your string. When a key is found, you look for its values. It might be something of an optimization to remember the pos where the key was found so your value searches could start there.

    Also, study might be your friend for this, if you use a regex rather than index.


    Caution: Contents may have been coded under pressure.
Re: efficient method of matching a string against a list of substrings?
by dave_the_m (Monsignor) on May 05, 2005 at 00:49 UTC
    my $re = join '|', @sub; $re = qr/($re)/; $str =~ $re;

    Dave.

      very large regexes can be even more inefficient. when i say large list, think hundreds or thousands of strings.
Re: efficient method of matching a string against a list of substrings?
by ihb (Deacon) on May 05, 2005 at 01:08 UTC

    Check out Regexp::PreSuf and Regexp::Assemble. They combine lists of words or patterns into a hopefully more efficient pattern than a trivial ORed expression.

    ihb

    See perltoc if you don't know which perldoc to read!

      as noted in Regexp:Assemble:
      You should realise that large numbers of alternations are processed in perl's regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie.
      regexes are often more convenient, rather than efficient. but thanks for pointing me there, because the note also mentions:
      ... Perl's own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out).
      this would probably solve the problem in the future, but i need to find something for the present.

        For the present you can/should use R::A or one of the other regex modules as they should be patched in time of 5.10 to utilize the TRIE optimisation if possible. OTOH You havent really spelled out if you mean an anchored match or not. Ie is it /^(list|of|strings)$/ or is it /^(list|of|strings)/ or is it /(list|of|strings)/? Depending on which there may be non regex based strategies that are quite competitive.

        ---
        $world=~s/war/peace/g

Re: efficient method of matching a string against a list of substrings?
by jhourcle (Prior) on May 05, 2005 at 02:51 UTC

    In the case of your searches, I'd have to ask how many items you have, and how many of those items are subsets of items. (eg, if 'George' fails, there's no reason to test for 'George Bush' ... as the list gets larger, there's a better chance of this coming up) And even if they aren't in there directly, we may have other cheats. (no 'G' in the item? Well, then neither of the items will match... a simple test may be able to strip out large sections of your list)

    Then we have the question of how many input strings you're searching, in relationship to the number of substrings. (if we're searching only one input string per invocation, we lose if we spend too much time trying to pre-process the list of substrings)

    Also, what is the overall purpose, and does the order of the initial list matter? (for instance -- if we're using it in a switch() statement, once we've found the first match, we don't continue... but we need to make sure the list order doesn't change, or we'll potentially have inconsistent results if more than one list item matches)

Re: efficient method of matching a string against a list of substrings?
by polettix (Vicar) on May 05, 2005 at 12:22 UTC
    I recently had a similar problem and I was to write a post to ask comments about the solution - go figure! I needed to do tests for much less elements than you (around 10000 entries), but I didn't like the idea of a raw array search, while holding the commitment to Keep It Simple (Stupid).

    I used a HoH in which the first key is the length of the string to be tested; when I have to check for existence, I iterate through lengths and test existence in each sub-hash.

    # Use an HoH instead of simple array my %categories; foreach (@sub) { $categories{length($_)}{$_} = undef; } # ... when you need it... sub matches_any { my $t = shift; foreach my $len (keys %categories) { # We could also sort here... return 1 if exists $categories{$len}{substr($t, 0, $len)}; } return 0; }
    Note that in my problem I know that the incoming strings are always longer than those I run tests against, so I skipped testing if length($t) is actually greater or equal to $len. Moreover, I don't have a big variance in lengths, so $len should span few values.

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: efficient method of matching a string against a list of substrings?
by thcsoft (Monk) on May 05, 2005 at 02:43 UTC
    my @results = grep { $str =~ /$_/ } @sub;

    ;)

    language is a virus from outer space.
Re: efficient method of matching a string against a list of substrings?
by TedPride (Priest) on May 05, 2005 at 07:39 UTC
    What you do is create a list of pointers that correspond to the location of each character in the input string. Then sort the pointers in alphabetic order. All you have to do now is sort your substrings in alphabetic order as well and step through each list once, matching as you go.

    This might be more efficiently done in C++.

    EDIT: How efficient do you need to be? Searching a 100,000-character string for 100,000 10-character substrings took me 57 seconds -

    use strict; use warnings; my ($input, $str, @str, $t); $input .= chr(97 + int rand 26) for 1..100000; for (1..100000) { $str = ''; $str .= chr(97 + int rand 26) for 1..10; push @str, $str; } $t = time(); for (@str) { print "$_\n" if index($input, $str) != -1; } print time() - $t;
    To get noticeably more efficient than this, you really need a lower level language like C++.
Re: efficient method of matching a string against a list of substrings?
by mrborisguy (Hermit) on May 05, 2005 at 03:04 UTC
    the first thing that comes to mind for me may be alot of work.

    i would start with something like a binary search tree, hopefully balanced (so red-black tree or avl tree - the only balanced trees i have studied). then maybe just work through the string letter by letter, matching letters as you go along with letters in the tree. i would think if you had a good algorithm, this could speed things up... maybe.

    just my thoughts, throwing them out there for you.