comment on

To expand on the idea of the Boyer-Moore search, I think the fastest way to find if your words are contained in a string is a trade off in startup time versus runtime, as the startup cost can be cached.

The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure if the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word-tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it - demperhq did something similar (longest prefix matching I think). An example tree could be like this:

  # a negative number indicates how many chars you just matched
  a -> p -> p -> l -> e -> -5 
  b -> a -> r -> -3
    -> l -> o -> s -> s -> o -> m -> -7
              -> u -> s -> e -> -6
  u -> s -> e -> -3
[download]

The idea now is to inch through the string, keeping track of the "running" words, advancing them all down trough the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters away.

Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue.

Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still bear potential for optimization, for example the cleaning out of non-matching words, and the calculation of the matched string.

Update 2: After thinking about it shortly, this only finds the longest word starting at each character, and could miss out on bee, if there are bee, beetle in the dictionary and only beetl in the string. I've updated the code to catch such instances as well - now it outputs all matches, even submatches, such as bee in beetle.

#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my $dict = {};

while (<DATA>) {
  chomp;
  insert($_);
};

sub insert {
  my @chars = split '', shift;
  my $len = - scalar @chars;
  my $curr = $dict;
  while (@chars) {
    my $char = shift @chars;
    if (! exists $curr->{$char}) { $curr->{$char} = {} };
    $curr = $curr->{$char};
  };
  $curr->{''} = $len;
};

# print Dumper $dict;

my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejfbeetl';

my $pos = 0;

my @running;

while ($pos < length $str) {
  my $char = substr( $str, $pos, 1 );
  push @running, $dict;
  for my $word (@running) {
    if (exists $word->{''}) {
      print "Found ", substr( $str, $pos + $word->{''}, - $word->{''} 
+), "\n";
    };
    if (exists $word->{$char}) {
      $word = $word->{$char};
    } else {
      undef $word;
    };
  };
  @running = grep { defined } @running;
  $pos++;
};

__DATA__
apple
blossom
cherry
blouse
bar
bee
beetle
[download]

In reply to Re: Finding dictionary words in a string. by Corion
in thread Finding dictionary words in a string. by ehdonhon

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.