comment on

Unfortunately I've spend some time I don't have into this! :(

As it seems that trie optimization brakes for more than 15000 alternative patterns.

up to this number the performance is competitive:

481Finding 883 words (of 5038) took 0.039814 seconds using a hash
482Finding 784 words took 0.027246 seconds using a trie(via regex engi
+ne)
[download]

please note that I somewhere when experimenting I introduced a bug which skipped 100 matches...

couldn't find the whole of Darwin's text and took chapter2, and as a dictionary I took (and cleaned) the en-US.dic in mozillas path.

Please note some changes I did:

1. I sorted descending by size not ascending, to fulfill the OP's requirements of multiword matches.

2. You didn't use word delimiter, such that the regex has to start matching in the middle of the word. Thats why I put spaces around the parens.

3. I don't know why you lowercased the input, my dict is case-sensitive, as a side note perl has lc

4. I don't see the point of looping over m//g if you can get all matches in one attempt.

5. I precompiled the regex to avoid recompilations.

> I'd love to see a demonstration of the trie optimisation actually doing something.

OK, I'm posting the code now as it is as a starting point for everyone who wants to experiment with it: Disabling the buffer (negative value) will show the effect of missing trie.

have fun!

 use strict;
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];

chomp( my @words = do{ local @ARGV = 'en-US.dic'; <> } );

${^RE_TRIE_MAXBUF}=2**16;
$|=1;


my %lexicon;

my $limit=10000;
for (@words) {
  s/\/.*$//;
#  next if length($_)<3;
  last unless $limit--;
  $lexicon{ $_ } = 'suplementary data';
}

#print join "\t", grep {length() <3 } keys %lexicon ;exit;


my $re = ' (' . join( '|', sort{ length( $b ) <=> length( $a ) } keys 
+%lexicon ) . ') ';
my $cre = qr/$re/;

#print $re; exit;

open my $infile, '<', $ARGV[ 0 ] or die $!;

my @matches1;
my $start1 = time;
seek $infile, 0, 0;
my( $words, $found1 ) = ( 0, 0 );
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;                        # 
    tr[A-Z][a-z];
    for my $word ( split ) {
        ++$words;
        if (exists $lexicon{ $word }) {
      $found1++;
      push @matches1,$word;
    }
    
    }
}
my $end1 = time;

printf "Finding $found1 words (of $words) took %f seconds using a hash
+\n", $end1 - $start1;


my $start2 = time;
seek $infile, 0, 0; $. = 1;
my $found2 = 0;
my $text="";
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    tr[A-Z][a-z];
#    ++$found2 while m[$cre]g;
    $text.=$_." ";
  }

my @matches2 = $text =~ /$cre/g;
$found2=scalar @matches2;
my $end2 = time;

printf "Finding $found2 words took %f seconds using a trie(via regex e
+ngine)\n", $end2 - $start2;

my %matches;
@matches{@matches1}=();
print scalar keys %matches;
delete @matches{@matches2};
print "missing matches:\n";

pp \%matches;
#print $text;
[download]

Cheers Rolf

( addicted to the Perl Programming Language)

PS: The requirements of the OP weren't clear for me, if only whole words are of interest I recommend sticking with the hash method.

In reply to Re^6: Efficient matching with accompanying data by LanX
in thread Efficient matching with accompanying data by Endless

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.