Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?
Ostensibly all you need is:
# words to match against, and supplementary data to go with matches
my %lexicon = map{ <word> => <supplementary data> } $csv -> parse

foreach <tweet> {
    foreach <word_in_tweet> {
        if( exists $lexicon{ <word_in_tweet> } ) {
            save $lexicon{ <word_in_tweet> } TO tweet.result_data;
        }
    }
}
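For anyone who wants to try the idea before wiring up the CSV parsing, here is a minimal runnable sketch of that pseudocode. The lexicon entries, the tweets, the `score` field and the `match_tweet` helper are all made up for illustration; in the real thing the hash would be filled from the parsed CSV.

```perl
use strict;
use warnings;

# Hypothetical lexicon: word => supplementary data.
# In practice this would be built from the parsed CSV.
my %lexicon = (
    happy => { score =>  2 },
    sad   => { score => -2 },
    perl  => { score =>  1 },
);

# Hypothetical tweets standing in for the real input.
my @tweets = (
    'Happy to be writing Perl today',
    'Nothing to see here',
);

# Return an array ref holding one { word, data } record for every
# word of $text that exists as a key in %lexicon.
sub match_tweet {
    my( $text ) = @_;
    my $clean = lc $text;
    $clean =~ tr[a-z][ ]cs;    # turn each run of non-letters into one space
    my @hits;
    for my $word ( split ' ', $clean ) {
        push @hits, { word => $word, data => $lexicon{ $word } }
            if exists $lexicon{ $word };
    }
    return \@hits;
}

for my $tweet ( @tweets ) {
    my $hits = match_tweet( $tweet );
    my $summary = join ', ', map { "$_->{word}=$_->{data}{score}" } @$hits;
    print "$tweet: ", ( length $summary ? $summary : '(no matches)' ), "\n";
}
```

Each word costs one hash lookup regardless of how many entries the lexicon holds, which is why this approach scales to a large dictionary.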
A hash will outperform the regex engine's trie hands down on speed.
A quick test shows that running the 216,000 words in an HTML version of The Origin of Species against my 179,000-word dictionary using a hash takes 0.17 seconds.
However, using the regex engine's built-in trie (to hold the 179,000-word dictionary), it processes ~10 lines per second, which means it should finish after ~38 minutes (Update: it took 34.5 minutes):
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];

# Load the dictionary, one word per line.
chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );

my %lexicon;
$lexicon{ $_ } = 'supplementary data' for @words;

# One big alternation; the regex engine compiles it into a trie.
my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words ) . ')';
#print $re; exit;

open my $infile, '<', $ARGV[ 0 ] or die $!;

# Pass 1: split each line into words and look each one up in the hash.
my $start1 = time;
seek $infile, 0, 0;
my( $words, $found1 ) = ( 0, 0 );
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    for my $word ( split ) {
        ++$words;
        ++$found1 if exists $lexicon{ $word };
    }
}
my $end1 = time;
printf "Finding $found1 words (of $words) took %f seconds using a hash\n", $end1 - $start1;

# Pass 2: count matches of the alternation regex against each line.
my $start2 = time;
seek $infile, 0, 0;
$. = 1;
my $found2 = 0;
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    tr[A-Z][a-z];
    ++$found2 while m[$re]g;
}
my $end2 = time;
printf "Finding $found2 words took %f seconds using a trie(via regex engine)\n", $end2 - $start2;

__END__
C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
Finding 203474 words (of 216808) took 0.173504 seconds using a hash
Finding 203474 words took 2072.099258 seconds using a trie(via regex engine)
In reply to Re: Efficient matching with accompanying data
by BrowserUk
in thread Efficient matching with accompanying data
by Endless