in reply to Efficient matching with accompanying data
Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?
Ostensibly all you need is:
    # words to match against, and supplementary data to go with matches
    my %lexicon = map{ <word> => <supplementary data> } $csv -> parse

    foreach <tweet> {
        foreach <word_in_tweet> {
            if( exists $lexicon{ <word_in_tweet> } ) {
                save $lexicon{ <word_in_tweet> } TO tweet.result_data;
            }
        }
    }
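For illustration, here is a minimal runnable sketch of the same idea. The lexicon entries, the `score` field, the sample tweets and the `match_tweet` helper are all hypothetical stand-ins; in practice the hash would be filled from the CSV file, and the tokenising rule (lowercase, split on non-letters) is an assumption that may not suit real tweets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lexicon: word => supplementary data. In practice this
# would be loaded from the CSV file rather than hard-coded.
my %lexicon = (
    good  => { score =>  1 },
    bad   => { score => -1 },
    great => { score =>  2 },
);

# Hypothetical sample tweets.
my @tweets = (
    'What a GREAT day, really good!',
    'Bad weather again',
    'Nothing to see here',
);

# Return an array ref of [ word, data ] pairs for every word of the
# tweet that is a key in %lexicon.
sub match_tweet {
    my( $tweet ) = @_;
    my @hits;
    # Lowercase, then split on runs of non-letters to get bare words;
    # grep drops the empty leading field split can produce.
    for my $word ( grep { length } split /[^a-z]+/, lc $tweet ) {
        push @hits, [ $word, $lexicon{ $word } ] if exists $lexicon{ $word };
    }
    return \@hits;
}

for my $tweet ( @tweets ) {
    my $hits = match_tweet( $tweet );
    printf "%-32s => %s\n", $tweet,
        join( ', ', map { "$_->[0] ($_->[1]{score})" } @$hits ) || '(no matches)';
}
```

Each word costs one hash lookup, so the work per tweet grows with the number of words in the tweet, not with the size of the lexicon.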
A hash will outperform the regex engine's trie hands down in terms of speed.
A quick test shows that running the 216,000 words in an HTML version of The Origin of Species against my 179,000-word dictionary using a hash takes 0.17 seconds.
However, using the regex engine's built-in trie (to hold the 179,000-word dictionary) processes ~10 lines per second, which means it should finish after ~38 minutes (Update: it took 34.5 minutes):
    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use Time::HiRes qw[ time ];

    # Load the dictionary, one word per line.
    chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );

    # Hash version of the lexicon: word => supplementary data.
    my %lexicon;
    $lexicon{ $_ } = 'supplementary data' for @words;

    # Regex version: one big alternation, which the regex engine compiles to a trie.
    my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words ) . ')';
    #print $re; exit;

    open my $infile, '<', $ARGV[ 0 ] or die $!;

    # Pass 1: hash lookups.
    my $start1 = time;
    seek $infile, 0, 0;
    my( $words, $found1 ) = ( 0, 0 );
    while( <$infile> ) {
        printf "\r$.\t";
        tr[a-zA-Z][ ]cs;            # replace runs of non-letters with a single space
        for my $word ( split ) {
            ++$words;
            ++$found1 if exists $lexicon{ $word };
        }
    }
    my $end1 = time;
    printf "Finding $found1 words (of $words) took %f seconds using a hash\n", $end1 - $start1;

    # Pass 2: the regex engine's trie.
    my $start2 = time;
    seek $infile, 0, 0;
    $. = 1;
    my $found2 = 0;
    while( <$infile> ) {
        printf "\r$.\t";
        tr[a-zA-Z][ ]cs;
        tr[A-Z][a-z];
        ++$found2 while m[$re]g;
    }
    my $end2 = time;
    printf "Finding $found2 words took %f seconds using a trie(via regex engine)\n", $end2 - $start2;

    __END__
    C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
    Finding 203474 words (of 216808) took 0.173504 seconds using a hash
    Finding 203474 words took 2072.099258 seconds using a trie(via regex engine)