Re: Efficient matching with accompanying data

Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?

Ostensibly all you need is:

my %lexicon = map{ <word> => <supplementary data> } $csv -> parse    #
+ words to match against, and supplementary data to go with matches
foreach <tweet> {
    foreach <word_in_tweet> {
        if( exists $lexicon{ <word_in_tweet> } ) {
            save $lexicon{ <word_in_tweet> } TO tweet.result_data;
        }
    }
}
[download]

A hash will out perform the regex engine trie hands down in terms of speed.

A quick test shows that running the 216,000 words in an html version of The Origin of Species against my 179,000 word dictionary using a hash takes 0.17 seconds.

However, using the regex engines built-in trie (to hold the 179,000 word dictionary), is processing ~10 lines per second which means it should be finished after ~38 minutes (Update:it took 34.5 minutes):

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];

chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );

my %lexicon; $lexicon{ $_ } = 'suplementary data' for @words;


my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words
+ ) . ')';

#print $re; exit;

open my $infile, '<', $ARGV[ 0 ] or die $!;

my $start1 = time;
seek $infile, 0, 0;
my( $words, $found1 ) = ( 0, 0 );
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    for my $word ( split ) {
        ++$words;
        ++$found1 if exists $lexicon{ $word };
    }
}
my $end1 = time;

printf "Finding $found1 words (of $words) took %f seconds using a hash
+\n", $end1 - $start1;

my $start2 = time;
seek $infile, 0, 0; $. = 1;
my $found2 = 0;
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    tr[A-Z][a-z];
    ++$found2 while m[$re]g;
}
my $end2 = time;

printf "Finding $found2 words took %f seconds using a trie(via regex e
+ngine)\n", $end2 - $start2;
__END__
C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\10
+43602.pl 2009-h.htm
Finding 203474 words (of 216808) took 0.173504 seconds using a hash
Finding 203474 words took 2072.099258 seconds using a trie(via regex e
+ngine)
[download]

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re: Efficient matching with accompanying data Select or Download Code

Replies are listed 'Best First'.
Re^2: Efficient matching with accompanying data by Endless (Beadle) on Jul 11, 2013 at 23:17 UTC
Thanks for your suggestion! The hash is, indeed, enormously more effective than either Trie or Regex turned out to be for me.	[reply]
Re^2: Efficient matching with accompanying data by Anonymous Monk on Jul 11, 2013 at 13:11 UTC
Given that timing I suspect the trie optimization was not used. If the regexp is too long, perl gives up on the trie optimization and resorts to standard alternation. I ran into this a while ago and split the patterns into a set of regexps that where small enough to do the optimization. I then ran that set one after the other. The regexp debug output will tell you whether you are actually getting the tries.	[reply]
Re^3: Efficient matching with accompanying data (${^RE_TRIE_MAXBUF}) by LanX (Saint) on Jul 11, 2013 at 13:39 UTC
> Given that timing I suspect the trie optimization was not used. Sure, see also `${^RE_TRIE_MAXBUF}` in perlvar ${^RE_TRIE_MAXBUF} Controls how certain regex optimisations are applied an +d how much memory they utilize. This value by default is 6553 +6 which corresponds to a 512kB temporary cache. Set this to a h +igher value to trade memory for speed when matching large alternations. Set it to a lower value if you want the optimisations to be as conservative of memory as possib +le but still occur, and set it to a negative value to prevent +the optimisation and conserve the most memory. Under norma +l situations this variable should be of no interest to yo +u. [download] Disabling the buffer is also a mean to show the effect of a missing optimization. It's also much faster to join all strings to be inspected on the LHS and to run the regex just once. But if only word-boundaries are needed, I suppose hash-lookups can still be faster, of course depending on "density" and "distance" of the searched patterns. Cheers Rolf ( addicted to the Perl Programming Language)	[reply] [d/l] [select]
Re^3: Efficient matching with accompanying data by BrowserUk (Patriarch) on Jul 11, 2013 at 15:28 UTC
Hm. I just tried setting `${^RE_TRIE_MAXBUF}` to every power of 2 from 18 through 32 and it didn't make a blind bit of difference. ... our $M //= 16; ${^RE_TRIE_MAXBUF} = 2**$M; ... __END__ C:\test>for /l %i in (18,1,32) do @\perl5.18\bin\perl.exe -s 1043602.pl -M=%i C:\docs\OriginOfSpecie +s(Darwin)\2009-h\2009-h.htm Finding 203474 words (of 216808) took 0.175862 seconds using a hash 201 Finding 203474 words (of 216808) took 0.174275 seconds using a + hash 249 Finding 203474 words (of 216808) took 0.173872 seconds using a + hash 174 Finding 203474 words (of 216808) took 0.174906 seconds using a + hash 201 Finding 203474 words (of 216808) took 0.175149 seconds using a + hash 205 Finding 203474 words (of 216808) took 0.172946 seconds using a + hash 183 Finding 203474 words (of 216808) took 0.176250 seconds using a + hash 196 Finding 203474 words (of 216808) took 0.175480 seconds using a + hash 201 Finding 203474 words (of 216808) took 0.174416 seconds using a + hash 196 Finding 203474 words (of 216808) took 0.179129 seconds using a + hash 210 Finding 203474 words (of 216808) took 0.174478 seconds using a + hash 197 Finding 203474 words (of 216808) took 0.174674 seconds using a + hash 183 Finding 203474 words (of 216808) took 0.177501 seconds using a + hash 225 Finding 203474 words (of 216808) took 0.174395 seconds using a + hash 197 Finding 203474 words (of 216808) took 0.178181 seconds using a + hash 221 C:\test> [download] Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 line/s. Personally, I've yet to see a situation where the so-called trie optimisation benefited my code. (See also Re: pattern match, speed problem) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^4: Efficient matching with accompanying data by LanX (Saint) on Jul 11, 2013 at 15:59 UTC
I have no time to look into your code but your benchmarks always talk about "... seconds using a hash" Cheers Rolf ( addicted to the Perl Programming Language)	[reply]
Re^5: Efficient matching with accompanying data by BrowserUk (Patriarch) on Jul 11, 2013 at 16:25 UTC
Re^6: Efficient matching with accompanying data by LanX (Saint) on Jul 12, 2013 at 01:40 UTC
Some notes below your chosen depth have not been shown here
Re^6: Efficient matching with accompanying data by Anonymous Monk on Jul 12, 2013 at 08:38 UTC