Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives?

Ostensibly all you need is:

# words to match against, and supplementary data to go with matches
my %lexicon = map{ <word> => <supplementary data> } $csv -> parse

foreach <tweet> {
    foreach <word_in_tweet> {
        if( exists $lexicon{ <word_in_tweet> } ) {
            save $lexicon{ <word_in_tweet> } TO tweet.result_data;
        }
    }
}
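For anyone who wants something runnable, here is a minimal sketch of the pseudocode above. The lexicon rows, the tweets, and the `match_tweet` helper are all invented for illustration; in real use the rows would come from your CSV parser rather than a hard-coded list, and lowercasing the words is an assumption about how you want to match.

```perl
use strict;
use warnings;

# Hypothetical lexicon rows in "word,supplementary data" form.
# These stand in for whatever the real CSV file contains.
my @rows = (
    'happy,positive',
    'sad,negative',
    'perl,language',
);

# Build the lookup table once: lowercased word => supplementary data.
my %lexicon = map {
    my( $word, $data ) = split /,/, $_, 2;
    ( lc $word => $data )
} @rows;

# Hypothetical tweets.
my @tweets = (
    'So happy to be learning Perl today',
    'Nothing to see here',
);

# Return an array ref of [ word, supplementary data ] pairs for one tweet.
sub match_tweet {
    my( $tweet ) = @_;

    # Lowercase a copy, then turn every run of non-letters into one space.
    ( my $text = lc $tweet ) =~ tr/a-z/ /cs;

    my @hits;
    for my $word ( split ' ', $text ) {
        push @hits, [ $word, $lexicon{ $word } ] if exists $lexicon{ $word };
    }
    return \@hits;
}

for my $tweet ( @tweets ) {
    my $hits  = match_tweet( $tweet );
    my $found = join ', ', map { sprintf '%s (%s)', @$_ } @$hits;
    printf "%s => %s\n", $tweet, $found || 'no matches';
}
```

Each word costs one hash lookup, so the work per tweet depends on the number of words in the tweet and not on the size of the lexicon.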

A hash will outperform the regex engine's trie hands down in terms of speed.

A quick test shows that running the 216,000 words in an HTML version of The Origin of Species against my 179,000-word dictionary using a hash takes 0.17 seconds.

However, using the regex engine's built-in trie (to hold the 179,000-word dictionary) processes ~10 lines per second, which means it should finish after ~38 minutes (Update: it took 34.5 minutes):

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];

# Load the dictionary, one word per line.
chomp( my @words = do{ local @ARGV = 'c:/test/words.txt'; <> } );

# Hash version of the dictionary: word => supplementary data.
my %lexicon;
$lexicon{ $_ } = 'supplementary data' for @words;

# Regex version of the dictionary: one big alternation of every word.
my $re = '(' . join( '|', sort{ length( $a ) <=> length( $b ) } @words ) . ')';
#print $re; exit;

open my $infile, '<', $ARGV[ 0 ] or die $!;

# Pass 1: split each line into words and look each one up in the hash.
my $start1 = time;
seek $infile, 0, 0;
my( $words, $found1 ) = ( 0, 0 );
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    for my $word ( split ) {
        ++$words;
        ++$found1 if exists $lexicon{ $word };
    }
}
my $end1 = time;
printf "Finding $found1 words (of $words) took %f seconds using a hash\n", $end1 - $start1;

# Pass 2: count matches of the alternation against each line.
my $start2 = time;
seek $infile, 0, 0;
$. = 1;
my $found2 = 0;
while( <$infile> ) {
    printf "\r$.\t";
    tr[a-zA-Z][ ]cs;
    tr[A-Z][a-z];
    ++$found2 while m[$re]g;
}
my $end2 = time;
printf "Finding $found2 words took %f seconds using a trie(via regex engine)\n", $end2 - $start2;

__END__
C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
Finding 203474 words (of 216808) took 0.173504 seconds using a hash
Finding 203474 words took 2072.099258 seconds using a trie(via regex engine)
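If you don't have a 179,000-word dictionary to hand, a scaled-down comparison can be sketched with the core Benchmark module. Everything here is synthetic and assumed for illustration: the three-letter "words", the text, and the two `count_with_*` helpers. Unlike the script above, this regex is wrapped in `\b` anchors so that both approaches count whole words only. At this size the gap will not look like the figures above; it is only meant to show the shape of the test.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Synthetic data (an assumption, not the real word list or book text).
my @dictionary = ( 'aaa' .. 'azz' );             # 676 three-letter "words"
my $text       = join ' ', ( 'aaa' .. 'bzz' );   # 1352 words, half in the dictionary

# Hash version: only the keys matter for a membership test.
my %lexicon;
@lexicon{ @dictionary } = ();

# Regex version: one alternation, anchored to whole words.
my $alternation = join '|', @dictionary;
my $re          = qr/\b(?:$alternation)\b/;

# Count dictionary words in $text with one hash lookup per word.
sub count_with_hash {
    my $n = 0;
    for my $word ( split ' ', $text ) {
        ++$n if exists $lexicon{ $word };
    }
    return $n;
}

# Count dictionary words in $text by repeatedly matching the alternation.
sub count_with_regex {
    my $n = 0;
    ++$n while $text =~ /$re/g;
    return $n;
}

# Run each counter for at least one CPU second and print a comparison table.
cmpthese( -1, {
    hash  => \&count_with_hash,
    regex => \&count_with_regex,
} );
```

Checking that both subs return the same count before trusting the timings is worthwhile, since an unanchored alternation will also match inside longer words.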


In reply to Re: Efficient matching with accompanying data by BrowserUk
in thread Efficient matching with accompanying data by Endless
