Re^3: Using indexing for faster lookup in large file

by Your Mother (Archbishop)
on Mar 05, 2015 at 04:37 UTC ( [id://1118835]=note )


in reply to Re^2: Using indexing for faster lookup in large file
in thread Using indexing for faster lookup in large file

Hmmm… Indexing is expensive but it shouldn’t take so long for such a simple data set. Perhaps the duplicate data in your autogenerated set is harder to segment/index…? Sorry to plunk so much more down without <readmore/> but I hate clicking on them and this is likely the dying ember of the thread. Anyway, here’s Wonderwall.

Fake Data Maker

Took 8 minutes to generate a 30G “db” that felt like a decent facsimile for the purposes here.

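use 5.014;
use strictures;
use List::Util "shuffle";

# Load the system dictionary and shuffle it once up front.
open my $words, "<", "/usr/share/dict/words" or die $!;
chomp ( my @words = <$words> );
my $top = @words - 40;
@words = shuffle @words;

open my $db, ">", "/tmp/PM.db" or die $!;

for my $id ( 999_999 .. 999_999_999 )
{
    use integer;    # integer arithmetic inside the loop
    # Pick a random window into the shuffled word list.
    my $end = rand($top);
    my $range = rand(35) + 5;
    my $start = $end - $range;
    $start = 0 if $start < 0;

    # One record per line: id;word;word;...
    say {$db} join ";", $id, shuffle @words[ $start .. $end ];

    # Stop once the file size passes 32,000,000,000 bytes.
    last if -s $db > 32_000_000_000;
}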

Indexer

Took 5h:32m to index 30G of 126,871,745 records. This is a relatively powerful Mac. I suspect doing commits less frequently or only at the end would speed it up a bit but you can only search “live” what’s been committed during indexing.

use 5.014;
use strictures;
use Lucy;

my $index = "./lucy.index";

# Schema: "id" is stored as an exact string, "content" is analyzed full text.
my $schema = Lucy::Plan::Schema->new;
my $easyanalyzer = Lucy::Analysis::EasyAnalyzer
    ->new( language => 'en' );

my $text_type = Lucy::Plan::FullTextType
    ->new( analyzer => $easyanalyzer, );

my $string_type = Lucy::Plan::StringType->new();

$schema->spec_field( name => 'id', type => $string_type );
$schema->spec_field( name => 'content', type => $text_type );

open my $db, "<", "/tmp/PM.db" or die $!;

my $indexer = get_indexer();
my $counter = 1;
while (<$db>)
{
    chomp;
    # Limit of 2: everything after the first ";" is the content.
    my ( $id, $text ) = split /;/, $_, 2;
    $indexer->add_doc({ id => $id, content => $text });
    # Commit every 100,000 records, then start a fresh indexer.
    unless ( $counter++ % 100_000 )
    {
        print "committing a batch...\n";
        $indexer->commit;
        $indexer = get_indexer();
    }
}

print "optimizing and committing...\n";
$indexer->optimize;
$indexer->commit;

sub get_indexer {
    Lucy::Index::Indexer
        ->new( schema => $schema, index => $index, create => 1 );
}
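One detail in the indexer worth a quick sketch: the third argument to split. With a limit of 2, only the first semicolon separates the id from the content, so the word list stays in one piece. The sample line below is made up for illustration, in the same layout the data maker writes.

```perl
use strict;
use warnings;

# Hypothetical sample line in the same "id;word;word;..." layout as /tmp/PM.db.
my $line = "78311109;revealing;sulfocarbimide;Darwinize\n";
chomp $line;

# With a limit of 2, split stops after the first ";" and leaves the rest intact.
my ( $id, $text ) = split /;/, $line, 2;

print "id:      $id\n";
print "content: $text\n";
```

Without the limit, $text would only get the first word and the rest would be thrown away.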

Searcher

Note, it can be used while indexing progresses. Only writes require a lock on the index.

use 5.014;
use strictures;
use Lucy;
use Time::HiRes "gettimeofday", "tv_interval";
use Number::Format "format_number";

my $index = "./lucy.index";

my $searcher = Lucy::Search::IndexSearcher
    ->new( index => $index );

# Count everything visible to this searcher.
my $all = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
print "Searching ", format_number($all->total_hits), " records.\n";

print "Query (q to quit): ";
while ( my $q = <STDIN> )
{
    chomp $q;
    exit if $q =~ /\Aq(uit)?\z/i;
    my $t0 = [gettimeofday()];
    # Fetch only the top 3 hits; total_hits still reports the full match count.
    my $hits = $searcher->hits( query => $q, num_wanted => 3 );
    printf "\nMatched %s record%s in %1.2f milliseconds\n",
        format_number($hits->total_hits),
        $hits->total_hits == 1 ? "" : "s",
        1_000 * tv_interval( $t0, [gettimeofday()] );
    while ( my $hit = $hits->next )
    {
        printf "%12d -> %s\n", $hit->{id}, $hit->{content};
    }
    print "\nQuery: ";
}

Some Sample Output

Some things this does out of the box, and that can easily be adapted to any preferred style: stemming, non-stemming, logical OR/AND. Compound queries are generally very cheap. Update: I do no compound queries here. That would involve multiple query objects being connected in the searcher.

Searching 126,871,745 records.
Query (q to quit): ohai

Matched 0 records in 1.33 milliseconds

Query: taco

Matched 0 records in 0.30 milliseconds

Query: dingo

Matched 12,498 records in 17.69 milliseconds
    79136688 -> incandescency;scratchiness;ungnarred;dingo;desmachymatous;verderer
    78453332 -> dingo;verderer;incandescency;ungnarred;coinsurance;scratchiness;desmachymatous
    78367042 -> verderer;ungnarred;incandescency;dingo;desmachymatous;scratchiness

Query: 78311109

Matched 1 record in 80.07 milliseconds
    78311109 -> revealing;sulfocarbimide;Darwinize;reproclamation;intermedial;Cinclidae

Query: perl

Matched 12,511 records in 34.92 milliseconds
    78437383 -> unnoticeableness;radiectomy;brogger;rumorer;oreillet;befan;perle
    59450674 -> perle;Avery;autoxidizability;tidewaiter;radiectomy;filthily
    59125043 -> oreillet;perle;Avery;autoxidizability;filthily;tidewaiter;radiectomy

Query: pollen OR bee

Matched 61,997 records in 27.14 milliseconds
   127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
    79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
    78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

Query: pollen

Matched 24,674 records in 1.58 milliseconds
   127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
    79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
    78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

Query: pollen AND bee

Matched 0 records in 21.61 milliseconds
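For the compound queries mentioned in the update, here is a minimal sketch of what connecting multiple query objects might look like. It is not what the searcher above does. It assumes a throwaway in-memory index (Lucy::Store::RAMFolder) with three made-up records so it runs without touching ./lucy.index. Note that TermQuery terms are not run through the analyzer, so they have to match the indexed (lowercased, stemmed) form; "pollen" and "bee" should be unchanged by the English stemmer, but treat that as an assumption.

```perl
use strict;
use warnings;
use Lucy;

# Same two-field schema as the indexer above.
my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
$schema->spec_field(
    name => 'id',
    type => Lucy::Plan::StringType->new,
);
$schema->spec_field(
    name => 'content',
    type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
);

# Assumption: a tiny in-memory index with invented records, for illustration only.
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => $folder,
    create => 1,
);
$indexer->add_doc({ id => "1", content => "pollen;brasque;snuffle" });
$indexer->add_doc({ id => "2", content => "bee;rave;gibel" });
$indexer->add_doc({ id => "3", content => "pollen;bee;salve" });
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );

# Leaf queries: one exact term each, against the analyzed "content" field.
my $pollen = Lucy::Search::TermQuery->new( field => 'content', term => 'pollen' );
my $bee    = Lucy::Search::TermQuery->new( field => 'content', term => 'bee' );

# Compound queries built by connecting the leaf query objects.
my $and_query = Lucy::Search::ANDQuery->new( children => [ $pollen, $bee ] );
my $or_query  = Lucy::Search::ORQuery->new( children => [ $pollen, $bee ] );

my $and_hits = $searcher->hits( query => $and_query );
my $or_hits  = $searcher->hits( query => $or_query );

printf "AND matched %d, OR matched %d\n",
    $and_hits->total_hits, $or_hits->total_hits;
```

The same query objects could be handed to the interactive searcher in place of the raw string; hits() accepts either.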
