in reply to Using indexing for faster lookup in large file

#!/usr/bin/env perl
use 5.014;
use strictures;
use Lucy;
use Time::HiRes "gettimeofday", "tv_interval";

my $index  = "./lucy.index";
my $schema = Lucy::Plan::Schema->new;

my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
my $text_type    = Lucy::Plan::FullTextType->new( analyzer => $easyanalyzer );
my $string_type  = Lucy::Plan::StringType->new();

$schema->spec_field( name => 'id',      type => $string_type );
$schema->spec_field( name => 'content', type => $text_type );

my $indexer = Lucy::Index::Indexer->new(
    schema   => $schema,
    index    => $index,
    create   => 1,
    truncate => 1,
);

while (<DATA>) {
    my ( $id1, $id2maybe, $text ) = /\A([0-9]+);(?:([0-9]+);)?(.+)/;
    for my $id ( grep defined, $id1, $id2maybe ) {
        $indexer->add_doc({ id => $id, content => $text });
    }
}
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

print "Query (q to quit): ";
while ( my $q = <STDIN> ) {
    chomp $q;
    exit if $q =~ /\Aq(uit)?\z/i;

    my $t0   = [gettimeofday()];
    my $hits = $searcher->hits( query => $q );
    while ( my $hit = $hits->next ) {
        printf "%12d -> %s\n", $hit->{id}, $hit->{content};
    }
    printf "\nMatched %s record%s in %1.1f milliseconds\n",
        $hits->total_hits,
        $hits->total_hits == 1 ? "" : "s",
        1_000 * tv_interval( $t0, [gettimeofday()] );
    print "\nQuery: ";
}

__DATA__
Your 200 lines of test data…
moo@cow[51]~>perl pm-1118102
Query (q to quit): archaea
   259697659 -> root;cellular organisms;Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus abyssi;Pyrococcus abyssi GE5;
      272844 -> root;cellular organisms;Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus abyssi;Pyrococcus abyssi GE5;
   289191770 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus;Methanocaldococcus sp. FS406-22;
      644281 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus;Methanocaldococcus sp. FS406-22;
   490653205 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Haloarcula;Haloarcula vallismortis;
       28442 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Haloarcula;Haloarcula vallismortis;
   493010542 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Natronorubrum;Natronorubrum tibetense;
       63128 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Natronorubrum;Natronorubrum tibetense;
   500681908 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus;Methanococcus aeolicus;
       42879 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus;Methanococcus aeolicus;

Matched 12 records in 0.4 milliseconds

Query: 283552125
   283552125 -> root;Viruses;ssRNA viruses;ssRNA negative-strand viruses;Orthomyxoviridae;Influenzavirus A;Influenza A virus;H5N1 subtype;Influenza A virus (A/chicken/Nigeria/08RS848-4/2006(H5N1));

Matched 1 record in 0.2 milliseconds

Now… what are you getting me for my birthday? :P

Reading: Lucy (lots of reading to do). I expect this will maintain a search speed of a few milliseconds with your full data set. It’s designed to handle millions of much larger and more complex documents. Initial indexing will take a while but you only have to do it once (the script does it every time just to keep the example short and simple). Presentation/splitting of the data content is up to you.
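
In real use you’d run the indexing part once and then just open the existing index for searching; a minimal sketch of that split, reusing the same ./lucy.index directory and fields as above:

#!/usr/bin/env perl
# sketch only: reuse the index built by the script above instead of rebuilding it
use 5.014;
use strictures;
use Lucy;

my $index = "./lucy.index";
die "Build the index first (see the script above)\n" unless -d $index;

# IndexSearcher only reads, so no schema setup is needed here
my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

my $hits = $searcher->hits( query => "archaea" );
while ( my $hit = $hits->next ) {
    printf "%12d -> %s\n", $hit->{id}, $hit->{content};
}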

Re^2: Using indexing for faster lookup in large file
by anli_ (Novice) on Mar 02, 2015 at 17:41 UTC

    Thanks for a great reply. This was exactly what I was looking for.

    It is quite time consuming to do the indexing; it has currently taken > 2 hours (and is still running), but the lookups seem much faster. Testing it out on a smaller dataset, there is a 3x time reduction vs grep, and I'll have to see how that scales with the full dataset.

    I did some modifications to the code, adapting it to the Lucy::Simple module, so right now it looks like this:

    indexer.pl
    #!/usr/bin/perl
    use 5.014;
    use strictures;
    use Lucy::Simple;

    my $index = $ARGV[0];
    system("mkdir -p $index");

    # Lucy::Simple finishes and commits the index when $lucy goes out of scope
    my $lucy = Lucy::Simple->new(
        path     => $index,
        language => 'en',
    );

    open my $data, '<', $ARGV[1] or die $!;
    while ( my $line = <$data> ) {
        my ( $id, $taxid, $text ) = split( /;/, $line, 3 );
        $lucy->add_doc( { id => $id, content => $text } );
    }

    query.pl:
    #!/usr/bin/perl
    use 5.014;
    use strictures;
    use Lucy::Simple;

    my $index = Lucy::Simple->new(
        path     => $ARGV[0],
        language => 'en',
    );

    my $query_string = $ARGV[1];
    my $total_hits   = $index->search( query => $query_string );
    #print "Total hits: $total_hits\n";

    while ( my $hit = $index->next ) {
        print "$hit->{id}\t";
        print "$hit->{content}";
    }

    All the perlmonks XP is your birthday reward :)
Re^2: Using indexing for faster lookup in large file
by erix (Prior) on Mar 04, 2015 at 08:03 UTC

    This Lucy-code is really nice and fast, thanks.

    However, it doesn't work as easily as-is for large files: I let it run for 3 days on a 25 GB file, just the OP-provided 200 lines repeated (on an admittedly slowish AMD 8120, 8 GB memory). I started it last Sunday; today I'd had enough and broke it off.

    2015.03.01 09:35:49 aardvark@xxxx:~
    $ time ./lucy_big.pl
    ^C
    real    4264m3.903s
    user    4205m27.322s
    sys     8m5.160s
    2015.03.04 08:39:58 aardvark@xxxx:

    There is probably a way to do this with better settings...

    A postgres variant, loading the same full 25 GB file, was rock solid and searched reasonably well (~20 ms per search, IIRC; I had to delete it for disk space: size in db was 29 GB).

    Having said that, a pointer-file solution similar to one of the things BrowserUK posted would be my first choice (although I'd likely just use grep -b).
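
    For the record, this is roughly what I mean by a pointer file (a sketch only; the file names are hypothetical, it assumes the numeric id is the first ;-separated field, and for the real 25 GB file you'd want something more compact than an in-memory hash):

    # pointer-file sketch (hypothetical file names)
    use 5.014;
    use strictures;

    my ( $data, $idx ) = ( "hominae.txt", "hominae.idx" );

    # build: one "id <tab> byte-offset" line per record
    open my $in,  "<", $data or die $!;
    open my $out, ">", $idx  or die $!;
    my $pos = 0;
    while ( my $line = <$in> ) {
        my ($id) = $line =~ /\A(\d+);/;
        say {$out} "$id\t$pos" if defined $id;
        $pos = tell $in;
    }
    close $out or die $!;

    # lookup: seek straight to a record instead of scanning the whole file
    open my $ptr, "<", $idx or die $!;
    my %offset = map { chomp; split /\t/ } <$ptr>;
    seek $in, $offset{283552125}, 0;    # 0 = SEEK_SET
    print scalar <$in>;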

    But I'll undoubtedly be able to put your Lucy code to good use (albeit on smaller files), so thanks.

    I'd like to hear from the OP how he fared with Lucy and his large file...

      Hmmm… Indexing is expensive but it shouldn’t take that long for such a simple data set. Perhaps the duplicate data in your autogeneration is harder to segment/index…? Sorry to plunk so much more down without a <readmore/> but I hate clicking on them and this is likely the dying ember of the thread. Anyway, here’s Wonderwall.

      Fake Data Maker

      Took 8 minutes to generate a 30G “db” that felt like a decent facsimile for the purposes here.

      use 5.014;
      use strictures;
      use List::Util "shuffle";

      open my $words, "<", "/usr/share/dict/words" or die $!;
      chomp( my @words = <$words> );
      my $top = @words - 40;
      @words = shuffle @words;

      open my $db, ">", "/tmp/PM.db" or die $!;
      for my $id ( 999_999 .. 999_999_999 ) {
          use integer;
          my $end   = rand($top);
          my $range = rand(35) + 5;
          my $start = $end - $range;
          $start = 0 if $start < 0;
          say {$db} join ";", $id, shuffle @words[ $start .. $end ];
          last if -s $db > 32_000_000_000;
      }

      Indexer

      Took 5h:32m to index 30G of 126,871,745 records. This is a relatively powerful Mac. I suspect doing commits less frequently or only at the end would speed it up a bit but you can only search “live” what’s been committed during indexing.

      use 5.014;
      use strictures;
      use Lucy;

      my $index  = "./lucy.index";
      my $schema = Lucy::Plan::Schema->new;

      my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
      my $text_type    = Lucy::Plan::FullTextType->new( analyzer => $easyanalyzer );
      my $string_type  = Lucy::Plan::StringType->new();

      $schema->spec_field( name => 'id',      type => $string_type );
      $schema->spec_field( name => 'content', type => $text_type );

      open my $db, "<", "/tmp/PM.db" or die $!;

      my $indexer = get_indexer();
      my $counter = 1;
      while (<$db>) {
          chomp;
          my ( $id, $text ) = split /;/, $_, 2;
          $indexer->add_doc({ id => $id, content => $text });

          unless ( $counter++ % 100_000 ) {
              print "committing a batch...\n";
              $indexer->commit;
              $indexer = get_indexer();
          }
      }
      print "optimizing and committing...\n";
      $indexer->optimize;
      $indexer->commit;

      sub get_indexer {
          Lucy::Index::Indexer->new( schema => $schema, index => $index, create => 1 );
      }

      Searcher

      Note: it can be used while indexing is still in progress. Only writes require a lock on the index.

      use 5.014;
      use strictures;
      use Lucy;
      use Time::HiRes "gettimeofday", "tv_interval";
      use Number::Format "format_number";

      my $index    = "./lucy.index";
      my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

      my $all = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
      print "Searching ", format_number( $all->total_hits ), " records.\n";

      print "Query (q to quit): ";
      while ( my $q = <STDIN> ) {
          chomp $q;
          exit if $q =~ /\Aq(uit)?\z/i;

          my $t0   = [gettimeofday()];
          my $hits = $searcher->hits( query => $q, num_wanted => 3 );
          printf "\nMatched %s record%s in %1.2f milliseconds\n",
              format_number( $hits->total_hits ),
              $hits->total_hits == 1 ? "" : "s",
              1_000 * tv_interval( $t0, [gettimeofday()] );

          while ( my $hit = $hits->next ) {
              printf "%12d -> %s\n", $hit->{id}, $hit->{content};
          }
          print "\nQuery: ";
      }

      Some Sample Output

      Some things that this does out of the box, and can easily adapt to any preferred style: stemming, non-stemming, logical OR/AND. Compound queries are generally very cheap. Update: I do no compound queries here; that would involve multiple query objects being connected in the searcher (see the sketch after the output below).

      Searching 126,871,745 records.
      Query (q to quit): ohai

      Matched 0 records in 1.33 milliseconds

      Query: taco

      Matched 0 records in 0.30 milliseconds

      Query: dingo

      Matched 12,498 records in 17.69 milliseconds
          79136688 -> incandescency;scratchiness;ungnarred;dingo;desmachymatous;verderer
          78453332 -> dingo;verderer;incandescency;ungnarred;coinsurance;scratchiness;desmachymatous
          78367042 -> verderer;ungnarred;incandescency;dingo;desmachymatous;scratchiness

      Query: 78311109

      Matched 1 record in 80.07 milliseconds
          78311109 -> revealing;sulfocarbimide;Darwinize;reproclamation;intermedial;Cinclidae

      Query: perl

      Matched 12,511 records in 34.92 milliseconds
          78437383 -> unnoticeableness;radiectomy;brogger;rumorer;oreillet;befan;perle
          59450674 -> perle;Avery;autoxidizability;tidewaiter;radiectomy;filthily
          59125043 -> oreillet;perle;Avery;autoxidizability;filthily;tidewaiter;radiectomy

      Query: pollen OR bee

      Matched 61,997 records in 27.14 milliseconds
         127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
          79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
          78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

      Query: pollen

      Matched 24,674 records in 1.58 milliseconds
         127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
          79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
          78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

      Query: pollen AND bee

      Matched 0 records in 21.61 milliseconds
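
      A hand-rolled compound query would look something like this (a sketch against the same index and 'content' field; note that TermQuery matches the raw indexed term, so no analysis/stemming is applied on the query side):

      use 5.014;
      use strictures;
      use Lucy;

      # sketch: same index directory as the indexer above
      my $searcher = Lucy::Search::IndexSearcher->new( index => "./lucy.index" );

      # connect two query objects with an OR; ANDQuery works the same way
      my $pollen = Lucy::Search::TermQuery->new( field => 'content', term => 'pollen' );
      my $bee    = Lucy::Search::TermQuery->new( field => 'content', term => 'bee' );
      my $either = Lucy::Search::ORQuery->new( children => [ $pollen, $bee ] );

      my $hits = $searcher->hits( query => $either, num_wanted => 3 );
      while ( my $hit = $hits->next ) {
          printf "%12d -> %s\n", $hit->{id}, $hit->{content};
      }
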
      (on an admittedly slowish AMD 8120, 8 GB memory)

      I'll trade you your 3.4/4.0 GHz 8-thread processor for my 2.4 GHz 4-core if you like :)

      If you haven't thrown away that 25GB file and can spare your processor for an hour, I'd love to see a like-for-like comparison of my code in Re: Using indexing for faster lookup in large file (PP < 0.0005s/record).


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        The 25GB file that I used isn't sorted. Your indexer program indexes it in 7 minutes but subsequently the searcher cannot find anything in it (reports 'Not found' 1000 times).

        I've also tried with the file ordered, but initially I could not get it to find anything. It turns out the regex used in your indexer, [^(\d+),], did not match anything in the file. When I fixed that (it had to be [^(\d+);] for the OP's lines), the results were as follows (both measurements are very repeatable).

        Files:

        $ ls -lh hominae.txt hominae_renumbered.*
        -rw-rw-r--. 1 aardvark aardvark 1.5G Mar  4 13:48 hominae_renumbered.idx
        -rw-r--r--. 1 aardvark aardvark  25G Mar  4 13:17 hominae_renumbered.txt
        -rw-rw-r--. 1 aardvark aardvark  25G Feb 28 01:41 hominae.txt

        hominae.txt is the 25GB file which I made by repeating the OP's 200 lines.

        hominae_renumbered.txt is the same file but with the initial numbers replaced by 1 to 13M (so it is ordered).
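
        (For reference, a hypothetical one-liner along these lines is all the renumbering takes: it replaces the leading id on each line with the line number.)

        $ perl -pe 's/^\d+/$./' hominae.txt > hominae_renumbered.txt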

        Timing, your pointer file:

        $ perl browseruk2_searcher.pl \
            hominae_renumbered.txt \
            hominae_renumbered.idx > bukrun; tail -n1 bukrun
        'Lookup averaged 0.012486 seconds/record

        Timing, database search:

        # I took a join to 1000 random numbers as equivalent to 1000 searches:
        # table hm is the table with the 25GB data loaded into it
        $ echo "select * from
            (select (random()*131899400)::int from generate_series(1,1000)) as r(n)
            join hm on r.n = hm.id;" | psql -q | tail -n 1
        Time: 19555.717 ms

        So your pointer file is faster (roughly 12.5 ms vs ~19.6 ms per lookup), but only by a small margin. I thought it was small, anyway; I had expected a much larger difference, with the db of course being the slower contender.

        Your indexing was faster too: it took only ~7 minutes to create. I forgot to time the db load, but it was in the region of half an hour (it could have been sped up a bit by doing the import and the indexing separately).

        Just for the record, here is also the db load:

        time < hominae.txt perl -ne '
            chomp;
            my @arr = split(/;/, $_, 2);
            print $arr[1], "\n";
        ' \
        | psql -c "
            drop table if exists hm;
            create table if not exists hm (line text, id serial primary key);
            copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
        ";

        testdb=# \dti+ hm*
                              List of relations
         Schema |  Name   | Type  |  Owner   | Table |  Size
        --------+---------+-------+----------+-------+---------
         public | hm      | table | aardvark |       | 29 GB
         public | hm_pkey | index | aardvark | hm    | 2825 MB
        (2 rows)