in reply to Using indexing for faster lookup in large file

#!/usr/bin/env perl
use 5.014;
use strictures;
use Lucy;
use Time::HiRes "gettimeofday", "tv_interval";

my $index  = "./lucy.index";
my $schema = Lucy::Plan::Schema->new;

my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
my $text_type    = Lucy::Plan::FullTextType->new( analyzer => $easyanalyzer );
my $string_type  = Lucy::Plan::StringType->new();

$schema->spec_field( name => 'id',      type => $string_type );
$schema->spec_field( name => 'content', type => $text_type );

my $indexer = Lucy::Index::Indexer->new(
    schema   => $schema,
    index    => $index,
    create   => 1,
    truncate => 1,
);

while (<DATA>) {
    my ( $id1, $id2maybe, $text ) = /\A([0-9]+);(?:([0-9]+);)?(.+)/;
    for my $id ( grep defined, $id1, $id2maybe ) {
        $indexer->add_doc({ id => $id, content => $text });
    }
}
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

print "Query (q to quit): ";
while ( my $q = <STDIN> ) {
    chomp $q;
    exit if $q =~ /\Aq(uit)?\z/i;

    my $t0   = [gettimeofday()];
    my $hits = $searcher->hits( query => $q );
    while ( my $hit = $hits->next ) {
        printf "%12d -> %s\n", $hit->{id}, $hit->{content};
    }
    printf "\nMatched %s record%s in %1.1f milliseconds\n",
        $hits->total_hits,
        $hits->total_hits == 1 ? "" : "s",
        1_000 * tv_interval( $t0, [gettimeofday()] );
    print "\nQuery: ";
}

__DATA__
Your 200 lines of test data…
moo@cow[51]~>perl pm-1118102
Query (q to quit): archaea
   259697659 -> root;cellular organisms;Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus abyssi;Pyrococcus abyssi GE5;
      272844 -> root;cellular organisms;Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus abyssi;Pyrococcus abyssi GE5;
   289191770 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus;Methanocaldococcus sp. FS406-22;
      644281 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus;Methanocaldococcus sp. FS406-22;
   490653205 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Haloarcula;Haloarcula vallismortis;
       28442 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Haloarcula;Haloarcula vallismortis;
   493010542 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Natronorubrum;Natronorubrum tibetense;
       63128 -> root;cellular organisms;Archaea;Euryarchaeota;Halobacteria;Halobacteriales;Halobacteriaceae;Natronorubrum;Natronorubrum tibetense;
   500681908 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus;Methanococcus aeolicus;
       42879 -> root;cellular organisms;Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus;Methanococcus aeolicus;

Matched 12 records in 0.4 milliseconds

Query: 283552125
   283552125 -> root;Viruses;ssRNA viruses;ssRNA negative-strand viruses;Orthomyxoviridae;Influenzavirus A;Influenza A virus;H5N1 subtype;Influenza A virus (A/chicken/Nigeria/08RS848-4/2006(H5N1));

Matched 1 record in 0.2 milliseconds

Now… what are you getting me for my birthday? :P

Reading: Lucy (lots of reading to do). I expect this will maintain a search speed of a few milliseconds with your full data set. It’s designed to handle millions of much larger and more complex documents. Initial indexing will take a while but you only have to do it once (the script does it every time just to keep the example short and simple). Presentation/splitting of the data content is up to you.
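
In real use you’d run the indexing part once and then just open the existing index for searching; a minimal sketch of that split, reusing the same ./lucy.index directory and fields as above:

#!/usr/bin/env perl
# sketch only: reuse the index built by the script above instead of rebuilding it
use 5.014;
use strictures;
use Lucy;

my $index = "./lucy.index";
die "Build the index first (see the script above)\n" unless -d $index;

# IndexSearcher only reads, so no schema setup is needed here
my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

my $hits = $searcher->hits( query => "archaea" );
while ( my $hit = $hits->next ) {
    printf "%12d -> %s\n", $hit->{id}, $hit->{content};
}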

Re^2: Using indexing for faster lookup in large file
by anli_ (Novice) on Mar 02, 2015 at 17:41 UTC

    Thanks for a great reply. This was exactly what I was looking for.

    It is quite time consuming to do the indexing; it has currently taken > 2 hours (and is still running), but the lookups seem much faster. Testing it out on a smaller dataset, there is a 3x time reduction vs grep, and I'll have to see how that scales with the full dataset.

    I did some modifications to the code, adapting it to the Lucy::Simple module, so right now it looks like this:

    indexer.pl
    #!/usr/bin/perl
    use 5.014;
    use strictures;
    use Lucy::Simple;

    my $index = $ARGV[0];
    system("mkdir -p $index");

    # Lucy::Simple finishes and commits the index when $lucy goes out of scope
    my $lucy = Lucy::Simple->new(
        path     => $index,
        language => 'en',
    );

    open my $data, '<', $ARGV[1] or die $!;
    while ( my $line = <$data> ) {
        my ( $id, $taxid, $text ) = split( /;/, $line, 3 );
        $lucy->add_doc( { id => $id, content => $text } );
    }

    query.pl:
    #!/usr/bin/perl
    use 5.014;
    use strictures;
    use Lucy::Simple;

    my $index = Lucy::Simple->new(
        path     => $ARGV[0],
        language => 'en',
    );

    my $query_string = $ARGV[1];
    my $total_hits   = $index->search( query => $query_string );
    #print "Total hits: $total_hits\n";

    while ( my $hit = $index->next ) {
        print "$hit->{id}\t";
        print "$hit->{content}";
    }

    All the perlmonks XP is your birthday reward :)
Re^2: Using indexing for faster lookup in large file
by erix (Prior) on Mar 04, 2015 at 08:03 UTC

    This Lucy-code is really nice and fast, thanks.

    However, it doesn't work as easily as-is for large files: I let it run for 3 days on a 25 GB file, just the OP-provided 200 lines repeated (on an admittedly slowish AMD 8120, 8 GB memory). I started it last Sunday; today I'd had enough and broke it off.

    2015.03.01 09:35:49 aardvark@xxxx:~
    $ time ./lucy_big.pl
    ^C
    real    4264m3.903s
    user    4205m27.322s
    sys     8m5.160s
    2015.03.04 08:39:58 aardvark@xxxx:

    There is probably a way to do this with better settings...

    A postgres variant, loading the same full 25 GB file, was rock solid and searched reasonably well (~20 ms per search, IIRC; I had to delete it for disk space: size in db was 29 GB).

    Having said that, a pointer-file solution similar to one of the things BrowserUK posted would be my first choice (although I'd likely just use grep -b).
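
    For the record, this is roughly what I mean by a pointer file (a sketch only; the file names are hypothetical, it assumes the numeric id is the first ;-separated field, and for the real 25 GB file you'd want something more compact than an in-memory hash):

    # pointer-file sketch (hypothetical file names)
    use 5.014;
    use strictures;

    my ( $data, $idx ) = ( "hominae.txt", "hominae.idx" );

    # build: one "id <tab> byte-offset" line per record
    open my $in,  "<", $data or die $!;
    open my $out, ">", $idx  or die $!;
    my $pos = 0;
    while ( my $line = <$in> ) {
        my ($id) = $line =~ /\A(\d+);/;
        say {$out} "$id\t$pos" if defined $id;
        $pos = tell $in;
    }
    close $out or die $!;

    # lookup: seek straight to a record instead of scanning the whole file
    open my $ptr, "<", $idx or die $!;
    my %offset = map { chomp; split /\t/ } <$ptr>;
    seek $in, $offset{283552125}, 0;    # 0 = SEEK_SET
    print scalar <$in>;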

    But I'll undoubtedly be able to put your Lucy code to good use (albeit on smaller files), so thanks.

    I'd like to hear from the OP how he fared with Lucy and his large file...

      Hmmm… Indexing is expensive but it shouldn’t take that long for such a simple data set. Perhaps the duplicate data in your autogeneration is harder to segment/index…? Sorry to plunk so much more down without a <readmore/> but I hate clicking on them and this is likely the dying ember of the thread. Anyway, here’s Wonderwall.

      Fake Data Maker

      Took 8 minutes to generate a 30G “db” that felt like a decent facsimile for the purposes here.

      use 5.014;
      use strictures;
      use List::Util "shuffle";

      open my $words, "<", "/usr/share/dict/words" or die $!;
      chomp( my @words = <$words> );
      my $top = @words - 40;
      @words = shuffle @words;

      open my $db, ">", "/tmp/PM.db" or die $!;
      for my $id ( 999_999 .. 999_999_999 ) {
          use integer;
          my $end   = rand($top);
          my $range = rand(35) + 5;
          my $start = $end - $range;
          $start = 0 if $start < 0;
          say {$db} join ";", $id, shuffle @words[ $start .. $end ];
          last if -s $db > 32_000_000_000;
      }

      Indexer

      Took 5h:32m to index 30G of 126,871,745 records. This is a relatively powerful Mac. I suspect doing commits less frequently or only at the end would speed it up a bit but you can only search “live” what’s been committed during indexing.

      use 5.014;
      use strictures;
      use Lucy;

      my $index  = "./lucy.index";
      my $schema = Lucy::Plan::Schema->new;

      my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
      my $text_type    = Lucy::Plan::FullTextType->new( analyzer => $easyanalyzer );
      my $string_type  = Lucy::Plan::StringType->new();

      $schema->spec_field( name => 'id',      type => $string_type );
      $schema->spec_field( name => 'content', type => $text_type );

      open my $db, "<", "/tmp/PM.db" or die $!;

      my $indexer = get_indexer();
      my $counter = 1;
      while (<$db>) {
          chomp;
          my ( $id, $text ) = split /;/, $_, 2;
          $indexer->add_doc({ id => $id, content => $text });

          unless ( $counter++ % 100_000 ) {
              print "committing a batch...\n";
              $indexer->commit;
              $indexer = get_indexer();
          }
      }
      print "optimizing and committing...\n";
      $indexer->optimize;
      $indexer->commit;

      sub get_indexer {
          Lucy::Index::Indexer->new( schema => $schema, index => $index, create => 1 );
      }

      Searcher

      Note: it can be used while indexing is still in progress. Only writes require a lock on the index.

      use 5.014;
      use strictures;
      use Lucy;
      use Time::HiRes "gettimeofday", "tv_interval";
      use Number::Format "format_number";

      my $index    = "./lucy.index";
      my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

      my $all = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
      print "Searching ", format_number( $all->total_hits ), " records.\n";

      print "Query (q to quit): ";
      while ( my $q = <STDIN> ) {
          chomp $q;
          exit if $q =~ /\Aq(uit)?\z/i;

          my $t0   = [gettimeofday()];
          my $hits = $searcher->hits( query => $q, num_wanted => 3 );
          printf "\nMatched %s record%s in %1.2f milliseconds\n",
              format_number( $hits->total_hits ),
              $hits->total_hits == 1 ? "" : "s",
              1_000 * tv_interval( $t0, [gettimeofday()] );

          while ( my $hit = $hits->next ) {
              printf "%12d -> %s\n", $hit->{id}, $hit->{content};
          }
          print "\nQuery: ";
      }

      Some Sample Output

      Some things that this does out of the box, and can easily adapt to any preferred style: stemming, non-stemming, logical OR/AND. Compound queries are generally very cheap. Update: I do no compound queries here; that would involve multiple query objects being connected in the searcher (see the sketch after the output below).

      Searching 126,871,745 records.
      Query (q to quit): ohai

      Matched 0 records in 1.33 milliseconds

      Query: taco

      Matched 0 records in 0.30 milliseconds

      Query: dingo

      Matched 12,498 records in 17.69 milliseconds
          79136688 -> incandescency;scratchiness;ungnarred;dingo;desmachymatous;verderer
          78453332 -> dingo;verderer;incandescency;ungnarred;coinsurance;scratchiness;desmachymatous
          78367042 -> verderer;ungnarred;incandescency;dingo;desmachymatous;scratchiness

      Query: 78311109

      Matched 1 record in 80.07 milliseconds
          78311109 -> revealing;sulfocarbimide;Darwinize;reproclamation;intermedial;Cinclidae

      Query: perl

      Matched 12,511 records in 34.92 milliseconds
          78437383 -> unnoticeableness;radiectomy;brogger;rumorer;oreillet;befan;perle
          59450674 -> perle;Avery;autoxidizability;tidewaiter;radiectomy;filthily
          59125043 -> oreillet;perle;Avery;autoxidizability;filthily;tidewaiter;radiectomy

      Query: pollen OR bee

      Matched 61,997 records in 27.14 milliseconds
         127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
          79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
          78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

      Query: pollen

      Matched 24,674 records in 1.58 milliseconds
         127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
          79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
          78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

      Query: pollen AND bee

      Matched 0 records in 21.61 milliseconds
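
      A hand-rolled compound query would look something like this (a sketch against the same index and 'content' field; note that TermQuery matches the raw indexed term, so no analysis/stemming is applied on the query side):

      use 5.014;
      use strictures;
      use Lucy;

      # sketch: same index directory as the indexer above
      my $searcher = Lucy::Search::IndexSearcher->new( index => "./lucy.index" );

      # connect two query objects with an OR; ANDQuery works the same way
      my $pollen = Lucy::Search::TermQuery->new( field => 'content', term => 'pollen' );
      my $bee    = Lucy::Search::TermQuery->new( field => 'content', term => 'bee' );
      my $either = Lucy::Search::ORQuery->new( children => [ $pollen, $bee ] );

      my $hits = $searcher->hits( query => $either, num_wanted => 3 );
      while ( my $hit = $hits->next ) {
          printf "%12d -> %s\n", $hit->{id}, $hit->{content};
      }
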
      (on an admittedly slowish AMD 8120, 8 GB memory)

      I'll trade you your 3.4/4.0 GHz 8-thread processor for my 2.4 GHz 4-core if you like :)

      If you haven't thrown away that 25GB file and can spare your processor for an hour, I'd love to see a like-for-like comparison of my code in Re: Using indexing for faster lookup in large file (PP < 0.0005s/record).


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        The 25GB file that I used isn't sorted. Your indexer program indexes it in 7 minutes but subsequently the searcher cannot find anything in it (reports 'Not found' 1000 times).

        I've also tried with the file ordered, but initially I could not get it to find anything. It turns out the regex used in your indexer, [^(\d+),], did not match anything in the file. When I fixed that (it had to be [^(\d+);] for the OP's lines), the results were as follows (both measurements are very repeatable).

        Files:

        $ ls -lh hominae.txt hominae_renumbered.*
        -rw-rw-r--. 1 aardvark aardvark 1.5G Mar  4 13:48 hominae_renumbered.idx
        -rw-r--r--. 1 aardvark aardvark  25G Mar  4 13:17 hominae_renumbered.txt
        -rw-rw-r--. 1 aardvark aardvark  25G Feb 28 01:41 hominae.txt

        hominae.txt is the 25GB file which I made by repeating the OP's 200 lines.

        hominae_renumbered.txt is the same file but with the initial numbers replaced by 1 to 13M (so it is ordered).
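
        (For reference, a hypothetical one-liner along these lines is all the renumbering takes: it replaces the leading id on each line with the line number.)

        $ perl -pe 's/^\d+/$./' hominae.txt > hominae_renumbered.txt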

        Timing, your pointer file:

        $ perl browseruk2_searcher.pl \
            hominae_renumbered.txt \
            hominae_renumbered.idx > bukrun; tail -n1 bukrun
        'Lookup averaged 0.012486 seconds/record

        Timing, database search:

        # I took a join to 1000 random numbers as equivalent to 1000 searches:
        # table hm is the table with the 25GB data loaded into it
        $ echo "select * from
            (select (random()*131899400)::int from generate_series(1,1000)) as r(n)
            join hm on r.n = hm.id;" | psql -q | tail -n 1
        Time: 19555.717 ms

        So your pointer file is faster (roughly 12.5 ms vs ~19.6 ms per lookup), but only by a small margin. I thought it was small, anyway; I had expected a much larger difference, with the db of course being the slower contender.

        Your indexing was faster too: it took only ~7 minutes to create. I forgot to time the db load, but it was in the region of half an hour (it could have been sped up a bit by doing the import and the indexing separately).

        Just for the record, here is also the db load:

        time < hominae.txt perl -ne '
            chomp;
            my @arr = split(/;/, $_, 2);
            print $arr[1], "\n";
        ' \
        | psql -c "
            drop table if exists hm;
            create table if not exists hm (line text, id serial primary key);
            copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
        ";

        testdb=# \dti+ hm*
                              List of relations
         Schema |  Name   | Type  |  Owner   | Table |  Size
        --------+---------+-------+----------+-------+---------
         public | hm      | table | aardvark |       | 29 GB
         public | hm_pkey | index | aardvark | hm    | 2825 MB
        (2 rows)