in reply to Re: Using indexing for faster lookup in large file
in thread Using indexing for faster lookup in large file

This Lucy code is really nice and fast, thanks.

However, it doesn't work as easily as-is for large files: I let it run for 3 days on a 25 GB file (just the OP-provided 200 lines, repeated), on an admittedly slowish AMD 8120 with 8 GB memory. I started it last Sunday; today I'd had enough and broke it off.

2015.03.01 09:35:49 aardvark@xxxx:~ $ time ./lucy_big.pl
^C
real    4264m3.903s
user    4205m27.322s
sys     8m5.160s
2015.03.04 08:39:58 aardvark@xxxx:

There is probably a way to do this with better settings...

A postgres variant, loading the same full 25 GB file, was rock solid and searched reasonably well: ~20 ms per search, IIRC (I had to delete it for disk space; size in db: 29 GB).
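A single search in that setup is just an indexed primary-key lookup, along these lines (the hm table layout is the one shown further down in this thread; the id is a made-up example):

-- one indexed primary-key lookup; ~20 ms per search as reported above
select * from hm where id = 1234567;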

Having said that, a pointer-file solution similar to one of the things BrowserUK posted would be my first choice (although I'd likely just use grep -b).
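For what it's worth, here is a minimal sketch of the grep -b idea (file name and record layout as elsewhere in this thread; the record id is a made-up example):

use strict;
use warnings;

# grep -b prefixes each matching line with its byte offset, so a single
# pass answers "where does record 1234567 start?" without any index file.
my $id    = 1234567;                            # hypothetical record id
my ($hit) = `grep -b -m1 '^$id;' hominae.txt`;
my ($pos) = ( $hit // '' ) =~ /^(\d+):/ or die "not found\n";

# With the offset known, seek straight to the record:
open my $fh, '<', 'hominae.txt' or die $!;
seek $fh, $pos, 0 or die $!;
print scalar <$fh>;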

But undoubtedly I'll be able to put your Lucy code to good use (albeit on smaller files), so thanks.

I'd like to hear from the OP how he fared with Lucy and his large file...

Re^3: Using indexing for faster lookup in large file
by Your Mother (Archbishop) on Mar 05, 2015 at 04:37 UTC

    Hmmm… Indexing is expensive but it shouldn't take this long for such a simple data set. Perhaps the duplicated data in your autogeneration is harder to segment/index…? Sorry to plunk so much more down without <readmore/> but I hate clicking on them and this is likely the dying ember of the thread. Anyway, here's Wonderwall.

    Fake Data Maker

    Took 8 minutes to generate a 30G “db” that felt like a decent facsimile for the purposes here.

    use 5.014;
    use strictures;
    use List::Util "shuffle";

    open my $words, "<", "/usr/share/dict/words" or die $!;
    chomp( my @words = <$words> );
    my $top = @words - 40;
    @words = shuffle @words;

    open my $db, ">", "/tmp/PM.db" or die $!;
    for my $id ( 999_999 .. 999_999_999 ) {
        use integer;
        my $end   = rand($top);
        my $range = rand(35) + 5;
        my $start = $end - $range;
        $start = 0 if $start < 0;
        say {$db} join ";", $id, shuffle @words[ $start .. $end ];
        last if -s $db > 32_000_000_000;
    }

    Indexer

    Took 5h:32m to index 30G of 126,871,745 records. This is a relatively powerful Mac. I suspect doing commits less frequently, or only at the end, would speed it up a bit, but you can only search “live” what has been committed during indexing.

    use 5.014;
    use strictures;
    use Lucy;

    my $index  = "./lucy.index";
    my $schema = Lucy::Plan::Schema->new;
    my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
    my $text_type    = Lucy::Plan::FullTextType->new( analyzer => $easyanalyzer );
    my $string_type  = Lucy::Plan::StringType->new;
    $schema->spec_field( name => 'id',      type => $string_type );
    $schema->spec_field( name => 'content', type => $text_type );

    open my $db, "<", "/tmp/PM.db" or die $!;

    my $indexer = get_indexer();
    my $counter = 1;
    while (<$db>) {
        chomp;
        my ( $id, $text ) = split /;/, $_, 2;
        $indexer->add_doc( { id => $id, content => $text } );
        unless ( $counter++ % 100_000 ) {
            print "committing a batch...\n";
            $indexer->commit;
            $indexer = get_indexer();
        }
    }
    print "optimizing and committing...\n";
    $indexer->optimize;
    $indexer->commit;

    sub get_indexer {
        Lucy::Index::Indexer->new( schema => $schema, index => $index, create => 1 );
    }

    Searcher

    Note that it can be used while indexing is still in progress. Only writes require a lock on the index.

    use 5.014;
    use strictures;
    use Lucy;
    use Time::HiRes "gettimeofday", "tv_interval";
    use Number::Format "format_number";

    my $index    = "./lucy.index";
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );

    my $all = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
    print "Searching ", format_number( $all->total_hits ), " records.\n";
    print "Query (q to quit): ";
    while ( my $q = <STDIN> ) {
        chomp $q;
        exit if $q =~ /\Aq(uit)?\z/i;
        my $t0   = [gettimeofday()];
        my $hits = $searcher->hits( query => $q, num_wanted => 3 );
        printf "\nMatched %s record%s in %1.2f milliseconds\n",
            format_number( $hits->total_hits ),
            $hits->total_hits == 1 ? "" : "s",
            1_000 * tv_interval( $t0, [gettimeofday()] );
        while ( my $hit = $hits->next ) {
            printf "%12d -> %s\n", $hit->{id}, $hit->{content};
        }
        print "\nQuery: ";
    }

    Some Sample Output

    Some things this does out of the box and that can easily be adapted to any preferred style: stemming, non-stemming, logical OR/AND. Compound queries are generally very cheap. Update: I do no compound queries here; that would involve multiple query objects being connected in the searcher (see the sketch after the sample output).

    Searching 126,871,745 records.
    Query (q to quit): ohai

    Matched 0 records in 1.33 milliseconds

    Query: taco

    Matched 0 records in 0.30 milliseconds

    Query: dingo

    Matched 12,498 records in 17.69 milliseconds
        79136688 -> incandescency;scratchiness;ungnarred;dingo;desmachymatous;verderer
        78453332 -> dingo;verderer;incandescency;ungnarred;coinsurance;scratchiness;desmachymatous
        78367042 -> verderer;ungnarred;incandescency;dingo;desmachymatous;scratchiness

    Query: 78311109

    Matched 1 record in 80.07 milliseconds
        78311109 -> revealing;sulfocarbimide;Darwinize;reproclamation;intermedial;Cinclidae

    Query: perl

    Matched 12,511 records in 34.92 milliseconds
        78437383 -> unnoticeableness;radiectomy;brogger;rumorer;oreillet;befan;perle
        59450674 -> perle;Avery;autoxidizability;tidewaiter;radiectomy;filthily
        59125043 -> oreillet;perle;Avery;autoxidizability;filthily;tidewaiter;radiectomy

    Query: pollen OR bee

    Matched 61,997 records in 27.14 milliseconds
       127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
        79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
        78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

    Query: pollen

    Matched 24,674 records in 1.58 milliseconds
       127851379 -> sley;Phalaris;pollen;brasque;snuffle;excalate;operculigenous
        79011524 -> rave;uliginose;gibel;pollened;uncomprised;salve;topognosia
        78853424 -> topognosia;gibel;rave;uncomprised;pollened;uliginose;salve

    Query: pollen AND bee

    Matched 0 records in 21.61 milliseconds
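    Following up the update above: wiring query objects together directly for a compound query would look roughly like this (a sketch using Lucy's TermQuery/ANDQuery classes; index path and field name as in the indexer above):

    use 5.014;
    use strictures;
    use Lucy;

    my $searcher = Lucy::Search::IndexSearcher->new( index => './lucy.index' );

    # Build "pollen AND bee" from explicit query objects instead of a raw
    # query string. TermQuery matches one analyzed term in one field.
    my $pollen = Lucy::Search::TermQuery->new( field => 'content', term => 'pollen' );
    my $bee    = Lucy::Search::TermQuery->new( field => 'content', term => 'bee' );
    my $query  = Lucy::Search::ANDQuery->new( children => [ $pollen, $bee ] );

    my $hits = $searcher->hits( query => $query, num_wanted => 3 );
    printf "Matched %d records\n", $hits->total_hits;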
Re^3: Using indexing for faster lookup in large file
by BrowserUk (Patriarch) on Mar 04, 2015 at 08:51 UTC
    (on an admittedly slowish AMD 8120, 8 GB memory)

    I'll trade you your 3.4/4.0 GHz 8-thread processor for my 2.4 GHz 4-core if you like :)

    If you haven't thrown away that 25GB file and can spare your processor for an hour, I'd love to see a like-for-like comparison of my code in Re: Using indexing for faster lookup in large file (PP < 0.0005s/record).


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      The 25GB file that I used isn't sorted. Your indexer program indexes it in 7 minutes, but subsequently the searcher cannot find anything in it (it reports 'Not found' 1000 times), presumably because the binary search needs the ids to be in order.

      I've also tried with the file ordered, but initially I could not get it to find anything there either. It turns out the regex in your indexer did not match anything in the file: it expects ^(\d+), while the OP's lines need ^(\d+); since the delimiter is a semicolon. With that fixed, the results were as follows (both measurements are very repeatable).
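      In code, the fix amounts to one line (the variable name here is my assumption about the surrounding read loop in the indexer):

      # The OP's records look like "1234567;word;word;...", so the id has
      # to be split off at the semicolon delimiter rather than at a comma:
      my ($id) = $line =~ /^(\d+);/;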

      Files:

      $ ls -lh hominae.txt hominae_renumbered.*
      -rw-rw-r--. 1 aardvark aardvark 1.5G Mar  4 13:48 hominae_renumbered.idx
      -rw-r--r--. 1 aardvark aardvark  25G Mar  4 13:17 hominae_renumbered.txt
      -rw-rw-r--. 1 aardvark aardvark  25G Feb 28 01:41 hominae.txt

      hominae.txt is the 25GB file which I made by repeating the OP's 200 lines.

      hominae_renumbered.txt is the same file but with the initial numbers replaced by 1 to 13M (so it is ordered).
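      (The exact command isn't shown here; a renumbering like that can be done with a one-liner along these lines, replacing each leading id with the current input line number $.:)

      $ perl -pe 's/^\d+/$./' hominae.txt > hominae_renumbered.txt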

      Timing, your pointer file:

      $ perl browseruk2_searcher.pl \
          hominae_renumbered.txt \
          hominae_renumbered.idx > bukrun; tail -n1 bukrun
      'Lookup averaged 0.012486 seconds/record

      Timing, database search:

      # I took a join to 1000 random numbers as equivalent to 1000 searches:
      # table hm is the table with the 25GB data loaded into it
      $ echo "
          select * from
            (select (random()*131899400)::int from generate_series(1,1000)) as r(n)
            join hm on r.n = hm.id;
      " | psql -q | tail -n 1
      Time: 19555.717 ms

      So your pointer file is faster, but only by a small margin (I thought it was small, anyway; I had expected a much larger difference, with the db of course being the slower contender).

      Your indexing was faster too: it took only ~7 minutes to create. I forgot to time the db load, but that was in the region of half an hour (it could have been sped up a bit by doing the import and the indexing separately).

      Just for the record, here is the db load as well:

      time < hominae.txt perl -ne '
          chomp;
          my @arr = split(/;/, $_, 2);
          print $arr[1], "\n";
      ' \
      | psql -c "
          drop table if exists hm;
          create table if not exists hm (line text, id serial primary key);
          copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
      ";

      testdb=# \dti+ hm*
                        List of relations
       Schema |  Name   | Type  |  Owner   | Table |  Size
      --------+---------+-------+----------+-------+---------
       public | hm      | table | aardvark |       | 29 GB
       public | hm_pkey | index | aardvark | hm    | 2825 MB
      (2 rows)
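      The "import and index separately" speedup mentioned above would amount to something like this (a sketch only; same table layout as the load above):

      -- load into an unindexed table, then build the primary key in one
      -- bulk pass instead of maintaining the index row by row during copy
      drop table if exists hm;
      create table hm (line text, id serial);
      copy hm (line) from stdin;
      alter table hm add primary key (id);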

        Thanks for doing that, erix.

        'Lookup averaged 0.012486 seconds/record

        Hm. Disappointed with that. I suspect a good deal of that time is down to writing the 1000 found records to the disk.

        I suspect that if you commented out the print of the records and reran it, it'd be more in line with the numbers I get here:

        for my $i ( 1 .. $N ) {
            my $rndRec = 1 + int rand( 160e6 );
        #    printf "Record $rndRec: ";
            my $pos = binsearch( \$idx, $rndRec );
            if( $pos ) {
                seek DATA, $pos, 0;
        #        printf "'%s'", scalar <DATA>;
            }
        }

        The first number is the time taken to load the index. The second run is with a warm cache:

        E:\>c:\test\1118102-searcher e:30GB.dat e:30GB.idx
        16.8919820785522
        Lookup averaged 0.009681803 seconds/record

        E:\>c:\test\1118102-searcher e:30GB.dat e:30GB.idx
        4.17907309532166
        Lookup averaged 0.009416031 seconds/record

        Of course, if I run it on an SSD, it looks much nicer, especially as the cache warms up:

        E:\>c:\test\1118102-searcher s:30GB.dat s:30GB.idx
        33.1236040592194
        Lookup averaged 0.000902344 seconds/record

        E:\>c:\test\1118102-searcher s:30GB.dat s:30GB.idx
        3.44389009475708
        Lookup averaged 0.000789429 seconds/record

        E:\>c:\test\1118102-searcher s:30GB.dat s:30GB.idx
        4.35790991783142
        Lookup averaged 0.000551061 seconds/record

        E:\>c:\test\1118102-searcher s:30GB.dat s:30GB.idx
        3.86181402206421
        Lookup averaged 0.000482989 seconds/record

        E:\>c:\test\1118102-searcher s:30GB.dat s:30GB.idx
        4.66845011711121
        Lookup averaged 0.000458750 seconds/record
