The 25GB file that I used isn't sorted. Your indexer program indexes it in 7 minutes but subsequently the searcher cannot find anything in it (reports 'Not found' 1000 times).

I've also tried with the file ordered, but initially I could not get it to find anything there either. It turns out the regex used in your indexer, /^(\d+),/, did not match anything in the file. Once I changed it to /^(\d+);/ to suit the OP's lines, the results were as follows (both measurements are very repeatable).
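For completeness, this is the kind of thing the fixed regex has to match. A minimal sketch, with a made-up sample line in the OP's id-semicolon format:

  # hypothetical sample line; the real lines come from the OP's data
  my $line = '42;rest of the record for id 42';
  if ( $line =~ /^(\d+);/ ) {   # the original /^(\d+),/ never matched; ';' is the real separator
      print "key = $1\n";
  }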

Files:

$ ls -lh hominae.txt hominae_renumbered.*
-rw-rw-r--. 1 aardvark aardvark 1.5G Mar 4 13:48 hominae_renumbered.idx
-rw-r--r--. 1 aardvark aardvark  25G Mar 4 13:17 hominae_renumbered.txt
-rw-rw-r--. 1 aardvark aardvark  25G Feb 28 01:41 hominae.txt

hominae.txt is the 25GB file which I made by repeating the OP's 200 lines.

hominae_renumbered.txt is the same file but with the initial numbers replaced by 1 to ~132M (so it is ordered).
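Producing such a renumbered copy takes nothing more than something like this (a sketch, not necessarily the exact command; it just swaps the leading number of each line for the current line number):

  perl -pe 's/^\d+/$./' hominae.txt > hominae_renumbered.txt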

Timing, your pointer file:

$ perl browseruk2_searcher.pl \
    hominae_renumbered.txt \
    hominae_renumbered.idx > bukrun; tail -n1 bukrun
Lookup averaged 0.012486 seconds/record
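For anyone who hasn't read the earlier nodes: the pointer file is essentially a list of byte offsets, so a lookup is one seek into the index plus one seek into the data file. A minimal sketch of that technique (my own illustration, assuming 8-byte packed offsets, which is not necessarily the exact format your indexer writes):

  use strict;
  use warnings;

  # look up record $n via a fixed-width offset index;
  # the .idx file is assumed to hold one pack('Q', byte_offset) per data line
  my ( $datafile, $idxfile, $n ) = @ARGV;
  open my $dat, '<',     $datafile or die $!;
  open my $idx, '<:raw', $idxfile  or die $!;

  seek $idx, ( $n - 1 ) * 8, 0;             # record numbers are 1-based
  read $idx, my $buf, 8 or die "no record $n\n";
  my $offset = unpack 'Q', $buf;

  seek $dat, $offset, 0;                    # jump straight to the record
  print scalar <$dat>;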

Timing, database search:

# I took a join to 1000 random numbers as equivalent to 1000 searches:
# table hm is the table with the 25GB data loaded into it
$ echo "
  select *
  from (select (random()*131899400)::int from generate_series(1,1000)) as r(n)
  join hm on r.n = hm.id;
" | psql -q | tail -n 1
Time: 19555.717 ms
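The join is just a convenient way to get the 1000 lookups into one statement; doing them one at a time from a client would look roughly like this (a sketch, assuming the testdb database and the hm table above; the per-statement round trips would make this somewhat slower than the join):

  use strict;
  use warnings;
  use DBI;
  use Time::HiRes qw(time);

  my $dbh = DBI->connect( 'dbi:Pg:dbname=testdb', '', '', { RaiseError => 1 } );
  my $sth = $dbh->prepare( 'select line from hm where id = ?' );

  my $t0 = time();
  for ( 1 .. 1000 ) {
      $sth->execute( 1 + int rand 131_899_400 );   # random existing id
      my ($line) = $sth->fetchrow_array;
  }
  printf "Lookup averaged %f seconds/record\n", ( time() - $t0 ) / 1000;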

So your pointer file is faster, but only by a smaller margin than I expected: about 12.5 s for the 1000 lookups versus about 19.6 s for the database. The db was of course always going to be the slower contender, but I had expected a much larger difference.

Your indexing was faster too: the index took only ~7 minutes to create. I forgot to time the db load, but it was in the region of half an hour (it could have been sped up a bit by doing the import and the indexing separately).

Just for the record, here is the db load as well:

time < hominae.txt perl -ne '
  chomp;
  my @arr = split(/;/, $_, 2);
  print $arr[1], "\n";
' \
| psql -c "
  drop table if exists hm;
  create table if not exists hm (line text, id serial primary key);
  copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
";

testdb=# \dti+ hm*
               List of relations
 Schema |  Name   | Type  |  Owner   | Table |  Size
--------+---------+-------+----------+-------+---------
 public | hm      | table | aardvark |       | 29 GB
 public | hm_pkey | index | aardvark | hm    | 2825 MB
(2 rows)
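And about the "import and index separately" remark above: the usual way is to copy into a table without the primary key and add it afterwards, roughly like this (a sketch, I have not timed this variant):

time < hominae.txt perl -ne '
  chomp;
  my @arr = split(/;/, $_, 2);
  print $arr[1], "\n";
' \
| psql -c "
  drop table if exists hm;
  create table hm (line text, id serial);
  copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
  alter table hm add primary key (id);
";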
