The 25GB file that I used isn't sorted. Your indexer program indexes it in 7 minutes, but the searcher then cannot find anything in it (it reports 'Not found' 1000 times).
I also tried with the file ordered, but initially I could not get it to find anything there either. It turns out the regex used in your indexer, [^(\d+),], did not match anything in the file. When I fixed that (it had to be [^(\d+);], since the OP's lines use ';' as the separator), the results were as follows (both measurements are very repeatable).
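For anyone following along, here is roughly what such a pointer-file indexer does, sketched in Python rather than Perl (file names and sample data are made up, and the fixed-width layout assumes the sequential 1..N IDs of the renumbered file; this is an illustration of the technique, not BrowserUk's actual code):

```python
import os
import re
import struct
import tempfile

def build_index(data_path, idx_path):
    """One sequential pass: for every line whose ID matches ^(\\d+); ,
    append its byte offset as a fixed-width 8-byte integer, so the
    offset of record N sits at byte (N-1)*8 of the index."""
    pat = re.compile(rb'^(\d+);')
    offset = 0
    with open(data_path, 'rb') as data, open(idx_path, 'wb') as idx:
        for line in data:
            if pat.match(line):
                idx.write(struct.pack('<Q', offset))
            offset += len(line)

# tiny stand-in for the 25GB file (data and paths are made up)
tmp = tempfile.mkdtemp()
data_path = os.path.join(tmp, 'hominae_small.txt')
idx_path = os.path.join(tmp, 'hominae_small.idx')
with open(data_path, 'wb') as f:
    f.write(b'1;Homo sapiens\n2;Pan troglodytes\n3;Gorilla gorilla\n')
build_index(data_path, idx_path)
with open(idx_path, 'rb') as f:
    raw = f.read()
offsets = list(struct.unpack('<%dQ' % (len(raw) // 8), raw))
```

Note that this is why the wrong separator in the regex is fatal: a line that doesn't match simply never gets an index entry, so every later lookup misses.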
Files:
$ ls -lh hominae.txt hominae_renumbered.*
-rw-rw-r--. 1 aardvark aardvark 1.5G Mar  4 13:48 hominae_renumbered.idx
-rw-r--r--. 1 aardvark aardvark  25G Mar  4 13:17 hominae_renumbered.txt
-rw-rw-r--. 1 aardvark aardvark  25G Feb 28 01:41 hominae.txt
hominae.txt is the 25GB file which I made by repeating the OP's 200 lines.
hominae_renumbered.txt is the same file but with the initial numbers replaced by 1 to ~132M (131,899,400, matching the range in the db query below), so it is ordered.
Timing, your pointer file:
$ perl browseruk2_searcher.pl \
    hominae_renumbered.txt \
    hominae_renumbered.idx > bukrun; tail -n1 bukrun
Lookup averaged 0.012486 seconds/record
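The search side of the pointer-file approach is two seeks per record: one into the fixed-width index for the byte offset, one into the data file. Again a Python sketch with made-up data, not the searcher script itself:

```python
import os
import struct
import tempfile

def lookup(data_path, idx_path, rec_no):
    """Fetch record rec_no (1-based): read its 8-byte offset from
    the fixed-width index, then seek straight to it in the data file."""
    with open(idx_path, 'rb') as idx:
        idx.seek((rec_no - 1) * 8)
        raw = idx.read(8)
    if len(raw) < 8:
        return None                      # record number past end of index
    (offset,) = struct.unpack('<Q', raw)
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline().rstrip(b'\n')

# hand-built toy files (hypothetical data)
tmp = tempfile.mkdtemp()
dpath = os.path.join(tmp, 'd.txt')
ipath = os.path.join(tmp, 'd.idx')
with open(dpath, 'wb') as f:
    f.write(b'1;alpha\n2;beta\n3;gamma\n')
with open(ipath, 'wb') as f:
    f.write(struct.pack('<3Q', 0, 8, 15))   # offsets of the three lines
rec = lookup(dpath, ipath, 2)
missing = lookup(dpath, ipath, 99)
```

With the OS caching the small index file, each lookup costs essentially one seek plus one read in the big file, which is why the per-record times above stay so low.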
Timing, database search:
# I took a join to 1000 random numbers as equivalent to 1000 searches
# table hm is the table with the 25GB data loaded into it
$ echo "
    select * from
      (select (random()*131899400)::int
         from generate_series(1,1000)) as r(n)
      join hm on r.n = hm.id;
" | psql -q | tail -n 1
Time: 19555.717 ms
So your pointer file is faster, but only by a small margin (small to me, anyway; I had expected a much larger difference, though of course with the db always being the slower contender).
Your indexing was faster too: it took only ~7 minutes to create. I forgot to time the db load, but it was in the region of half an hour (it could have been sped up a bit by doing the import and the index build separately).
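The "import first, index afterwards" point can be illustrated with a small standalone sqlite3 stand-in for the Postgres load (table and column names mirror the hm table below, but the data is made up): bulk inserts go into an unindexed table, and the primary-key index is built in one pass at the end instead of being maintained row by row.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# load without any index, so the bulk insert stays cheap
cur.execute('CREATE TABLE hm (id INTEGER, line TEXT)')
cur.executemany('INSERT INTO hm VALUES (?, ?)',
                ((n, 'record %d' % n) for n in range(1, 1001)))

# build the primary-key index only after the load is complete
cur.execute('CREATE UNIQUE INDEX hm_pkey ON hm (id)')
conn.commit()

(count,) = cur.execute('SELECT count(*) FROM hm').fetchone()
(row,) = cur.execute('SELECT line FROM hm WHERE id = 500').fetchone()
```

In Postgres the equivalent would be a plain COPY into a table without constraints, followed by a separate ALTER TABLE / CREATE INDEX, rather than declaring the serial primary key up front as in the load below.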
Just for the record, here is also the db load:
$ time < hominae.txt perl -ne '
    chomp;
    my @arr = split(/;/, $_, 2);
    print $arr[1], "\n";
  ' \
  | psql -c "
    drop table if exists hm;
    create table if not exists hm (line text, id serial primary key);
    copy hm (line) from stdin with (format csv, delimiter E'\t', header FALSE);
  "

testdb=# \dti+ hm*
                      List of relations
 Schema |  Name   | Type  |  Owner   | Table |  Size
--------+---------+-------+----------+-------+---------
 public | hm      | table | aardvark |       | 29 GB
 public | hm_pkey | index | aardvark | hm    | 2825 MB
(2 rows)
In reply to Re^4: Using indexing for faster lookup in large file
by erix
in thread Using indexing for faster lookup in large file
by anli_