in reply to Using indexing for faster lookup in large file
You may find the subthread starting at Re: Index a file with pack for fast access of interest.
The indexing mechanism discussed there isn't directly applicable to your requirements -- it indexes by line number rather than content -- but it should be adaptable to them.
Assuming your sample data is representative -- i.e. an average record length of 47 bytes -- then 30GB represents ~700 million records. And assuming the key numbers are representative, you'd need a 32-bit int to represent the keys and a 64-bit int to represent the file offsets. Hence your index could be built using:
    open IN, '<', '/path/to/the/Datafile.txt' or die $!;
    open OUT, '>:raw', '/path/to/the/indexfile.idx' or die $!;

    ## For each data record, write its 32-bit key and 64-bit byte offset to the index.
    my $pos = 0;
    print( OUT pack( 'VQ', m[^(\d+),], $pos ) ), $pos = tell( IN ) while <IN>;

    close OUT;
    close IN;
The output file, at 12 bytes per record, would be ~7.6GB.
As the keys in your file appear to be out of order, you would then need to sort that file's fixed-length binary records by key.
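A minimal sketch of that sort, assuming the machine has enough memory to hold the whole index (otherwise a chunked external merge sort would be needed); the file names are illustrative:

    use strict; use warnings;

    ## Slurp the index, split it into fixed-length 12-byte records,
    ## sort by the leading 32-bit key, and write it back out.
    open my $in, '<:raw', '/path/to/the/indexfile.idx' or die $!;
    my $raw = do { local $/; <$in> };
    close $in;

    my @recs = sort { unpack( 'V', $a ) <=> unpack( 'V', $b ) } unpack '(a12)*', $raw;

    open my $out, '>:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
    print {$out} join '', @recs;
    close $out;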
Once sorted, a binary search would take an average of 30 seeks&reads to locate the appropriate 12-byte index record and another seek&read to get the data record.
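A minimal sketch of that lookup, assuming the sorted index produced above; the paths and the key being searched for are illustrative:

    use strict; use warnings;

    my $key = 1234567;   ## hypothetical key to look up

    open my $idx,  '<:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
    open my $data, '<',     '/path/to/the/Datafile.txt'         or die $!;

    ## Binary search the fixed-length 12-byte index records.
    my( $lo, $hi ) = ( 0, ( -s $idx ) / 12 - 1 );
    while( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $idx, $mid * 12, 0;
        read $idx, my $rec, 12;
        my( $k, $offset ) = unpack 'VQ', $rec;
        if   ( $k < $key ) { $lo = $mid + 1 }
        elsif( $k > $key ) { $hi = $mid - 1 }
        else {
            seek $data, $offset, 0;   ## one final seek&read for the data record
            print scalar <$data>;
            last;
        }
    }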
If you have a sufficiently well-spec'd machine with (say) 8GB or more of RAM, you could load the entire index into memory -- as a single big string, accessed as a ramfile -- which would probably reduce your lookup time by much more than an order of magnitude.
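A minimal sketch of that variant, again assuming the sorted index fits in memory:

    ## Load the whole sorted index into one big string ...
    open my $idx, '<:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
    my $ram = do { local $/; <$idx> };
    close $idx;

    ## ... then replace the seek & read in the search above with a substr() fetch:
    ##     my $rec = substr $ram, $mid * 12, 12;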