You may find the subthread starting at Re: Index a file with pack for fast access of interest.
The indexing mechanism discussed there isn't directly applicable to your requirements -- it indexes by line number rather than content -- but it should be adaptable to them.
Assuming your sample data is representative -- i.e. an average record length of 47 bytes -- then 30GB represents ~700 million records. And assuming the key numbers are also representative, you'd need a 32-bit int to represent the keys and a 64-bit int to represent the file offsets. Hence your index could be built using:
open IN,  '<',     '/path/to/the/Datafile.txt'  or die $!;
open OUT, '>:raw', '/path/to/the/indexfile.idx' or die $!;

my $pos = 0;
# For each record: pack its leading key (32-bit 'V') and its start
# offset (64-bit 'Q') into a 12-byte index record, then remember
# where the next record begins.
print( OUT pack 'VQ', m[^(\d+),], $pos ), $pos = tell( IN ) while <IN>;

close OUT;
close IN;
The output file, at 12 bytes per record, would be ~7.6GB.
As the keys in your file appear to be out of order, you would then need to sort that (binary) index file by key.
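A minimal sketch of that sort, assuming you have enough RAM to hold the whole index at once (for a file this size you would otherwise sort it in chunks and merge them); the sorted filename is just illustrative:

# Sort the 12-byte index records numerically by their 32-bit key.
# Assumes the whole index fits in memory.
open my $in, '<:raw', '/path/to/the/indexfile.idx' or die $!;
my $raw = do { local $/; <$in> };                  # slurp the raw index
close $in;

my @recs = unpack '(a12)*', $raw;                  # split into fixed-width records
@recs = sort { unpack( 'V', $a ) <=> unpack( 'V', $b ) } @recs;

open my $out, '>:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
print {$out} @recs;
close $out;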
Once sorted, a binary search would take an average of 30 seeks&reads to locate the appropriate 12-byte index record and another seek&read to get the data record.
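A minimal sketch of that lookup, assuming the sorted index from above; the sub name and file paths are illustrative:

# Binary search over the sorted index on disk: each probe is one
# 12-byte seek & read; a hit costs one more seek & read into the data.
open my $idx, '<:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
open my $dat, '<',     '/path/to/the/Datafile.txt'         or die $!;

sub lookup {
    my $key = shift;
    my ( $lo, $hi ) = ( 0, ( -s $idx ) / 12 - 1 );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $idx, $mid * 12, 0;
        read $idx, my $rec, 12;
        my ( $k, $off ) = unpack 'VQ', $rec;
        if    ( $k < $key ) { $lo = $mid + 1 }
        elsif ( $k > $key ) { $hi = $mid - 1 }
        else  { seek $dat, $off, 0; return scalar <$dat> }
    }
    return;                                        # key not present
}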
If you have a sufficiently well-spec'd machine with (say) 8GB or more of RAM, you could load the entire index into memory -- as a single big string and access it as a ramfile -- which would probably reduce your lookup time by much more than an order of magnitude.
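A sketch of that in-memory variant, again assuming the sorted index file from above; only the seeks & reads on the index are replaced by substr into one big string, and the key value is illustrative:

# Slurp the sorted index into one big string (the "ramfile").
open my $idx, '<:raw', '/path/to/the/indexfile.sorted.idx' or die $!;
my $index = do { local $/; <$idx> };
close $idx;

open my $dat, '<', '/path/to/the/Datafile.txt' or die $!;

my $key = 12345;                                   # key to look up (illustrative)
my ( $lo, $hi ) = ( 0, length( $index ) / 12 - 1 );
while ( $lo <= $hi ) {
    my $mid = int( ( $lo + $hi ) / 2 );
    my ( $k, $off ) = unpack 'VQ', substr $index, $mid * 12, 12;
    if    ( $k < $key ) { $lo = $mid + 1 }
    elsif ( $k > $key ) { $hi = $mid - 1 }
    else {
        seek $dat, $off, 0;                        # single seek & read into the data file
        print scalar <$dat>;
        last;
    }
}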
In reply to Re: Using indexing for faster lookup in large file by BrowserUk, in thread Using indexing for faster lookup in large file by anli_.