Re: Index a file with pack for fast access

This creates an index file with '.idx' appended to the name of the input file:

#! perl -slw
use strict;

open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;
syswrite INDEX, pack( 'N', 0 ), 4;
syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;
close INDEX;
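To make the layout of the index concrete, here is a minimal sketch that builds the same kind of index in memory for a tiny three-line file and decodes it with unpack 'N*'. The helper name build_index and the sample data are illustrative assumptions, not part of the script above; both handles are opened with :raw so the offsets are plain byte positions.

```perl
#! perl
use strict;
use warnings;
use File::Temp;

# Hypothetical helper: return the packed index for the file at $path.
# Entry 0 is 0 (start of the first line); each later entry is the offset
# just past one more line, so the final entry equals the file size.
sub build_index {
    my( $path ) = @_;
    open my $in, '<:raw', $path or die "Can't open $path: $!";
    my $idx = pack 'N', 0;
    while( defined( my $line = <$in> ) ) {
        $idx .= pack 'N', tell( $in );
    }
    close $in;
    return $idx;
}

# Sample data (an assumption for illustration): three short lines.
my $tmp = File::Temp->new;
binmode $tmp;
print {$tmp} "alpha\n", "be\n", "gamma\n";
close $tmp;

my @offsets = unpack 'N*', build_index( $tmp->filename );
print join( ' ', @offsets ), "\n";    # prints "0 6 9 15"
```

Each entry occupies 4 bytes, so entry k lives at byte k * 4 of the index. Note that 'N' is an unsigned 32-bit value, so this format can only address offsets below 4GB.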

And this loads the appropriate index file for its input argument and then reads 100 records at random:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $N //= 100;

open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

my $start = time;
open DAT, '<', $ARGV[ 0 ] or die $!;
for( 1 .. $N ) {
    my $toRead = int rand( length( $idx ) / 4 );
    my $offset = unpack 'N', substr $idx, $toRead * 4, 4;
    seek DAT, $offset, 0;
    my $line = <DAT>;
#    print $line;
}
close DAT;

printf "Ave. %.6f seconds/record\n", ( time() -$start ) / $N;
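One detail worth knowing when working with this index: as written, the indexing script emits the initial 0 plus one entry after every line, so the index holds one more entry than there are records, and the last entry is the end-of-file offset. The sketch below, with hypothetical helper names record_count and record_span, shows how the record count and the byte length of any record could be derived from the packed index alone; it assumes the index was produced exactly as above.

```perl
#! perl
use strict;
use warnings;

# Number of records described by the packed index: one 4-byte entry
# per line, minus the trailing end-of-file entry.
sub record_count {
    my( $idx ) = @_;
    return length( $idx ) / 4 - 1;
}

# Byte offset and length (line terminator included) of record $n, 0-based.
sub record_span {
    my( $idx, $n ) = @_;
    die "record $n out of range\n"
        if $n < 0 or $n >= record_count( $idx );
    # Two adjacent entries: where record $n starts and where the next begins.
    my( $from, $to ) = unpack 'N2', substr( $idx, $n * 4, 8 );
    return ( $from, $to - $from );
}

# Hand-made index for the lines "alpha\n", "be\n", "gamma\n".
my $idx = pack 'N*', 0, 6, 9, 15;

printf "%d records\n", record_count( $idx );
my $pick = int( rand( record_count( $idx ) ) );
my( $offset, $length ) = record_span( $idx, $pick );
print "record $pick: offset $offset, length $length\n";
```

Using record_count rather than length( $idx ) / 4 as the upper bound for rand avoids occasionally picking the end-of-file entry, which would read back an empty record.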

And here is a console log with timings of indexing a 1GB file containing 16 million records and then reading 100 records at random via that index:

[23:03:42.25] c:\test>indexFile 1GB.csv

[23:05:08.24] c:\test>readIndexedFile 1GB.csv
Ave. 0.003699 seconds/record

[23:05:40.38] c:\test>readIndexedFile 1GB.csv
Ave. 0.003991 seconds/record

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


Re^2: Index a file with pack for fast access by Ineffectual (Scribe) on Dec 20, 2011 at 23:22 UTC
Hi BrowserUk,

Thanks for your helpful response! I've created the index as you indicated above in your code, using this:

open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
open(INDEX, ">:raw","$file.idx") or die "Can't open $file.idx for read/write: $!\n";
syswrite INDEX, pack('N',0),4;
while (<IN>) {
    syswrite INDEX, pack('N', tell INDEX), 4;
}
close INDEX;

I've created a file to read, as follows:

open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!";
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
foreach my $lineNum (sort {$a cmp $b} keys %todo) {
    my $offset = unpack 'N', substr $idx, $lineNum * 4, 4;
    print "offset is $offset for linenum $lineNum\n<br>";
    seek FILE, $offset, 0;
    my $line = <FILE>;
    print "found line $line\n";
}

The start of my file is:

1 NoResults
2 NoResults
3 13 32446841 0
4 13 32447221 0
5 7 91839109 1
6 7 91747130 1
7 7 91779556 1
8 7 92408328 0
9 7 92373453 0
10 7 92383887 0
11 7 11364200 0
12 7 11337163 0

When I supply lineNum 3 it gives me back:

offset is 12 for linenum 3
found line 2 NoResults

What have I done wrong? :( It feels like it's not indexing an entire line?
Re^3: Index a file with pack for fast access by BrowserUk (Patriarch) on Dec 20, 2011 at 23:35 UTC
What have I done wrong?

The index is zero-based, so if you want to treat the first line in the file as the 1st rather than the 0th, you need to subtract 1 from the number before looking it up in the index:

open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!";
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
foreach my $lineNum (sort {$a cmp $b} keys %todo) {
    my $offset = unpack 'N', substr $idx, ( $lineNum - 1 ) * 4, 4; ## Modified!!
    print "offset is $offset for linenum $lineNum\n<br>";
    seek FILE, $offset, 0;
    my $line = <FILE>;
    print "found line $line\n";
}

That should do the trick.
Re^4: Index a file with pack for fast access by Ineffectual (Scribe) on Dec 21, 2011 at 17:31 UTC
Hi. :)

I made the change and it works for line 1:

offset is 0 for linenum 1
found line 1 NoResults

But it doesn't work for line 2 or any of the rest of the lines:

offset is 4 for linenum 2
found line Results

This should give the same thing as the first line, that is:

offset is 4 for linenum 2
found line 2 NoResults

For line 500, for example, it gives:

offset is 1996 for linenum 500
found line 53721 0

Whereas line 500 in the file is:

500 NotOn

Searching for 53721 in my file gives me a match on line 107 rather than line 500.

Do I need to recode my index using N* or Z or Z* or A or A*?

Thanks!
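A note on the numbers reported here: the offsets 0, 4 and 1996 are each exactly 4 * (lineNum - 1), which is what the index would contain if every entry recorded the write position in the index file rather than the read position in the data file. The indexing loop posted earlier in the thread calls tell INDEX, so a likely cause is that it should be calling tell on the data handle instead. The following is a minimal sketch under that assumption; the lexical handle names, the temporary file, the sample data (single spaces as separators) and the helper fetch_line are all illustrative, not taken from the posts.

```perl
#! perl
use strict;
use warnings;
use File::Temp qw[ tempdir ];

# Hypothetical helper: return line $lineNum (1-based) via the packed index.
sub fetch_line {
    my( $idx, $fh, $lineNum ) = @_;
    my $offset = unpack 'N', substr( $idx, ( $lineNum - 1 ) * 4, 4 );
    seek $fh, $offset, 0;
    return scalar <$fh>;
}

my $dir    = tempdir( CLEANUP => 1 );
my $oneper = "$dir/oneper.txt";           # stand-in for the real data file
my $index  = "$oneper.idx";

# Sample data modelled on the first lines quoted in the thread.
open my $out, '>:raw', $oneper or die "Can't write $oneper: $!";
print {$out} "1 NoResults\n", "2 NoResults\n", "3 13 32446841 0\n";
close $out;

# Build the index, taking each offset from the DATA handle.
open my $in, '<:raw', $oneper or die "Can't open $oneper: $!";
open my $ix, '>:raw', $index  or die "Can't open $index: $!";
syswrite $ix, pack( 'N', 0 ), 4;
while( <$in> ) {
    syswrite $ix, pack( 'N', tell( $in ) ), 4;   # not tell on the index handle
}
close $ix;
close $in;

# Load the index and look up lines by their 1-based number.
open $ix, '<:raw', $index or die "Can't open $index: $!";
sysread $ix, my( $idx ), -s $index;
close $ix;

open my $dat, '<:raw', $oneper or die "Can't open $oneper: $!";
print "line $_: ", fetch_line( $idx, $dat, $_ ) for 2, 3;
close $dat;
```

With an index built this way the 'N' template needs no change; the entries simply hold positions in the data file (0, 12, 24, 40 for the sample above) instead of multiples of 4.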
Re^5: Index a file with pack for fast access by BrowserUk (Patriarch) on Dec 21, 2011 at 17:51 UTC
Re^6: Index a file with pack for fast access by Ineffectual (Scribe) on Dec 21, 2011 at 18:56 UTC
Some notes below your chosen depth have not been shown here