This creates an index file named after the input file, with '.idx' appended:
#! perl -slw
use strict;
open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;
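# The first record starts at offset 0.
syswrite INDEX, pack( 'N', 0 ), 4;
# After each line is read, tell() gives the offset where the next record starts.
syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;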
close INDEX;
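If it helps to see what ends up in the index, here is a minimal sketch of the same layout on a tiny file: one 4-byte big-endian offset per record, so entry n lives at byte n * 4. The names buildIndex and fetchRecord and the three-line sample file are made up for illustration, and this version builds the index in memory instead of writing a '.idx' file.

```perl
#! perl
use strict;
use warnings;
use File::Temp qw[ tempfile ];

# Build a packed index in memory: one 4-byte big-endian offset per line
# start, beginning with 0. Unlike the script above, no entry is added
# for the end-of-file position.
sub buildIndex {
    my( $path ) = @_;
    open my $in, '<:raw', $path or die "$path: $!";
    my $idx = '';
    my $pos = 0;
    while( <$in> ) {
        $idx .= pack 'N', $pos;    # offset where this line starts
        $pos  = tell $in;          # offset where the next line would start
    }
    close $in;
    return $idx;
}

# Fetch record $n (0-based) by looking up its offset in the packed index.
sub fetchRecord {
    my( $path, $idx, $n ) = @_;
    my $offset = unpack 'N', substr $idx, $n * 4, 4;
    open my $in, '<:raw', $path or die "$path: $!";
    seek $in, $offset, 0;
    my $line = <$in>;
    close $in;
    return $line;
}

# Hypothetical three-line sample file, only for illustration.
my( $fh, $tmp ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} "alpha\n", "beta\n", "gamma\n";
close $fh;

my $idx = buildIndex( $tmp );
print join( ',', unpack 'N*', $idx ), "\n";    # prints 0,6,11
print fetchRecord( $tmp, $idx, 2 );            # prints gamma
```

Because every entry is the same width, finding a record's offset is a single substr and unpack, no matter how large the data file is.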
And this loads the index file for its input argument and then reads 100 records at random:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
# With -s on the shebang line, -N=1000 on the command line overrides the default.
our $N //= 100;
open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;
my $start = time;
open DAT, '<', $ARGV[ 0 ] or die $!;
for( 1 .. $N ) {
# The last index entry is the end-of-file offset, not a record, so skip it.
my $toRead = int rand( length( $idx ) / 4 - 1 );
my $offset = unpack 'N', substr $idx, $toRead * 4, 4;
seek DAT, $offset, 0;
my $line = <DAT>;
# print $line;
}
close DAT;
printf "Ave. %.6f seconds/record\n", ( time() -$start ) / $N;
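The script above slurps the whole index into memory, which is fine at 4 bytes per record. If memory is tight, one alternative is to read only the entry you need straight from the index file. The following is a rough sketch of that variation, not part of the scripts above; the helper name offsetOf and the three sample offsets are invented for illustration.

```perl
#! perl
use strict;
use warnings;
use File::Temp qw[ tempfile ];

# Return the byte offset of record $n (0-based) by reading only its
# 4-byte entry from the index file, rather than loading the whole index.
sub offsetOf {
    my( $idxFh, $n ) = @_;
    sysseek $idxFh, $n * 4, 0 or die "sysseek: $!";
    my $got = sysread $idxFh, my( $buf ), 4;
    die "no index entry for record $n" unless defined $got and $got == 4;
    return unpack 'N', $buf;
}

# Hypothetical index holding three offsets, only for illustration.
my( $fh, $tmp ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} pack( 'N*', 0, 6, 11 );
close $fh;

open my $idxFh, '<:raw', $tmp or die "$tmp: $!";
print offsetOf( $idxFh, $_ ), "\n" for 2, 0;    # prints 11, then 0
close $idxFh;
```

The trade-off is one extra seek and read per lookup against the index file, in exchange for not holding the index in memory.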
And here is a console log with timings for indexing a 1 GB file containing 16 million records and then reading 100 records at random via that index:
[23:03:42.25] c:\test>indexFile 1GB.csv
[23:05:08.24] c:\test>readIndexedFile 1GB.csv
Ave. 0.003699 seconds/record
[23:05:40.38] c:\test>readIndexedFile 1GB.csv
Ave. 0.003991 seconds/record
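One caveat on the format: 'N' packs a 32-bit unsigned value, so it can only hold offsets below 4 GB. That is plenty for the 1 GB file above. For bigger files, one option is 8-byte entries packed with 'Q>', changing each 4 in the two scripts to 8. This is a sketch only, and it assumes a perl built with 64-bit integer support, because 'Q' is a fatal error without it.

```perl
#! perl
use strict;
use warnings;

# Assumption: this perl has 64-bit integer support ('Q' dies otherwise).
my $offset = 5 * 1024**3;           # a 5 GiB offset, too big for 'N'
my $entry  = pack 'Q>', $offset;    # 8-byte big-endian index entry

printf "%d bytes per entry\n", length $entry;    # prints 8 bytes per entry
print unpack( 'Q>', $entry ) == $offset ? "round trip ok\n" : "mismatch\n";
```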