Re: Index a file with pack for fast access

by BrowserUk (Patriarch)
on Dec 16, 2011 at 23:07 UTC ( [id://944026]=note )


in reply to Index a file with pack for fast access

This creates an index file with '.idx' appended to the name of the input file:

#! perl -slw
use strict;

open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;

syswrite INDEX, pack( 'N', 0 ), 4;
syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;

close INDEX;
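To make the index layout concrete: each entry is a 4-byte big-endian unsigned integer (`pack 'N'`) holding the byte offset at which a record starts, so entry n sits at byte 4 × n of the index. The following is a minimal sketch rather than the script above: `build_index` is a hypothetical helper that returns the packed index as a string instead of writing the `.idx` file, and the three-line sample file is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the post): build the packed index for $path
# and return it as a string instead of writing it to "$path.idx".
sub build_index {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Can't open $path: $!";
    my $idx = pack 'N', 0;              # record 0 starts at byte 0
    while (<$in>) {
        $idx .= pack 'N', tell $in;     # byte offset where the next record starts
    }
    close $in;
    return $idx;
}

# Made-up sample data: three lines of 6, 3 and 6 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "alpha\n", "be\n", "gamma\n";
close $tmp;

my $idx = build_index($path);
print join(',', unpack 'N*', $idx), "\n";   # prints 0,6,9,15
```

The last entry (15 here) is the end-of-file offset, so the index holds one more entry than the file has lines. Because 'N' is a 32-bit unsigned value, this layout can only address offsets below 4 GiB.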

And this loads the appropriate index file for its input argument and then reads 100 records at random:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $N //= 100;

open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

my $start = time;

open DAT, '<', $ARGV[ 0 ] or die $!;
for( 1 .. $N ) {
    my $toRead = int rand( length( $idx ) / 4 );
    my $offset = unpack 'N', substr $idx, $toRead * 4, 4;
    seek DAT, $offset, 0;
    my $line = <DAT>;
#    print $line;
}
close DAT;

printf "Ave. %.6f seconds/record\n", ( time() -$start ) / $N;
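The lookup step can be isolated as: take 4 bytes at position n × 4 of the index, unpack them to an offset, seek there, and read one line. A minimal sketch of that step, assuming a hypothetical helper `fetch_record` (zero-based, not part of the post), a made-up three-line file, and an index written by hand to match it:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return record $n (zero-based) from $fh using the
# packed index string $idx, or undef if $n is out of range.
sub fetch_record {
    my ($fh, $idx, $n) = @_;
    my $records = length($idx) / 4 - 1;     # final entry is the end-of-file offset
    return if $n < 0 || $n >= $records;
    my $offset = unpack 'N', substr $idx, $n * 4, 4;
    seek($fh, $offset, 0) or die "seek failed: $!";
    return scalar <$fh>;
}

# Made-up sample data: three lines of 6, 3 and 6 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "alpha\n", "be\n", "gamma\n";
close $tmp;

# Start offsets of the three lines, followed by the end-of-file offset.
my $idx = pack 'N*', 0, 6, 9, 15;

open my $dat, '<:raw', $path or die "Can't open $path: $!";
print fetch_record($dat, $idx, 2);  # prints "gamma"
print fetch_record($dat, $idx, 0);  # prints "alpha"
```

The range check is a design choice of this sketch: it treats the final index entry as the end-of-file marker rather than as a record, so a request for it returns undef instead of an empty read.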

And here is a console log with timings of indexing a 1 GB file containing 16 million records and then reading 100 records at random via that index:

[23:03:42.25] c:\test>indexFile 1GB.csv

[23:05:08.24] c:\test>readIndexedFile 1GB.csv
Ave. 0.003699 seconds/record

[23:05:40.38] c:\test>readIndexedFile 1GB.csv
Ave. 0.003991 seconds/record

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^2: Index a file with pack for fast access
by Ineffectual (Scribe) on Dec 20, 2011 at 23:22 UTC
    Hi BrowserUk,

    Thanks for your helpful response! I've created the index as you indicated above in your code, using this:
    open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
    open(INDEX, ">:raw","$file.idx") or die "Can't open $file.idx for read/write: $!\n";
    syswrite INDEX, pack('N',0),4;
    while (<IN>) {
        syswrite INDEX, pack('N', tell INDEX), 4;
    }
    close INDEX;

    I've created a file to read, as follows:
    open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!";
    my $len = -s( INDEX );
    sysread INDEX, my( $idx ), $len;
    close INDEX;
    open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
    foreach my $lineNum (sort {$a cmp $b} keys %todo) {
        my $offset = unpack 'N', substr $idx, $lineNum * 4, 4;
        print "offset is $offset for linenum $lineNum\n<br>";
        seek FILE, $offset, 0;
        my $line = <FILE>;
        print "found line $line\n";
    }

    The start of my file is:
    1       NoResults
    2       NoResults
    3       13      32446841        0
    4       13      32447221        0
    5       7       91839109        1
    6       7       91747130        1
    7       7       91779556        1
    8       7       92408328        0
    9       7       92373453        0
    10      7       92383887        0
    11      7       11364200        0
    12      7       11337163        0
    

    When I supply lineNum 3 it gives me back:
    offset is 12 for linenum 3
    found line 2 NoResults
    

    What have I done wrong? :( It feels like it's not indexing an entire line?
      What have I done wrong?

      The index is zero-based, so if you want to treat the first line in the file as 1st rather than 0th, you need to subtract 1 from the number before looking it up in the index:

      open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!";
      my $len = -s( INDEX );
      sysread INDEX, my( $idx ), $len;
      close INDEX;
      open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
      foreach my $lineNum (sort {$a cmp $b} keys %todo) {
          my $offset = unpack 'N', substr $idx, ( $lineNum - 1 ) * 4, 4; ## Modified!!
          print "offset is $offset for linenum $lineNum\n<br>";
          seek FILE, $offset, 0;
          my $line = <FILE>;
          print "found line $line\n";
      }

      That should do the trick.



        Hi. :) I made the change and it works for line 1:
        offset is 0 for linenum 1
        found line 1 NoResults

        But it doesn't work for line 2 or any of the rest of the lines:
        offset is 4 for linenum 2
        found line Results

        This should return the whole line, as it did for the first line, that is:
        offset is 4 for linenum 2
        found line 2 NoResults

        For line 500 for example, it gives:
        offset is 1996 for linenum 500
        found line 53721 0

        Whereas line 500 in the file is:
        500 NotOn

        Searching for 53721 in my file gives me a match on line 107 rather than line 500.
        Do I need to recode my index using N* or Z or Z* or A or A*? Thanks!
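The offsets reported in this thread follow the pattern 4 × (lineNum - 1): 0 for line 1, 4 for line 2, 1996 for line 500. That is what one would expect if the index recorded positions within the index file itself, and the indexing loop posted earlier in the thread does call `tell INDEX` where the original script used `tell *ARGV` on the input. If that is the cause, the pack template would not need to change. A minimal sketch of the loop with that single change, assuming this diagnosis is right (`$oneper` and `$file` are the poster's names; the sample data is made up):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data standing in for the poster's file.
my ($tmp, $oneper) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "1\tNoResults\n", "2\tNoResults\n", "3\t13\t32446841\t0\n";
close $tmp;

my $file = $oneper;

open(IN, '<:raw', $oneper) or die "Can't open file $oneper for reading: $!\n";
open(INDEX, '>:raw', "$file.idx") or die "Can't open $file.idx for writing: $!\n";
syswrite INDEX, pack('N', 0), 4;
while (<IN>) {
    # tell IN is the position in the data file, i.e. where the next line starts.
    # The offsets reported in the thread (4, 8, 12, ...) suggest tell INDEX
    # was tracking the growth of the index file instead.
    syswrite INDEX, pack('N', tell IN), 4;
}
close INDEX;
close IN;

# Read the index back and show the offsets it holds.
open(INDEX, '<:raw', "$file.idx") or die "Can't open $file.idx for reading: $!\n";
my $idx = do { local $/; <INDEX> };
close INDEX;
unlink "$file.idx";

my @offsets = unpack 'N*', $idx;
print "@offsets\n";    # prints 0 12 24 40
```

With offsets like these, the lookup at ( $lineNum - 1 ) * 4 for line 2 yields 12, the byte at which the second line starts in the sample data.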
