PerlMonks  

Re^3: Index a file with pack for fast access

by BrowserUk (Patriarch)
on Dec 20, 2011 at 23:35 UTC ( [id://944500]=note )


in reply to Re^2: Index a file with pack for fast access
in thread Index a file with pack for fast access

What have I done wrong?

The index is zero-based, so if you want to treat the first line in the file as the 1st rather than the 0th, you need to subtract 1 from the line number before looking it up in the index:

    open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!";
    my $len = -s( INDEX );
    sysread INDEX, my( $idx ), $len;
    close INDEX;

    open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
    foreach my $lineNum (sort {$a cmp $b} keys %todo) {
        my $offset = unpack 'N', substr $idx, ( $lineNum - 1 ) * 4, 4;  ## Modified!!
        print "offset is $offset for linenum $lineNum\n<br>";
        seek FILE, $offset, 0;
        my $line = <FILE>;
        print "found line $line\n";
    }

That should do the trick.
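As a minimal sketch of the "subtract 1" rule, the lookup can be tried on an in-memory file. The sample data and the helper name `offset_of_line` are made up for illustration and are not part of the code posted in this thread:

```perl
use strict;
use warnings;

# Hypothetical sample data: three lines held in memory instead of a file.
my $data = "1\tNoResults\n2\tNoResults\n3\tNotOn\n";

# Build the index: a 0 for the first line, then the offset after each line.
my $idx = pack 'N', 0;
my $pos = 0;
for my $line ( $data =~ /^.*\n/mg ) {
    $pos += length $line;
    $idx .= pack 'N', $pos;
}

# Slot 0 of the index describes the 1st line, hence the "- 1".
sub offset_of_line {
    my( $n ) = @_;
    return unpack 'N', substr $idx, ( $n - 1 ) * 4, 4;
}

# Open the string as a read-only in-memory file and fetch each line by number.
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
for my $n ( 1 .. 3 ) {
    seek $fh, offset_of_line( $n ), 0;
    print "line $n: ", scalar <$fh>;
}
```

Without the `- 1`, asking for line 1 would read slot 1 of the index and return the second line.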


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^4: Index a file with pack for fast access
by Ineffectual (Scribe) on Dec 21, 2011 at 17:31 UTC
    Hi. :) I made the change and it works for line 1:
    offset is 0 for linenum 1
    found line 1 NoResults

    But it doesn't work for line 2 or any of the rest of the lines:
    offset is 4 for linenum 2
    found line Results

    This should give the same thing as the first line, that is:
    offset is 4 for linenum 2
    found line 2 NoResults

    For line 500 for example, it gives:
    offset is 1996 for linenum 500
    found line 53721 0

    Whereas line 500 in the file is:
    500 NotOn

    Searching for 53721 in my file gives me a match on line 107 rather than line 500.
    Do I need to recode my index using N* or Z or Z* or A or A*? Thanks!
      Do I need to recode my index using N* or Z or Z* or A or A*? Thanks!

      Dunno! Why do you think that would help?

      You shouldn't need to if you used the code I posted, but I cannot see what you are now using.

      I create a data file:

      c:\test>perl -e"printf qq[Line %010d\n], $_ for 1 .. 25" > junk.dat

      c:\test>type junk.dat
      Line 0000000001
      Line 0000000002
      Line 0000000003
      Line 0000000004
      Line 0000000005
      Line 0000000006
      Line 0000000007
      Line 0000000008
      Line 0000000009
      Line 0000000010
      Line 0000000011
      Line 0000000012
      Line 0000000013
      Line 0000000014
      Line 0000000015
      Line 0000000016
      Line 0000000017
      Line 0000000018
      Line 0000000019
      Line 0000000020
      Line 0000000021
      Line 0000000022
      Line 0000000023
      Line 0000000024
      Line 0000000025

      I then index it using the code I posted above:

      c:\test>type indexFile.pl
      #! perl -sw
      use strict;

      open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;

      syswrite INDEX, pack( 'N', 0 ), 4;
      syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;
      close INDEX;

      c:\test>indexFile junk.dat

      c:\test>dir junk.dat*
      21/12/2011  17:45    425 junk.dat
      21/12/2011  17:46    104 junk.dat.idx

      c:\test>

      I then read through the data file via the index:

      c:\test>type readIndexedFile.pl
      #! perl -sw
      use strict;
      use Time::HiRes qw[ time ];

      our $N //= 100;

      open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
      my $len = -s( INDEX );
      sysread INDEX, my( $idx ), $len;
      close INDEX;

      sub getRecordN {
          my( $fh, $n ) = @_;
          seek $fh, unpack( 'N', substr $idx, ($n-1) * 4, 4 ), 0;
          return scalar <$fh>;
      }

      open DAT, '<', $ARGV[ 0 ] or die $!;

      for my $line ( 1 .. ( length( $idx ) / 4 ) - 1 ) {
          print "Expecting $line; got: ", getRecordN( *DAT, $line );
      }

      c:\test>readIndexedFile junk.dat
      Expecting 1; got: Line 0000000001
      Expecting 2; got: Line 0000000002
      Expecting 3; got: Line 0000000003
      Expecting 4; got: Line 0000000004
      Expecting 5; got: Line 0000000005
      Expecting 6; got: Line 0000000006
      Expecting 7; got: Line 0000000007
      Expecting 8; got: Line 0000000008
      Expecting 9; got: Line 0000000009
      Expecting 10; got: Line 0000000010
      Expecting 11; got: Line 0000000011
      Expecting 12; got: Line 0000000012
      Expecting 13; got: Line 0000000013
      Expecting 14; got: Line 0000000014
      Expecting 15; got: Line 0000000015
      Expecting 16; got: Line 0000000016
      Expecting 17; got: Line 0000000017
      Expecting 18; got: Line 0000000018
      Expecting 19; got: Line 0000000019
      Expecting 20; got: Line 0000000020
      Expecting 21; got: Line 0000000021
      Expecting 22; got: Line 0000000022
      Expecting 23; got: Line 0000000023
      Expecting 24; got: Line 0000000024
      Expecting 25; got: Line 0000000025

      And everything works as expected. If yours doesn't, then you will have to work out how your code differs from mine.

      Or failing that, you could post your indexing and reading code, and we might be able to help you. But answering your questions without being able to see your current code isn't possible.
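One way to narrow down where the code differs is to check the index against the data file directly. The offsets reported above (0 for line 1, 4 for line 2, 1996 for line 500) all equal (n-1)*4, which looks more like positions within the index than positions within the data. That may point at the indexing step rather than the lookup, but without the actual code it is only a guess. The sketch below uses a hypothetical helper, `bad_index_entries`, and made-up test data to show such a check:

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Returns the 1-based numbers of the lines whose stored offset differs from
# the byte position at which that line really starts in the data file.
sub bad_index_entries {
    my( $data_path, $idx ) = @_;
    open my $fh, '<:raw', $data_path or die "Can't open $data_path: $!";
    my @bad;
    my $n      = 1;
    my $actual = 0;    # where line $n really starts
    while ( defined( my $line = <$fh> ) ) {
        # Treat a missing slot (index too short) as a mismatch.
        my $stored = $n * 4 <= length( $idx )
            ? unpack( 'N', substr $idx, ( $n - 1 ) * 4, 4 )
            : -1;
        push @bad, $n unless $stored == $actual;
        $actual = tell $fh;
        ++$n;
    }
    close $fh;
    return @bad;
}

# Hypothetical test data: 5 tab-separated lines in a temporary file.
my( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "$_\tNoResults\n" for 1 .. 5;
close $tmp;

# A good index: real byte offsets taken from the data file.
open my $in, '<:raw', $path or die "Can't open $path: $!";
my $good = pack 'N', 0;
$good .= pack 'N', tell $in while <$in>;
close $in;

# A suspect index: offsets that are just (n-1)*4, like those reported above.
my $suspect = pack 'N*', map { $_ * 4 } 0 .. 5;

my @g = bad_index_entries( $path, $good );
my @s = bad_index_entries( $path, $suspect );
print "good index, mismatched lines:    ", ( @g ? "@g" : 'none' ), "\n";
print "suspect index, mismatched lines: ", ( @s ? "@s" : 'none' ), "\n";
```

If a check like this flags almost every line, the index file itself is the thing to rebuild.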



        I was thinking it would help to recode it using something else because it seems like what's happening is that the entire line isn't fitting in 4 bytes. Maybe it's the tabs?

        I've uploaded the three files I'm using to gist
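On the "4 bytes" question: in the code posted earlier in the thread, each 'N' entry stores the byte offset at which a line starts, not the line itself, so line length and tabs should not affect the index. A small sketch with made-up lines (one deliberately long and full of tabs) illustrates this:

```perl
use strict;
use warnings;

# Hypothetical lines of very different lengths, one containing several tabs.
my @lines = (
    "1\tNoResults\n",
    "2\t" . ( 'x' x 5000 ) . "\twith\ttabs\n",
    "3\tNotOn\n",
);

# Index the lines as if they were written back to back in a file.
my $idx = pack 'N', 0;
my $pos = 0;
for my $line ( @lines ) {
    $pos += length $line;
    $idx .= pack 'N', $pos;
}

# Each entry is 4 bytes no matter how long the line it points at is.
printf "entries: %d, index bytes: %d\n", @lines + 1, length $idx;
printf "offsets: %s\n", join ' ', unpack 'N*', $idx;

# What 'N' limits is the size of the offset, i.e. the size of the data file.
printf "largest offset 'N' can hold: %u\n", unpack 'N', pack 'N', 0xFFFFFFFF;
```

Switching to 'Z' or 'A' would store strings rather than 32-bit integers, so the `( $n - 1 ) * 4` arithmetic in the lookup would no longer apply as written.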
