http://qs1969.pair.com?node_id=944000

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a 4 column text file that looks like this:
863     1       182856796       0
864     1       182856743       0
865     1       182856690       0
866     1       182856800       0
867     4       147950905       0
868     9       101911655       0
869     9       33113120        1
870     16      79237586        0
871     2       150329972       0
872     10      131981014       1
873     1       236140738       1
874     X       102930959       1
875     2       68407925        1

The first column is the line number. I want to create an index on this file so that I can quickly access the file by line number. I found a recipe that seems to do that in the Perl Cookbook recipe 8.8 - Reading a Particular Line in a File.

However, it is not retrieving the lines properly (probably because my data consists of strings rather than unsigned longs). I attempted to rebuild the index using Z* and Z and N* and other pack templates, but I don't understand pack well enough to know whether I'm doing it correctly, and I've never managed to get the right string back from my unpack.
open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
open(INDEX, "+>$file.idx") or die "Can't open $file.idx for read/write: $!\n";
build_index(*IN, *INDEX);

# usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;
    while (<$data_file>) {
        print $index_file pack("N", $offset);
        $offset = tell($data_file);
    }
}
Unpack code:
# usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
# returns line or undef if LINE_NUMBER was out of range
sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;       # size of an index entry
    my $i_offset;   # offset into the index of the entry
    my $entry;      # index entry
    my $d_offset;   # offset into the data file

    $size     = length(pack("N", 0));
    $i_offset = $size * ($line_number);
    print "size is $size offset is $i_offset\n";
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack("N", $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}
Asking for line 3 using this code gives me back:
size is 4
offset is 12
found line 39109 1 (incorrect data that appears to be from the middle of line 5)

Thanks in advance for your help!
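[A note on the templates being tried above: 'N' is a fixed-width template -- exactly four big-endian bytes per value -- which is precisely what makes the entry-number arithmetic in the Cookbook recipe work. 'Z*' and friends are variable-width, so an index built with them cannot be addressed as entry * width. A small sketch:]

```perl
use strict;
use warnings;

# pack 'N' turns an unsigned integer into exactly 4 big-endian bytes,
# so every index entry has the same width and entry i starts at byte i*4.
my $entry = pack( 'N', 182856796 );
print length( $entry ), "\n";                      # always 4

# unpack 'N' reverses it exactly:
print unpack( 'N', $entry ), "\n";                 # 182856796

# 'Z*' is a NUL-terminated string whose length depends on the value,
# so an index built with it cannot be seeked by entry number.
print length( pack( 'Z*', '182856796' ) ), "\n";   # 10: nine digits + NUL
```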

Update:
My file also contains lines that look like:
513     7       126096599       0
514     Multi
515     7       126116797       0
516     NotOn
517     7       126120072       0
518     7       126129103       0
519     7       126129249       0
520     7       126141464       0
521     7       126172869       0
522     7       126177331       0
523     7       126183528       0
524     19      49379166        1
525     2       172414527       1
526     7
527     19      49379181        1
528     2       172414461       1
529     4       39549110        0
530     21      40195276        1
531     No Results
532     14      39651192        0
533     7
534     7
So the assumption of 34 bytes per line doesn't hold. Sorry.

Replies are listed 'Best First'.
Re: Index a file with pack for fast access
by moritz (Cardinal) on Dec 16, 2011 at 19:07 UTC

    If the rest of your file looks the same, you don't even need an index. Each line that you pasted has 34 characters (assuming that the newline is a single line-feed, and not CR LF as on Windows), so to get the $n-th line (and you start counting the lines from 0), you can just do

    my $bytes_per_line = 34;
    my $pos = $line_number * $bytes_per_line;
    seek(IN, $pos, 0);
    my $line = <IN>;

    If you need to access by the first column (and not line number), subtract the value of the first row before calculating the line number.
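    [That adjustment could be sketched like this -- toy data with a made-up record width, using an in-memory filehandle purely for illustration:]

```perl
use strict;
use warnings;

# Fixed-width records mean no index is needed: record n starts at
# n * width. To look up by the first column's value instead of the
# 0-based line number, subtract the first row's ID first.
my $data = "863 x\n864 x\n865 x\n866 x\n";   # toy records, 6 bytes each
my $bytes_per_line = 6;
my $first_id       = 863;                    # column 1 of the first row

open my $in, '<', \$data or die $!;          # in-memory handle for the sketch
my $wanted_id = 866;
seek( $in, ( $wanted_id - $first_id ) * $bytes_per_line, 0 );
my $line = <$in>;
print $line;                                 # the record whose column 1 is 866
```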

      I updated the original post. Not all of the lines are 34 bytes long and I should have posted a better cross section of the file. Sorry.
Re: Index a file with pack for fast access
by BrowserUk (Patriarch) on Dec 16, 2011 at 23:07 UTC

    This creates an index file with the '.idx' appended to the name of the input file:

    #! perl -slw
    use strict;

    open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;
    syswrite INDEX, pack( 'N', 0 ), 4;
    syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;
    close INDEX;

    And this loads the appropriate index file for its input argument and then reads 100 records at random:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    our $N //= 100;

    open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
    my $len = -s( INDEX );
    sysread INDEX, my( $idx ), $len;
    close INDEX;

    my $start = time;
    open DAT, '<', $ARGV[ 0 ] or die $!;
    for( 1 .. $N ) {
        my $toRead = int rand( length( $idx ) / 4 );
        my $offset = unpack 'N', substr $idx, $toRead * 4, 4;
        seek DAT, $offset, 0;
        my $line = <DAT>;
    #    print $line;
    }
    close DAT;
    printf "Ave. %.6f seconds/record\n", ( time() - $start ) / $N;

    And here is a console log with timings of indexing a 1GB file containing 16 million records and then reading 100 records at random via that index:

    [23:03:42.25] c:\test>indexFile 1GB.csv
    [23:05:08.24] c:\test>readIndexedFile 1GB.csv
    Ave. 0.003699 seconds/record
    [23:05:40.38] c:\test>readIndexedFile 1GB.csv
    Ave. 0.003991 seconds/record

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Hi BrowserUk,

      Thanks for your helpful response! I've created the index as you indicated above in your code, using this:
      open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
      open(INDEX, ">:raw", "$file.idx") or die "Can't open $file.idx for read/write: $!\n";
      syswrite INDEX, pack('N', 0), 4;
      while (<IN>) {
          syswrite INDEX, pack('N', tell INDEX), 4;   # note: this takes INDEX's position, not IN's
      }
      close INDEX;

      I've created a file to read, as follows:
      open INDEX, "<:raw", "$index" or die "Can't open $index for reading: $!";
      my $len = -s( INDEX );
      sysread INDEX, my( $idx ), $len;
      close INDEX;

      open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
      foreach my $lineNum (sort {$a cmp $b} keys %todo) {
          my $offset = unpack 'N', substr $idx, $lineNum * 4, 4;
          print "offset is $offset for linenum $lineNum\n<br>";
          seek FILE, $offset, 0;
          my $line = <FILE>;
          print "found line $line\n";
      }

      The start of my file is:
      1       NoResults
      2       NoResults
      3       13      32446841        0
      4       13      32447221        0
      5       7       91839109        1
      6       7       91747130        1
      7       7       91779556        1
      8       7       92408328        0
      9       7       92373453        0
      10      7       92383887        0
      11      7       11364200        0
      12      7       11337163        0
      

      When I supply lineNum 3 it gives me back:
      offset is 12 for linenum 3
      found line 2 NoResults
      

      What have I done wrong? :( It feels like it's not indexing an entire line?
        What have I done wrong?

        The index is zero-based, so if you want to treat the first line in the file as 1st rather than 0th, you need to subtract 1 from the number before looking it up in the index:

        open INDEX, "<:raw", "$index" or die "Can't open $index for reading: $!";
        my $len = -s( INDEX );
        sysread INDEX, my( $idx ), $len;
        close INDEX;

        open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
        foreach my $lineNum (sort {$a cmp $b} keys %todo) {
            my $offset = unpack 'N', substr $idx, ( $lineNum - 1 ) * 4, 4;   ## Modified!!
            print "offset is $offset for linenum $lineNum\n<br>";
            seek FILE, $offset, 0;
            my $line = <FILE>;
            print "found line $line\n";
        }

        That should do the trick.



Re: Index a file with pack for fast access
by BrowserUk (Patriarch) on Dec 16, 2011 at 19:52 UTC

    open(INDEX, "+>$file.idx") or die "Can't open $file.idx for read/write: $!\n";

    If you are using Windows -- and it will do no harm even if you are not -- one thing you should be doing is binmode'ing your index file.

    pack 'N' can produce values containing bytes that look like various control characters -- \r, \n, ^Z etc. -- that will cause Windows to do nasty things to your index file unless you tell it that the file will contain binary data.
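    [A small illustration of why: offsets whose packed bytes happen to equal 10 or 13 embed LF or CR characters in the index entry, and Windows text mode rewrites those on output:]

```perl
use strict;
use warnings;

# An offset whose low byte is 13 packs to a string containing "\r";
# one whose low byte is 10 contains "\n".
my $entry = pack( 'N', 13 );              # bytes 0x00 0x00 0x00 0x0D
print $entry =~ /\r/ ? "contains CR\n" : "clean\n";

# In text mode, Windows translates "\n" to "\r\n" on write, silently
# corrupting the 4-byte entries; binmode (or the ':raw' layer) disables that.
open INDEX, '+>', 'file.idx' or die $!;
binmode INDEX;
close INDEX;
```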



Re: Index a file with pack for fast access
by juster (Friar) on Dec 16, 2011 at 23:38 UTC
    Asking for line 3 using this code gives me back:
        size is 4
        offset is 12
        found line 39109 1 (incorrect data that appears to be from the middle of line 5)

    There is an error in your logic. The offset for line 3 should be 8. Line 1 is offset 0, line 2 is 4, line 3 is 8. The lines are one-based and the index offsets are zero-based so the offset to the index would be size * (line-1).

    A similar off-by-one error is in your index-writing sub. You never write the offset of the last line of the text file. If I were writing this I would copy BrowserUk's example and simply write an offset of 0 to the index file straight away.
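    [Putting both corrections together, the OP's two subs might look like this -- a sketch using the thread's approach; the key changes are writing offset 0 before the read loop and indexing by $line_number - 1:]

```perl
use strict;
use warnings;

# usage: build_index($data_fh, $index_fh)
# Writes offset 0 up front, then the offset *after* each line, so every
# line -- including the last -- gets an entry.
sub build_index {
    my ( $data_file, $index_file ) = @_;
    print $index_file pack( 'N', 0 );
    print $index_file pack( 'N', tell $data_file ) while <$data_file>;
}

# usage: line_with_index($data_fh, $index_fh, $line_number)
# Lines are 1-based, index entries 0-based: entry for line n is at (n - 1) * 4.
sub line_with_index {
    my ( $data_file, $index_file, $line_number ) = @_;
    seek( $index_file, 4 * ( $line_number - 1 ), 0 ) or return;
    read( $index_file, my $entry, 4 ) == 4 or return;
    seek( $data_file, unpack( 'N', $entry ), 0 );
    return scalar <$data_file>;
}
```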

Re: Index a file with pack for fast access
by sundialsvc4 (Abbot) on Dec 21, 2011 at 00:19 UTC

    Now, this is an interesting thread, and I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file, or (b) at least indexing the data using such a mechanism.   I mean ... without(!!) debating the validity of your chosen solution to this problem, why did you choose it?   I am genuinely interested in your response.

      I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file,

      Firstly, I didn't advocate it. I simply supplied an answer to the OPs question.

      But, there are several circumstances under which I might (and have) used it.

      • If the file is huge, i.e. > 4GB.

        Feeding all the data into a DB will require (at least) double the diskspace. Even if only temporarily.

        SQLite is better in this regard than most other DBs, but it still requires substantially more space than creating an external index to the existing file.

      • If the file contains records that are not in a convenient format for bulk importation to the DB.

        SQLite's .import command expects char-delimited (tabs, commas, pipes or similar) fields within newline delimited records.

        It cannot handle (without preprocessing) importing data from:

        • Fixed length undelimited fields from a binary file.
        • Multi-line records. Eg FASTA files.

        The Perl indexer can do both those and many other variations in a single pass.

      • If I was in a hurry.

        SQLite's .import file table command can take days to import 16 million records from a 1GB file. I can index it with Perl in 80 seconds.

        If you discover the appropriate set of magic incantations -- which the SQLite folk seem to take perverse pleasure in keeping hidden -- then their bulk loader can be made to get much closer to Perl's performance, but then adding an index takes hours.

      • Once built, SQLite's random access performance is very good:
        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];
        use DBI;

        our $N //= 1e3;

        my $start = time;
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '' ) or die DBI::errstr;
        my $sth = $dbh->prepare( 'select * from onegb where ROWID = ?' ) or die DBI->errstr;
        for( 1 .. $N ) {
            my $rowID = 1 + int( rand 16*1024**2 - 1 );
            $sth->execute( $rowID ) or die $sth->errstr;
            my $data = $sth->fetch or die $sth->errstr;
            my @row = @{ $data };
        }
        printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;
        __END__
        c:\test>1gbdb -N=1e6
        Ave: 0.000711 seconds/record

        , but it still falls short of using the perl-built index:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        our $N //= 100;

        my $start = time;
        open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
        my $len = -s( INDEX );
        sysread INDEX, my( $idx ), $len;
        close INDEX;

        sub getRecordN {
            my( $fh, $n ) = @_;
            seek $fh, unpack( 'N', substr $idx, $n * 4, 4 ), 0;
            return scalar <$fh>;
        }

        scalar <ARGV>;
        my @lines; $#lines = $N; $#lines = 0;
        for( 1 .. $N ) {
            my $toRead = int rand( length( $idx ) / 4 );
            my $line = getRecordN( \*ARGV, $toRead );
        #    push @lines, "$toRead : $offset : ", $line;
        }
        printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;
        __END__
        c:\test>readIndexedFile -N=1e6 1GB.csv
        Ave. 0.000481 seconds/record

      SQLite is obviously more powerful, but if you do not need that power, it is of no benefit. And it comes at the cost of flexibility unless you drive it from Perl, at which point you're back to populating it via the DBI interface, with the inherent slowdown.
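      [To illustrate the single-pass flexibility mentioned in the list above -- hypothetical code, not from the thread -- indexing multi-line records such as FASTA only changes the condition that decides when an offset is recorded:]

```perl
use strict;
use warnings;

# Record the start offset of every FASTA record (a '>' header line
# followed by any number of sequence lines), all in one pass.
sub build_fasta_index {
    my ( $data_file, $index_file ) = @_;
    my $offset = 0;
    while ( my $line = <$data_file> ) {
        print $index_file pack( 'N', $offset ) if $line =~ /^>/;
        $offset = tell $data_file;
    }
}
```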



        Interesting.   (Upvoted.)   Of course it is understood that you are answering, not advocating, but I found the answer interesting and informative.

        As an aside, one characteristic of SQLite that “bit me bad” at first is the way that this system handles transactions.   Basically, you must have one, because if you don’t, SQLite will physically verify every single disk write by reading the information again.   Which certainly can result in the “hours or days” concern, and then rather dramatically relieve that concern.   I admit that I tend towards the use of SQL-based systems mainly so that I can subsequently run queries against them.   Perhaps I do not use hand-built searching techniques enough.   Thanks for your example.
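        [In DBI terms, that usually means wrapping the bulk load in an explicit transaction -- a sketch with made-up database and table names:]

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
                        { RaiseError => 1 } );
$dbh->do( 'CREATE TABLE IF NOT EXISTS t ( id INTEGER, val TEXT )' );

# Without a transaction, every INSERT is individually synced to disk;
# begin_work/commit batches the whole load into one.
$dbh->begin_work;
my $sth = $dbh->prepare( 'INSERT INTO t ( id, val ) VALUES ( ?, ? )' );
$sth->execute( $_, "row $_" ) for 1 .. 1000;
$dbh->commit;
```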

      ... would very sincerely like to know, BrowserUK, ...

      Perhaps you can ask BrowserUK directly, by responding to one of his posts?