PerlMonks  

Hello all, I have a 4 column text file that looks like this:
863     1       182856796       0
864     1       182856743       0
865     1       182856690       0
866     1       182856800       0
867     4       147950905       0
868     9       101911655       0
869     9       33113120        1
870     16      79237586        0
871     2       150329972       0
872     10      131981014       1
873     1       236140738       1
874     X       102930959       1
875     2       68407925        1

The first column is the line number. I want to create an index on this file so that I can quickly access the file by line number. I found a recipe that seems to do that in the Perl Cookbook recipe 8.8 - Reading a Particular Line in a File.

However, it is not retrieving the lines properly (probably because my data consists of strings and not unsigned longs). I attempted to re-run the index using Z* and Z and N* and other encodings, but I don't understand pack well enough to know if I'm doing it correctly and I've never managed to get the right string back from my unpack.
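For readers unsure what the `N` template does, here is a minimal sketch of the round trip. It assumes the index is meant to hold byte offsets (numbers) rather than the text of the lines, which is how the code below uses it; under that assumption the template does not need to match the strings in the data file.

```perl
use strict;
use warnings;

# "N" packs one unsigned 32-bit integer in big-endian order: always 4 bytes.
my $entry = pack("N", 182856796);
print length($entry), "\n";

# Unpacking with the same template returns the number that was packed.
my ($offset) = unpack("N", $entry);
print "$offset\n";

# Several offsets packed back to back form a fixed-width table,
# so entry $i starts at byte 4 * $i of the table.
my $index = pack("N*", 0, 34, 68);
my ($third) = unpack("N", substr($index, 4 * 2, 4));
print "$third\n";
```

Templates such as `Z*` store strings of varying length, so they would lose the fixed entry size that makes the `4 * $i` arithmetic possible.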
open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
open(INDEX, "+>$file.idx") or die "Can't open $file.idx for read/write +: $!\n";
build_index(*IN, *INDEX);

# usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;

    while (<$data_file>) {
        print $index_file pack("N", $offset);
        $offset = tell($data_file);
    }
}
Unpack code:
# usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
# returns line or undef if LINE_NUMBER was out of range
sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;        # size of an index entry
    my $i_offset;    # offset into the index of the entry
    my $entry;       # index entry
    my $d_offset;    # offset into the data file

    $size = length(pack("N", 0));
    $i_offset = $size * ($line_number);
    print "size is $size offset is $i_offset\n";
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack("N", $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}
Asking for line 3 using this code gives me back:
size is 4 offset is 12
found line 39109 1 (incorrect data that appears to be from the middle of line 5)
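One way to narrow this down is to run the same technique on a tiny known input. The sketch below is not a diagnosis: it uses in-memory handles in place of the real files, assumes the columns are tab-separated, and uses a hypothetical helper name `line_at` that takes a 0-based position. Note that the lookup code above computes `$size * ($line_number)`, so asking for line 3 reads the fourth index entry, and that the data file is opened from `$oneper` while the index is named after `$file`; both are worth checking.

```perl
use strict;
use warnings;

# Sample data: the first four rows from the question (tabs are an assumption).
my $data = "863\t1\t182856796\t0\n"
         . "864\t1\t182856743\t0\n"
         . "865\t1\t182856690\t0\n"
         . "866\t1\t182856800\t0\n";

# Build the index: one 4-byte "N" entry per line, holding its start offset.
open(my $data_fh, "<", \$data) or die "Can't open data: $!";
binmode($data_fh);                      # keep byte offsets exact
my $index = "";
open(my $out_fh, ">", \$index) or die "Can't open index for writing: $!";
binmode($out_fh);
my $offset = 0;
while (<$data_fh>) {
    print {$out_fh} pack("N", $offset);
    $offset = tell($data_fh);
}
close($out_fh) or die "Can't close index: $!";

# Reopen the finished index for reading.
open(my $index_fh, "<", \$index) or die "Can't open index for reading: $!";
binmode($index_fh);

# Return the line at 0-based position $n, or undef if it is out of range.
sub line_at {
    my ($n) = @_;
    my $size = length(pack("N", 0));
    seek($index_fh, $size * $n, 0) or return;
    my $got = read($index_fh, my $entry, $size);
    return unless defined $got && $got == $size;
    seek($data_fh, unpack("N", $entry), 0) or return;
    return scalar <$data_fh>;
}

print line_at(0);    # first physical line
print line_at(2);    # third physical line
```

If a small test like this behaves but the real file does not, a stale `.idx` file built from different data, or a missing `binmode` on a platform that translates line endings, would be the next things to rule out.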

Thanks in advance for your help!

Update:
My file also contains lines that look like:
513     7       126096599       0
514     Multi
515     7       126116797       0
516     NotOn
517     7       126120072       0
518     7       126129103       0
519     7       126129249       0
520     7       126141464       0
521     7       126172869       0
522     7       126177331       0
523     7       126183528       0
524     19      49379166        1
525     2       172414527       1
526     7
527     19      49379181        1
528     2       172414461       1
529     4       39549110        0
530     21      40195276        1
531     No Results
532     14      39651192        0
533     7
534     7
So the assumption of 34 bytes per line does not hold. Sorry.
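Lines of differing length are the case an offset index is meant for: only the index entries need a fixed width, not the data rows. A small sketch, assuming tab-separated columns and using a hypothetical helper `fetch_line` with 0-based positions:

```perl
use strict;
use warnings;

# Ragged rows modelled on the update above (a hypothetical four-line subset).
my @rows = (
    "513\t7\t126096599\t0\n",
    "514\tMulti\n",
    "515\t7\t126116797\t0\n",
    "516\tNotOn\n",
);
my $data = join "", @rows;

# Record where each row starts. Row lengths differ, so the offsets
# are not multiples of any single record size.
my @offsets;
my $pos = 0;
for my $row (@rows) {
    push @offsets, $pos;
    $pos += length $row;
}
my $index = pack("N*", @offsets);       # fixed 4 bytes per entry

open(my $fh, "<", \$data) or die "Can't open data: $!";

# Return the row at 0-based position $n, or undef if it is out of range.
sub fetch_line {
    my ($n) = @_;
    return if $n < 0 || 4 * ($n + 1) > length $index;
    my $start = unpack("N", substr($index, 4 * $n, 4));
    seek($fh, $start, 0) or return;
    return scalar <$fh>;
}

print "@offsets\n";
print fetch_line(1);    # the short "Multi" row
print fetch_line(3);    # the short "NotOn" row
```

Multiplying the line number by a constant row length would land mid-row as soon as one short row appears, whereas the stored offsets stay correct.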

In reply to Index a file with pack for fast access by Ineffectual
