mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I have some large text files which are sorted alphabetically. Most lines are variable length. I'd like to make a byte offset index of them, so that I can jump to the position where a line begins with BAX, or TRN, or TUX, or any other combination. I am only interested in the first 3 characters. Would I use pack for this? Right now, my plan is to just grab the first 24 bits of each line, then note the byte offset in the file, keeping only the first occurrence of each unique 24 bits. Is there a more elegant/programmatic way?

Replies are listed 'Best First'.
Re: index first 24 bytes of every line in file
by kyle (Abbot) on Feb 26, 2008 at 04:14 UTC

    I'm not sure I understand your requirements. You're going to read lines, keeping only the first 24 bytes, but you're only interested in the first three characters?

    You can get byte offsets in a file you're reading using tell. Here's something that will output offsets along with line contents:

        my $offset = tell STDIN;
        while ( my $line = <STDIN> ) {
            print "[$offset] $line";
        }
        continue {
            $offset = tell STDIN;
        }

    Sample output (note that newlines count in the offsets):

        [0] 12345
        [6] 1234567890
        [17] a
        [19] b
        [21] c

    You can pick out the first three characters (or first 24) using substr.

    Once you have your data, you can stuff it into a hash. If there's a lot of it, and you want it to persist, use DBM::Deep. I'm thinking this will be especially useful if your three-letter-codes are keys, and you need to store a list of offsets where they're found.

        my $tlc = substr $line, 0, 3;
        push @{ $offsets_for{$tlc} }, $offset;

    Hope this helps.
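
Because the file is sorted, one offset per three-letter code (the first one seen) is enough to jump to the start of each group. The following is only a minimal sketch of that idea, combining tell, substr and seek as described above. The helper names build_prefix_index and first_line_for are hypothetical, and the in-memory sample data stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: map each 3-character prefix to the byte offset
# of the FIRST line that starts with it (assumes a sorted file).
sub build_prefix_index {
    my ($fh) = @_;
    my %first_offset;
    my $offset = tell $fh;
    while ( my $line = <$fh> ) {
        my $tlc = substr $line, 0, 3;
        # Keep only the first offset seen for each prefix.
        $first_offset{$tlc} = $offset unless exists $first_offset{$tlc};
    }
    continue {
        $offset = tell $fh;
    }
    return \%first_offset;
}

# Hypothetical helper: seek to the first line for a prefix and read it.
# Returns nothing if the prefix is not in the index.
sub first_line_for {
    my ( $fh, $index, $tlc ) = @_;
    return unless exists $index->{$tlc};
    seek $fh, $index->{$tlc}, 0 or die "seek failed: $!";
    my $line = <$fh>;
    return $line;
}

# Demo on an in-memory "file"; a real program would open a path instead.
my $data = "BAX one\nBAX two\nTRN three\nTUX four\n";
open my $fh, '<', \$data or die "open failed: $!";

my $index = build_prefix_index($fh);
print first_line_for( $fh, $index, 'TRN' );
```

After the seek, reading further lines from the same handle until the prefix changes would walk the whole group for that code.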

Re: index first 24 bytes of every line in file
by BrowserUk (Patriarch) on Feb 26, 2008 at 04:09 UTC

    It really depends upon why you are indexing the file, and how you are going to use that index.

    And, just how large are these files? 3 upper case chars == only 17576 unique values.

    And why are you going to keep 24 bytes of each line? If that is enough to satisfy your needs for each line, then why do you need the index? If not, you are going to have to seek the file and read the line to get the rest, so why keep any at all?

    Your description is confusing: "I am only interested in the first 3 characters." and "keeping only the first unique 24 bytes".


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      And why are you going to keep 24 bytes of each line?

      I think he said he was going to save the first 24 bits (3 characters). Which may mean we're not looking at the same version of his post...

        Look again at the title of this post, or your post, or mine to which you replied. Or any of the posts posted before he realised his mistake and corrected both the title and content of the OP.


Re: index first 24 bits of every line in file
by CountZero (Bishop) on Feb 26, 2008 at 06:49 UTC
    If your files are really large and you need fast direct access to each of the lines, why not put them into a database? It looks as if you are trying to re-invent databases! SQLite seems a good candidate for such a problem. Once installed, you can load and access the lines through the usual DBI interface.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James