in reply to index first 24 bits of every line in file

It really depends upon why you are indexing the file.

And how you are going to use that index.

And, just how large are these files? 3 upper case chars == only 17576 unique values.
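As a quick check of that figure, here is a minimal sketch that counts the possible keys for prefixes of one, two and three letters. It assumes the codes use only the 26 upper-case letters A-Z, as the figure above implies; the `%count` hash is just an illustrative name.

```perl
use strict;
use warnings;

# Number of distinct keys for prefixes of 1, 2 and 3 upper-case letters,
# assuming only A-Z can appear (26 choices per position).
my %count = map { $_ => 26 ** $_ } 1 .. 3;

printf "%d letter(s): %d possible values\n", $_, $count{$_} for 1 .. 3;
```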

And why are you going to keep 24 bytes of each line? If that is enough to satisfy your needs for each line, then why do you need the index? If not, you are going to have to seek the file and read the line to get the rest, so why keep any at all?

Your description is confusing: "I am only interested in the first 3 characters." and "keeping only the first unique 24 bytes" do not describe the same thing.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: index first 24 bytes of every line in file
by apl (Monsignor) on Feb 26, 2008 at 10:54 UTC
    And why are you going to keep 24 bytes of each line?

    I think he said he was going to save the first 24 bits (3 characters). Which may mean we're not looking at the same version of his post...

      Look again at the title of this post, or your post, or mine to which you replied. Or any of the posts posted before he realised his mistake and corrected both the title and content of the OP.


        Sorry for my error (24 bits, not bytes). Here is some additional information. The files I'm working with aren't that large (about 2MB). Down the road they will be up to 500MB. I know I could put the data into a table and just use a SQL query (with appropriate indexes). I'd like to explore using other techniques which don't use a database. Basically, I'm looking for a programmatic alternative to:
        open INPUT, "$ENV{HOME}/flat_file" or die $!;
        my %ports;
        while (my $line = <INPUT>) {
            chomp $line;
            my ($code, $city) = split /\|/, $line;
            $ports{$code} = $city if not $ports{$code};
        }
        for my $key (keys %ports) {
            if ($ARGV[1] =~ /$key/) {
                print $ports{$key}, "\n";
            }
        }
        close INPUT;
        Or the same using store/retrieve to avoid the repeated parsing:
        use Storable qw(retrieve);
        my $ports = retrieve("$ENV{HOME}/flat_file.dat");
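Only the retrieve half is shown above. For completeness, here is a minimal sketch of the full round trip with Storable's `store` and `retrieve`. The sample codes and cities are made up for illustration, and a temporary directory stands in for `$ENV{HOME}` so that running the sketch does not touch real files.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical sample data standing in for the parsed flat file.
my %ports = (
    MIA => 'MIAMI',
    MVD => 'MONTEVIDEO',
);

# Assumption: a temporary directory replaces $ENV{HOME} for this sketch.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/flat_file.dat";

store(\%ports, $file);          # parse once, write the hash to disk
my $ports = retrieve($file);    # later runs read it back as a hash ref

print "$_ => $ports->{$_}\n" for sort keys %$ports;
```

The saving comes from skipping the split-and-assign loop on every run; the whole hash still has to be read into memory each time.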
        Last night I wrote a script which creates an index of the first three bytes, then saves it using Storable. The following program uses the index to print out the byte offset of ONLY AN EXACT MATCH:
        #!/usr/bin/perl
        use strict;
        use Getopt::Std;
        my %parms;
        getopts ("c:p:", \%parms);
        die "Please supply a port code or city" if not $parms{p} and not $parms{c};
        use Storable qw(retrieve);
        if ($parms{p}) {
            $parms{p} = uc $parms{p};
            my $ports_by_code = retrieve("$ENV{HOME}/flat_file_by_port");
            print $ports_by_code->{$parms{p}}, "\n";
        }
        if ($parms{c}) {
            $parms{c} = uc $parms{c};
            my $ports_by_city = retrieve("$ENV{HOME}/flat_file_by_city");
            print $ports_by_city->{$parms{c}}, "\n";
        }
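The script that builds the index is not shown in the post. The following is only a guess at what such a builder might look like, assuming pipe-delimited `CODE|CITY` lines and an index that maps the first three bytes of each line to that line's byte offset. The sample rows and the temporary directory are hypothetical; only the `flat_file` and `flat_file_by_port` names come from the code above.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical pipe-delimited sample file: CODE|CITY per line.
my $dir  = tempdir(CLEANUP => 1);
my $flat = "$dir/flat_file";
open my $out, '>', $flat or die "Cannot write $flat: $!";
print {$out} "MIA|MIAMI\n", "MVD|MONTEVIDEO\n", "NYC|NEW YORK\n";
close $out or die "Cannot close $flat: $!";

# Build the index: first three bytes of each line => byte offset of the line.
my %offset_by_code;
open my $in, '<', $flat or die "Cannot read $flat: $!";
while (1) {
    my $pos  = tell $in;        # offset of the line about to be read
    my $line = <$in>;
    last unless defined $line;
    my $code = substr $line, 0, 3;
    # Keep only the first occurrence of each code.
    $offset_by_code{$code} = $pos unless exists $offset_by_code{$code};
}
close $in;

store(\%offset_by_code, "$dir/flat_file_by_port");

# Using the index: seek straight to the line for an exact match.
my $index = retrieve("$dir/flat_file_by_port");
open $in, '<', $flat or die "Cannot read $flat: $!";
seek $in, $index->{MVD}, 0;
my $hit = <$in>;
close $in;
print $hit;
```

Storing offsets rather than the city names keeps the index small, at the cost of one seek and one read per lookup.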
        I'd like to be able to handle more conditions. Say the user enters only the letter M. I would like to print out the first 25 matches that start with M. Or if they enter MV, the first 25 matches that start with MV. The max input length is 3 characters. My assumption (probably erroneous) is that I need three byte-offset indexes to handle my requirements. Is there another programmatic approach, besides DBI, or iterating over the entire list of keys and doing a contains/starts-with search? I'm just looking for ideas. Any input is appreciated.
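One idea that might avoid three separate indexes, offered only as a sketch: keep the keys in a single sorted array and binary-search for the first key that is not less than the prefix. Every key sharing that prefix then sits in one contiguous run, so the same array serves inputs of 1, 2 or 3 characters. The sample codes and the `lower_bound` and `prefix_matches` helpers are hypothetical names, not anything from the scripts above.

```perl
use strict;
use warnings;

# Hypothetical sample codes; in practice these would be the index keys.
my @codes = sort qw(MIA MVD MVX NYC MAD LAX MVA);

# Position of the first element that is not less than $prefix.
sub lower_bound {
    my ($sorted, $prefix) = @_;
    my ($lo, $hi) = (0, scalar @$sorted);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($sorted->[$mid] lt $prefix) { $lo = $mid + 1 }
        else                            { $hi = $mid }
    }
    return $lo;
}

# Collect up to $limit codes that start with $prefix.
sub prefix_matches {
    my ($sorted, $prefix, $limit) = @_;
    $prefix = uc $prefix;
    my @found;
    for (my $i = lower_bound($sorted, $prefix);
         $i < @$sorted && @found < $limit;
         $i++) {
        last if index($sorted->[$i], $prefix) != 0;   # left the prefix run
        push @found, $sorted->[$i];
    }
    return @found;
}

print join(',', prefix_matches(\@codes, 'MV', 25)), "\n";
print join(',', prefix_matches(\@codes, 'M',  25)), "\n";
```

With at most 17576 three-letter keys the sorted array stays small, and each matching key can then be looked up in the existing offset hash to fetch its line.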