Re^4: index first 24 bytes of every line in file

Sorry for my error (24 bits not bytes). Here is some additional information. The files I'm working with aren't that large (about 2MB). Down the road they will be up to 500MB. I know I could put the data into a table and just use a sql query (with appropriate indexes). I'd like to explore using other techniques which don't use a database. Basically, I'm look for a programatic alternative to:

open INPUT, "$ENV{HOME}/flat_file"
   or die $!;

my %ports;
while (my $line = <INPUT>) {
    chomp $line;
    my ($code, $city) = split /\|/, $line; 
    $ports{$code} = $city
       if not $ports{$code};
}

for my $key (keys %ports) {
    if ($ARGV[1] =~ /$key/) {
        print $ports{$key}, "\n";
    }
}

close INPUT;
[download]

Or the same using store/retrieve to avoid the repeated parsing:

use Storable qw(retrieve);

my $ports = retrieve("$ENV{HOME}/flat_file.dat");
[download]

Last night I wrote a script which creates an index of the first three bytes, then saves it using Storable. The following program uses the index to print out the byte offset of ONLY AN EXACT MATCH:

#!/usr/bin/perl

use strict;
use Getopt::Std;

my %parms;
getopts ("c:p:", \%parms);

die "Please supply a port code or city"
    if not $parms{p} and not $parms{c};

use Storable qw(retrieve);

if ($parms{p}) {
    $parms{p} = uc $parms{p};
    my $ports_by_code = retrieve("$ENV{HOME}/flat_file_by_port");
    print $ports_by_code->{$parms{p}}, "\n";
}
if ($parms{c}) {
    $parms{c} = uc $parms{c};
    my $ports_by_city = retrieve("$ENV{HOME}/flat_file_by_city");
    print $ports_by_city->{$parms{c}}, "\n";
}
[download]

I'd like to be able to handle more conditions. Say the user enters only the letter M. I would like to print out the first 25 matches that start with M. Or if they enter MV, the first 25 matches that start with MV. The max input lenght is 3 characters. My assumption (probably erroneous) is that I need three byte offset indexes to handle my requirements. Is there another programatic approach, besides DBI, or iterating over the entire list of keys and doing a contains/starts with search? I'm just looking for ideas. Any input is appreciated.

Comment on Re^4: index first 24 bytes of every line in file Select or Download Code

Replies are listed 'Best First'.
Re^5: index first 24 bytes of every line in file by BrowserUk (Patriarch) on Feb 27, 2008 at 02:18 UTC
My assumption (probably erroneous) is that I need three byte offset indexes to handle my requirements. I don't get what your getting here. In a 500 MB file, you will require 29-bits (229 == 512 MB) offsets to index the lines in the file, regardless of the length of the keys. Ie. You will need a full 32-bit integer to store those offsets. If your files are ever likely to get bigger than 4GB, you would need to move to using doubles--so 8 bytes per offset. The number of characters you select from the start of your lines to form the key will not affect the offset sizes. Now the keys. You say you only need 3 characters. And in all your examples, you use only upper case letters. As I pointed out above, there are only 17,576 combinations, so unless your lines are very, very long, you must be expecting duplicate lines with the same first 3 characters? If that is the case, and the file is sorted as you said above, then you only need build an index of the up to 17,576 starting points for groups of records that share the same first characters as you would have no way to select more closely than that. Now the requirements.** You say you'd like to return the first 25 matches of a partial input. Then in a 500MB file, assuming say 50-char lines (as suggested above) then you will have circa. 600 lines starting with 'AAA'. Say the input was 'AA'. The first (sorted ordered) 25 entries will (on average) all start with 'AAA'. Say the input was just 'A'. Again, the first 25 will all start with 'AAA'. Now perhaps what you mean is that if the input is just 'A', you want to display the first one that starts 'AA', and the first that starts 'AB' ... the first that starts 'AX'? And if the input is 'AA', then the first that starts 'AAA', the first that starts 'AAB', ... the first that starts 'AAX'? (Assuming there are actually lines in the file matching those.) If so, then you needn't index any more lines, because you can achieve that with just the 17,576 entry index and a little code: `my $input = ...; if( length( $in ) == 3 ) { seek DATA, $index{ $in }, 0; print scalar <DATA> for 1 .. 25 } elsif( length( $in ) == 2 ) { for( 'A' .. 'X' ) { seek DATA, $index{ $in . $_ }, 0; print scalar <DATA>; } } elsif( length( $in ) == 1 ) { for( 'A' .. 'X' ) { seek DATA, $index{ $in . $_ . 'A' }, 0; print scalar <DATA>; } }` [download] I'm not at all sure any of this is helpful because it is still very unclear what you are trying to achieve. Maybe a few sample lines of the data file with a couple of examples of inputs and outputs would clarify. Finally, an explanation of why you are avoiding using a DB? If there is any concurrency involved, eg. a web app that maybe used by more than one user at a time; or if the data file is relatively unchanging, then a DB is far simpler. Maybe not the quickest, but certainly the easiest. Once again, without a clearer idea of what you are trying to achieve, it's hard to be more specific. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]

Replies are listed 'Best First'.

Re^5: index first 24 bytes of every line in file
by BrowserUk (Patriarch) on Feb 27, 2008 at 02:18 UTC

My assumption (probably erroneous) is that I need three byte offset indexes to handle my requirements.

I don't get what your getting here. In a 500 MB file, you will require 29-bits (2**29 == 512 MB) offsets to index the lines in the file, regardless of the length of the keys. Ie. You will need a full 32-bit integer to store those offsets. If your files are ever likely to get bigger than 4GB, you would need to move to using doubles--so 8 bytes per offset. The number of characters you select from the start of your lines to form the key will not affect the offset sizes.

Now the keys. You say you only need 3 characters. And in all your examples, you use only upper case letters. As I pointed out above, there are only 17,576 combinations, so unless your lines are very, very long, you must be expecting duplicate lines with the same first 3 characters?

If that is the case, and the file is sorted as you said above, then you only need build an index of the up to 17,576 starting points for groups of records that share the same first characters as you would have no way to select more closely than that.

Now the requirements. You say you'd like to return the first 25 matches of a partial input. Then in a 500MB file, assuming say 50-char lines (as suggested above) then you will have circa. 600 lines starting with 'AAA'.

Say the input was 'AA'. The first (sorted ordered) 25 entries will (on average) all start with 'AAA'.

Say the input was just 'A'. Again, the first 25 will all start with 'AAA'.

Now perhaps what you mean is that if the input is just 'A', you want to display the first one that starts 'AA', and the first that starts 'AB' ... the first that starts 'AX'?

And if the input is 'AA', then the first that starts 'AAA', the first that starts 'AAB', ... the first that starts 'AAX'? (Assuming there are actually lines in the file matching those.)

If so, then you needn't index any more lines, because you can achieve that with just the 17,576 entry index and a little code:

    my $input = ...;
    if( length( $in ) == 3 ) {
        seek DATA, $index{ $in }, 0;
        print scalar <DATA> for 1 .. 25
    }
    elsif( length( $in ) == 2 ) {
        for( 'A' .. 'X' ) {
            seek DATA, $index{ $in . $_ }, 0;
            print scalar <DATA>;
        }
    }
    elsif( length( $in ) == 1 ) {
        for( 'A' .. 'X' ) {
            seek DATA, $index{ $in . $_ . 'A' }, 0;
            print scalar <DATA>;
        }
    }
[download]

I'm not at all sure any of this is helpful because it is still very unclear what you are trying to achieve. Maybe a few sample lines of the data file with a couple of examples of inputs and outputs would clarify.

Finally, an explanation of why you are avoiding using a DB? If there is any concurrency involved, eg. a web app that maybe used by more than one user at a time; or if the data file is relatively unchanging, then a DB is far simpler. Maybe not the quickest, but certainly the easiest.

Once again, without a clearer idea of what you are trying to achieve, it's hard to be more specific.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

[reply]
[d/l]