Re: Code efficiency / algorithm

The way you describe it, it looks like each record in datafile2 could (and often will) match more than one record in datafile1. Do you intend for your results file to show all the "file1" matches for a given "file2" record?

Apart from that, it looks like the nine-digit strings labeled "sid" and "eid" in the file1 (%VAR1) data structure are supposed to be the cue for deciding whether a given record in "file2" is a match, based on its first field. That is, the first line in your "datafile2" example, which starts with "200110100", ought to be a match for the first date range in all three company records from "datafile1". Have I got that right? (The post is a bit confusing, because the Data::Dumper-like output content doesn't match the sample file excerpt.)

If so, then I think my first inclination would be to make the "join" data the outer-most layer of the "file1" data structure, and make it as easy as possible to identify the matches -- something like this (based on the data in your example "file1" excerpt):

$VAR1 =
{
   '200210014 200210105' => 
    [ 
       "ABC Corp. / 1 / some text description",
       "XYZ Ltd. / 1 / some text description",
       "CDC Inc. / 1 / some text description",
    ],
   '200211011 200212053' =>
    [
       "ABC Corp. / 2 / some text description",
       "XYZ Ltd. / 2 / some text description",
    ],
   '200323021 200331234' =>
    [
       "ABC Corp. / 3 / some text description",
    ],

   etc...
}

In other words, file1 fills a hash of arrays, where the hash keys are "start_id end_id" for each date range found in file1; each of these hash elements holds an array of one or more company records, where each record is potentially just a single structured string, holding whatever is relevant for your results file.
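Filling that hash of arrays could look roughly like the sketch below. I don't know the real layout of file1, so the records here are assumed to be already parsed into hypothetical [company, seq, sid, eid, text] tuples, and the names @file1_records and %ranges are mine, not from the original code:

```perl
use strict;
use warnings;

# Hypothetical, already-parsed file1 records: [company, seq, sid, eid, text].
# The real parsing step depends on the actual file1 layout, which is assumed here.
my @file1_records = (
    [ 'ABC Corp.', 1, '200210014', '200210105', 'some text description' ],
    [ 'XYZ Ltd.',  1, '200210014', '200210105', 'some text description' ],
    [ 'ABC Corp.', 2, '200211011', '200212053', 'some text description' ],
);

# "sid eid" => [ "company / seq / text", ... ]
my %ranges;
for my $rec (@file1_records) {
    my ($company, $seq, $sid, $eid, $text) = @$rec;
    # Autovivification creates the array the first time a range key is seen.
    push @{ $ranges{"$sid $eid"} }, "$company / $seq / $text";
}

# Show how many company records ended up under each range key.
printf "%s => %d record(s)\n", $_, scalar @{ $ranges{$_} } for sort keys %ranges;
```

Because push autovivifies the inner array, there is no need to check whether a range key already exists before adding to it.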

With this sort of data structure from file1, you can now read file2 and use the first field of each line to jump directly to the relevant file1 data (untested code, naturally):

while (<FILE2>) 
{
   my ($key2,$data) = split(/,/, $_, 2);

   # use grep to do the "join": keep every "sid eid" key whose range contains $key2
   my @match_keys = grep { my ($sid,$eid) = split(/ /,$_);
                           $key2 >= $sid and $key2 <= $eid } keys %VAR1;

   foreach my $matched_range ( @match_keys ) {
      my @matched_data = @{$VAR1{$matched_range}};
      # do something with @matched_data
   }
}
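Since the question is about efficiency: the loop above re-splits every hash key for every line of file2. One possible variation (a sketch only, with made-up sample data, and with %ranges, @parsed and match_ranges being names I'm introducing for illustration) is to split the keys once up front and reuse the parsed bounds for each lookup:

```perl
use strict;
use warnings;

# Same shape as the hash of arrays described above (sample values only).
my %ranges = (
    '200210014 200210105' => [ 'ABC Corp. / 1 / some text description',
                               'XYZ Ltd. / 1 / some text description' ],
    '200211011 200212053' => [ 'ABC Corp. / 2 / some text description' ],
);

# Split each "sid eid" key once, instead of once per file2 line.
my @parsed = map { my ($sid, $eid) = split / /, $_; [ $sid, $eid, $_ ] }
             sort keys %ranges;

# Return the key of every range that contains $key2 (bounds inclusive).
sub match_ranges {
    my ($key2) = @_;
    return map  { $_->[2] }
           grep { $key2 >= $_->[0] && $key2 <= $_->[1] } @parsed;
}

# Hypothetical file2 lines in the form "number,rest of record".
for my $line ('200210050,first record', '200301001,second record') {
    my ($key2, $data) = split /,/, $line, 2;
    my @hits = match_ranges($key2);
    print "$key2: ",
          (@hits ? join('; ', map { @{ $ranges{$_} } } @hits) : 'no match'),
          "\n";
}
```

This is still a linear scan over all ranges for each file2 line; it only removes the repeated split. If the number of ranges is large, sorting the parsed list and stopping early would be the next thing to try.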


Replies are listed 'Best First'.

Re: Re: Code efficiency / algorithm
by dave8775 (Novice) on Jan 14, 2003 at 06:43 UTC

I posted some clarifications regarding what I am trying to do in my answer to the previous reply. Basically I am trying to match each number (e.g. 200210201) in datafile2 to any of the given ranges in datafile1. If the number in datafile2 is within a range in datafile1, then it is a match. I used the concatenation of the two numbers in datafile1 to establish a unique 'rangeid' to be used as the hash key. In other words, '2 200534011 200577234 some text description' has a hash id of 20052. I then read each line of datafile2 and check through the record set to see if I find a hash key that matches (has the same first 4 digits, i.e. 2005). If so, then I look closer at the sid and eid.

I am just trying to find out better, more efficient ways of doing this. :)

Thanks!

David
