Recently, I was working on a lookup type problem. I think I came up with a good solution: it works and is more than fast enough. But it isn't the easiest thing to maintain, and since this was the first time I've encountered this type of problem, I wonder whether I've missed something.
I'm looking for some input on the work. How might others have solved this or similar problems? Are there other ways to approach it? More elegant ways?
Anyway, the problem breaks down like this. I have to classify a data record into one of 13 categories based on a specific field value in the record and the event date associated with that field. One constraint is that the cross reference list changes yearly, and the correct year's list must be applied to the field. Another constraint is that a single data record may require access to cross reference lists from multiple years; that is, field 1 might require the 2001 list while field 3 requires the 2003 list.
The big catch is that the data reported are usually not exactly correct: the value reported won't exactly match a key in the cross reference list. Fortunately, most of the categorizations can be done through subsets of the field value. For example, a key value of "032.435.4A1" might be correctly classified through the subset "032.435". I've built these subset groups into the cross reference list. Typically, a record requires some 30-42 (average 32.4) queries against this list before a sufficient match is found.
After the initial lookup, there is an override lookup that checks the field against a manually constructed override table that may reclassify the result; this averages about 3 more lookups.
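To make that concrete, here is a minimal sketch of the matching step. The classify() name and the dot-delimited trimming rule are illustrative only, not my actual code; the real subset groups are more involved, which is why the query counts run so much higher than a couple of trims would suggest.

    sub classify {
        my ($xref, $override, $key) = @_;   # cross-ref hash, override hash, field value
        my $cat;
        # try the full key first, then progressively shorter subsets:
        # "032.435.4A1" -> "032.435" -> "032"
        my @parts = split /\./, $key;
        while (@parts && !defined $cat) {
            my $try = join '.', @parts;
            $cat = $xref->{$try} if exists $xref->{$try};
            pop @parts;
        }
        # the manually constructed override table may reclassify the result
        $cat = $override->{$key} if exists $override->{$key};
        return $cat;    # undef if nothing matched
    }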
OK, that's the setup. Here is the solution.
My ultimate solution moved the lookup tables into a series of SDBM files identified by the lookup year (20030701, 20040701, etc.). I put these in a separate subdirectory; the main program opens and ties the files, putting each tie reference into an array. Although the program builds the list of files at startup, a given file is opened only the first time its date index occurs.
The advantages to this scheme, as I see it:
- I don't have a lot of unnecessary open files around; only some of the lookup years will be referenced in any given run.
- The lookup is very fast, pretty much on the same order as a straight in-memory hash lookup.
- It is easy to expand: I just add a new SDBM file to the directory for the next year and off we go.
Some disadvantages to this scheme:
- It is a bit of a pain to update. The lists are created from a text file, and any change requires a complete regeneration of the SDBM file (a minimal regeneration sketch follows this list).
- There is not much of an audit trail when changes are made (just a comment in the text file).
- It is not interactive. Sometimes it would be helpful to be able to manually query the lookup lists.
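For what it's worth, the regeneration step amounts to something like this. This is a minimal sketch; the one-pair-per-line tab-separated text format and the file names are assumptions, not my actual layout.

    use Fcntl qw(O_RDWR O_CREAT);
    use SDBM_File;

    my ($txt, $dbm) = ('20040701.txt', '20040701');   # hypothetical names
    tie my %stat, 'SDBM_File', $dbm, O_RDWR | O_CREAT, 0666
        or die "Cannot create $dbm: $!";
    %stat = ();                                       # full regeneration
    open my $in, '<', $txt or die "Cannot open $txt: $!";
    while (<$in>) {
        next if /^\s*#/ || /^\s*$/;                   # skip comments/blanks
        chomp;
        my ($key, $cat) = split /\t/;
        $stat{$key} = $cat;
    }
    close $in;
    untie %stat;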
Given all of that:
- Should I be all that concerned with having 15-20 of these tied files open at one time?
- Would it be faster/more efficient to open them all at once?
- Each lookup table is pretty small; would there be any benefit to reading them into memory? Some early tests indicated that the read-in time was significant, but would the benefit outweigh the cost? (A sketch of what I mean follows the questions.)
- Is there another way to make these lookups more efficient, SQLite maybe?
- Any suggestions on how to make maintenance of these cross reference lists easier?
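By "reading them into memory" I mean little more than copying the tied hash into a plain one, along these lines (hypothetical and untested):

    # copy a tied SDBM hash into plain memory; this pays the whole read
    # cost up front, so it only wins if the file is hit often in a run
    my %mem = %{ $statute[$fhidx] };     # slurp every key/value pair
    untie %{ $statute[$fhidx] };         # release the tie once copied
    $statute[$fhidx] = \%mem;            # later lookups hit memory only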
I initially tried a database, but the process was too slow. The subsetting made indexing difficult, and unindexed table lookups were way too slow. Even with full indexing, ODBC overhead was just too much, even on relatively small runs of 10,000 or so records. A considerable amount of other processing occurs at the same time, and the logistics of the process rule out runs of 10 hours or more. Benchmarking showed that a significant ~80% of program time was spent in the lookup part.
The speed of the algorithm (or the final program) is not really an issue at this point. The use of the SDBM files cut processing time to about 1.2 seconds per record (from 6-13 seconds). Other optimizations brought it down further, so that now it is mostly the database searches and updates that dominate the process.
Here is some code:
    # modules needed by the code below
    use File::Basename;
    use SDBM_File;
    use Fcntl qw(O_RDONLY);

    $rtncd = InitStatVars();

    sub InitStatVars{
        # read the lookup directory and build the list of statute files present
        my $rtncd = 0;

        # get the list of statute hash files;
        # strip off the extension and sort alphanumerically
        my @tmp01 = sort map { substr(File::Basename::basename($_), 0, -4) }
                         glob($$cfg{basedir}.$$cfg{statdir}."*.pag");
        if(scalar @tmp01 == 0){
            # if there are no stat files, log and return an error
            $rtncd = 2;
            $logstr = "$cnty;$procdte;$fname;$module;1;error;[0003]Unable".
                      " to initialize statute hash tables;InitStatVars(); ; ; ";
            UpdtMetaLog($$cfg{logdir}, $logfile, $logstr);
        }else{
            unshift @tmp01, MINDATE;   # recs before this date have no equiv file
        }
        # load statute file names to the list, skipping the _exc override files
        push @statfilelst, grep { !/_exc$/ } @tmp01;
        for (0..$#statfilelst){
            # initialize the tied-hash reference arrays, one slot per statute file
            $statute[$_] = undef;
            $statexc[$_] = undef;
        }
        return $rtncd;
    } #end InitStatVars();
    $fhidx = SetFileHandleIdx($event_dt);

    sub SetFileHandleIdx{
        # map an event date to the array index of the correct statute file
        my $event_dt = shift;
        my $fhidx = $#statfilelst;   # use the most recent file as the default
        if(IsDateOK($event_dt)){
            # determine which stat file covers this date: the last file
            # whose start date is on or before the event date wins
            for (0..$#statfilelst){
                $fhidx = $_ if($event_dt ge $statfilelst[$_]);
            }
        }
        # if the date is not valid, the default (most recent) stands
        return $fhidx;
    } #end SetFileHandleIdx();
    $rtncd = OpenFileHandle($fhidx) unless(defined($statute[$fhidx]));

    sub OpenFileHandle{
        # tie the SDBM file for this index and store the hash reference
        my $fhidx = shift;
        my $rtncd = 0;

        $statute[$fhidx] = {};
        $rtncd = (tie(%{$statute[$fhidx]},
                      "SDBM_File",
                      $$cfg{basedir}.$$cfg{statdir}.$statfilelst[$fhidx],
                      O_RDONLY,
                      0666)) ? 0 : 2;
        if($rtncd == 2){
            $statute[$fhidx] = undef;
            $logstr = "$cnty;$procdte;$fname;$module;1;error;[0201]Could not".
                      " tie hash files $statfilelst[$fhidx];OpenFileHandle(); ; ; ";
            UpdtMetaLog($$cfg{logdir}, $logfile, $logstr);
        }
        unless($rtncd){
            # tie the matching _exc (override) file if one exists
            if(-e "$$cfg{basedir}$$cfg{statdir}$statfilelst[$fhidx]"."_exc.pag"){
                $statexc[$fhidx] = {};
                $rtncd = (tie(%{$statexc[$fhidx]},
                              "SDBM_File",
                              $$cfg{basedir}.$$cfg{statdir}.$statfilelst[$fhidx].'_exc',
                              O_RDONLY,
                              0666)) ? 0 : 2;
            }
            if($rtncd == 2){
                $statexc[$fhidx] = undef;
                $logstr = "$cnty;$procdte;$fname;$module;1;error;[0201]Could not".
                          " tie hash files $statfilelst[$fhidx];OpenFileHandle(); ; ; ";
                UpdtMetaLog($$cfg{logdir}, $logfile, $logstr);
            }
        }
        return $rtncd;
    } #end OpenFileHandle();
if(exists($statute[$fhidx]->{$testval})) { do things ...}
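Pulling the pieces together, the per-record flow looks roughly like this. classify() is the hypothetical subset loop sketched earlier, and I'm assuming here that the _exc file plays the role of the override table.

    my $fhidx = SetFileHandleIdx($event_dt);             # pick the year's file
    $rtncd = OpenFileHandle($fhidx) unless defined $statute[$fhidx];
    unless ($rtncd) {
        my $cat = classify($statute[$fhidx], $statexc[$fhidx] || {}, $fieldval);
        # ... act on $cat, falling back to a default category if undef
    }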
As I said, I am interested in what others might have done to solve this problem. I run ActiveState Perl 5.6.1 on a Win2K system. I look forward to your comments.