periapt has asked for the wisdom of the Perl Monks concerning the following question:

Recently, I was working on a lookup-type problem. I think I came up with a good solution. It works and is more than fast enough. But it isn't the easiest to maintain, and since this is the first time I've encountered this type of problem, I wonder if I haven't missed something.

I'm looking for some input on the work. How might others have solved this or similar problems? Are there other ways to approach it? Are there more elegant ones?

Anyway, the problem breaks down like this. I have to classify a data record into one of 13 categories based on a specific field value in the record and the event date associated with that field. One constraint is that the cross-reference list changes yearly, and the correct year's list must be applied to each field. Another constraint is that a single data record may require access to cross-reference lists from multiple years; that is, field 1 may require the 2001 list while field 3 requires the 2003 list.

The big catch is that the reported data are usually not exactly correct: the reported value won't exactly match a key in the cross-reference list. Fortunately, most of the categorizations can be done through subsets of the field value. For example, a key value of "032.435.4A1" might be correctly classified through the subset "032.435". I've built these subset groups into the cross-reference list. Typically, a record requires some 30-42 (average 32.4) queries against the list before a sufficient match is found.

After the initial lookup, there is an override lookup that checks the field against a manually constructed override table, which may reclassify the result (about 3 lookups on average).
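In rough pseudo-Perl, the matching rule looks something like this. The hash names and the simple drop-the-last-piece truncation here are placeholders standing in for the real subset groups, not the production code:

    sub classify {
        my ($value, $xref, $override) = @_;   # $xref, $override: hashrefs for one lookup year

        my $category;
        my @parts = split /\./, $value;       # "032.435.4A1" -> ("032", "435", "4A1")
        while (@parts) {
            my $key = join '.', @parts;       # try the longest remaining subset first
            if (exists $xref->{$key}) {
                $category = $xref->{$key};
                last;
            }
            pop @parts;                       # drop the last piece and try again
        }

        # second stage: the manually maintained override table may reclassify
        if (defined $category and exists $override->{$value}) {
            $category = $override->{$value};
        }
        return $category;                     # undef means no match at any subset level
    }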

OK, that's the setup. Here is the solution.

My ultimate solution moved the lookup tables into a series of SDBM files identified by the lookup year (20030701, 20040701, etc.). I put these in a separate subdirectory, and the main program opens and ties the files, putting the tied-hash references into an array. Although the program builds the list of files at startup, each file is opened only the first time its date index is encountered.

The advantages to this scheme, as I see them:
  • I don't have a lot of unnecessary open files around; only some of the lookup years will be referenced in any given run.
  • The lookup is very fast, pretty much on the same order as a straight in-memory hash lookup.
  • It is easy to expand: I just add a new SDBM file to the directory for the next year and off we go.

Some disadvantages to this scheme:
  • It is a bit of a pain to update. The lists are created from a text file, and a change requires a complete regeneration of the SDBM file.
  • There is not much of an audit trail when changes are made (just a comment in the text file).
  • It is not interactive. Sometimes it would be helpful to be able to query the lookup lists manually.


Given all of that:
  • Should I be all that concerned about having 15-20 of these tied files open at one time?
  • Would it be faster/more efficient to open them all at once?
  • Each lookup table is pretty small; would there be any benefit to reading them into memory? (Some early tests indicated that the read-in time was significant, but would the benefit outweigh the cost? See the benchmark sketch after this list.)
  • Is there another way to make these lookups more efficient, SQLite maybe?
  • Any suggestions on how to make maintenance of these cross-reference lists easier?
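To make the in-memory question concrete, here is the sort of quick comparison I have in mind (an untested sketch: the file name is a placeholder, and slurping the whole tied hash assumes the table really is small):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use SDBM_File;
    use Fcntl qw(O_RDONLY);

    my $file = 'statdir/20030701';    # hypothetical SDBM basename (no .pag/.dir extension)
    my %tied;
    tie %tied, 'SDBM_File', $file, O_RDONLY, 0666
        or die "cannot tie $file: $!";

    my %mem   = %tied;                # one-time slurp into a plain in-memory hash
    my @probe = keys %mem;            # sample keys to look up repeatedly

    cmpthese(-3, {
        tied_sdbm => sub { my $hits = grep { exists $tied{$_} } @probe },
        in_memory => sub { my $hits = grep { exists $mem{$_}  } @probe },
    });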


    I initially tried a database, but the process was too slow. The subsetting made indexing difficult, and unindexed table lookups were way too slow. Even with full indexing, ODBC overhead was just too much, even on relatively small runs of 10,000 or so records. A considerable amount of other processing occurs at the same time, and the logistics of the process rule out runs of 10 hours or more. Benchmarking showed that a significant share (~80%) of program time was spent in the lookup part.

    The speed of the algorithm (or the final program) is not really an issue at this point. The use of the SDBM files cut processing time to about 1.2 seconds per record (from 6-13 seconds). Other optimizations brought it down further, so that it is now mostly the database searches and updates that dominate the process.

    Here is some code:

    # These excerpts assume the following modules are loaded elsewhere in the
    # program (along with the ever-present strict/warnings). MINDATE, $cfg,
    # the logging helpers, and the @statute/@statexc/@statfilelst globals are
    # also defined elsewhere.
    use SDBM_File;
    use Fcntl qw(O_RDONLY);
    use File::Basename;

    $rtncd = InitStatVars();

    sub InitStatVars{
        # this sub reads the lookup directory and returns a list of files present
        my $rtncd = 0;
        my @tmp01 = ();

        # get the list of statute hash files:
        # strip off the extension and sort alphanumerically
        @tmp01 = sort map { substr(File::Basename::basename($_),0,-4) }
                 glob($$cfg{basedir}.$$cfg{statdir}."*.pag");

        if(scalar @tmp01 == 0){   # if there are no stat files, return an error
            $rtncd = 2;
            $logstr = "$cnty;$procdte;$fname;$module;1;error;[0003]Unable to initialize statute hash tables".
                      ";InitStatVars(); ; ; ";
            UpdtMetaLog($$cfg{logdir},$logfile,$logstr);
        }else{
            unshift @tmp01, MINDATE;   # recs before this date have no equiv file
        }

        # load statute file names to list, skipping the *_exc exception files
        /_exc$/ || push @statfilelst, $_ foreach (@tmp01);

        for (0..$#statfilelst){        # initialize file handle arrays,
            $statute[$_] = undef;      # one slot for each statute file
            $statexc[$_] = undef;
        }
        return $rtncd;
    } #end InitStatVars();

    $fhidx = SetFileHandleIdx($event_dt);

    sub SetFileHandleIdx{
        # this sub crosses the date index to the correct array index
        my $event_dt = shift;
        my $fhidx = $#statfilelst;     # use most recent file as default

        if(IsDateOK($event_dt)){       # determine which stat file applies
            for (0..$#statfilelst){
                $fhidx = $_ if($event_dt ge $statfilelst[$_]);
            }
        }else{                         # if the date is not valid, use the most current
            $fhidx = $#statfilelst;
        }
        return $fhidx;
    } #end SetFileHandleIdx();

    $rtncd = OpenFileHandle($fhidx) unless(defined($statute[$fhidx]));

    sub OpenFileHandle{
        # this sub loads the tied hash reference
        my $fhidx = shift;
        my $rtncd = 0;

        $statute[$fhidx] = {};
        $rtncd = (tie(%{$statute[$fhidx]}, "SDBM_File",
                      $$cfg{basedir}.$$cfg{statdir}.$statfilelst[$fhidx],
                      O_RDONLY, 0666)) ? 0 : 2;
        if($rtncd == 2){
            $statute[$fhidx] = undef;
            $logstr = "$cnty;$procdte;$fname;$module;1;error;[0201]Could not tie hash files ".
                      "$statfilelst[$fhidx];OpenFileHandle(); ; ; ";
            UpdtMetaLog($$cfg{logdir},$logfile,$logstr);
        }

        unless($rtncd){
            # tie the matching exception (_exc) file for this year, if one exists
            if(-e "$$cfg{basedir}$$cfg{statdir}$statfilelst[$fhidx]"."_exc.pag"){
                $statexc[$fhidx] = {};
                $rtncd = (tie(%{$statexc[$fhidx]}, "SDBM_File",
                              $$cfg{basedir}.$$cfg{statdir}.$statfilelst[$fhidx].'_exc',
                              O_RDONLY, 0666)) ? 0 : 2;
            }
            if($rtncd == 2){
                $statexc[$fhidx] = undef;
                $logstr = "$cnty;$procdte;$fname;$module;1;error;[0201]Could not tie hash files ".
                          "$statfilelst[$fhidx];OpenFileHandle(); ; ; ";
                UpdtMetaLog($$cfg{logdir},$logfile,$logstr);
            }
        }
        return $rtncd;
    } #end OpenFileHandle();

    if(exists($statute[$fhidx]->{$testval})){
        # ... do things ...
    }
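    For what it's worth, here is roughly how those pieces fit together for a single record. NextSubsets() is just a stand-in for however the subset keys get generated, and I'm assuming the *_exc file plays the override role described above:

    # simplified per-record flow, not the actual driver code
    my $fhidx = SetFileHandleIdx($event_dt);
    OpenFileHandle($fhidx) unless defined $statute[$fhidx];

    my $category;
    for my $testval (NextSubsets($field_val)) {   # e.g. "032.435.4A1", "032.435", "032"
        if (exists $statute[$fhidx]->{$testval}) {
            $category = $statute[$fhidx]->{$testval};
            last;
        }
    }
    if (defined $category and defined $statexc[$fhidx]
            and exists $statexc[$fhidx]->{$field_val}) {
        $category = $statexc[$fhidx]->{$field_val};   # exception/override table wins
    }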


    As I said, I am interested in what others might have done to solve this problem. I run ActiveState Perl 5.6.1 on a Win2K system. I look forward to your comments.


    PJ
    unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)

    Replies are listed 'Best First'.
    Re: Dynamic Lookups on SDBM Files
    by jZed (Prior) on Aug 03, 2004 at 18:45 UTC
      You might try using DBD::DBM which comes in recent versions of DBI. It should work directly on your SDBM files (or other DBM files with or without MLDBM) and provide DBI/SQL access to them for greater ease of maintainability. If you go this route, I'd be interested to hear how it benchmarks against other methods you've tried.
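      Untested, but something along these lines. The directory and table name here are made up, SDBM_File should be DBD::DBM's default dbm_type, and the DBD::DBM docs cover the default column names if you need them in a WHERE clause:

        use strict;
        use warnings;
        use DBI;

        # f_dir points at the directory holding the .pag/.dir files
        my $dbh = DBI->connect('dbi:DBM:f_dir=./statdir', undef, undef,
                               { RaiseError => 1 });

        # each DBM file should show up as a table named after the file; the name
        # is quoted because it starts with a digit (a renamed copy would sidestep
        # that if SQL::Statement objects)
        my $sth = $dbh->prepare('SELECT * FROM "20030701"');
        $sth->execute;
        while (my @row = $sth->fetchrow_array) {
            print join("\t", @row), "\n";    # dump key/value pairs for ad-hoc inspection
        }
        $dbh->disconnect;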
        Hmmm, this might be helpful for those after-the-fact queries on the lookup hashes. I'll take a look at it. Thanks.

        PJ
        unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)
    Re: Dynamic Lookups on SDBM Files
    by waswas-fng (Curate) on Aug 03, 2004 at 22:00 UTC
      It is not really apparent in your post what the actual data and lookups look like, but would it make more sense to dump this data into a database like MySQL or Postgres and then define the lookups as SQL? That way (depending, of course, on what the actual lookup needs are) you could have one SQL statement for each type of lookup. I am thinking that something like your yearly changing cross-reference list might fit easily into a SQL join or subselect. Could you post an example of the dataset and the types of lookups you are doing (with expected results from the example set)? That should give us a better understanding of the actual problem and get you better output from us.
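      Purely as illustration (every table and column name here is invented, and the year matching is simplified), the kind of join/subselect I mean might look like:

        use strict;
        use warnings;
        use DBI;

        # invented schema:
        #   xref(list_year, key_prefix, category)  -- one row per subset key per year
        #   records(rec_id, event_dt, field_val)   -- the incoming data
        my $dbh = DBI->connect('dbi:Pg:dbname=lookups', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 1 });

        # for each record, take the category of the longest matching subset key
        # from that year's list; matching on calendar year keeps the sketch short,
        # while the real lists roll over mid-year (20030701, 20040701, ...)
        my $sql = q{
            SELECT r.rec_id,
                   (SELECT x.category
                      FROM xref x
                     WHERE x.list_year = EXTRACT(YEAR FROM r.event_dt)
                       AND r.field_val LIKE x.key_prefix || '%'
                     ORDER BY LENGTH(x.key_prefix) DESC
                     LIMIT 1) AS category
              FROM records r
        };
        my $rows = $dbh->selectall_arrayref($sql);
        print "$_->[0] => " . (defined $_->[1] ? $_->[1] : 'unmatched') . "\n" for @$rows;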


      -Waswas
        Good point. At this stage of the process, the data is being read from a text file. It is ultimately put into a database, but not right away. In the database itself, the fields to be cross-referenced are not in a form suitable for a join or subselect. In fact, that was one of the problems with the original system: the categorization performed after the database was loaded was unacceptably slow (due to the need to match subsets) and somewhat error-prone (due to poor system design).

        The data in the database is part of a legacy system, and it is not practical to alter its format, as too much else depends on it. In terms of operations, I found it more efficient to perform the categorization before loading the data into our database. This allows me to transform the data from the input file into a form that lends itself to the categorization scheme, and then transform the result into the legacy format with the categorizations already in place. That reduces the load operation to a simple insert/update rather than an insert/update followed by modification.

        PJ
        unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)