Searching and sorting a large DBF file

tyndyll has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

At the minute I'm trying to assemble a script that will take a DBF file and resort it according to the values in one column, then adding a counter to count those values (e.g. 1_3, 2_3, 3_3). This is my first "real" Perl programming experience so any help would be appreciated.

In this file there are a number of addresses, and these addresses are assigned a Mailsort code for posting. What I wish to do is to organise these codes into the least number first (i.e. where COUNT(CODE) is lowest). This is in a DBF file and there are about 25000 records in the file.

What I have so far is inputing the file, followed by extracting and counting the Mailsrot codes (sub extMSCode) and returning this in a hash. My problem now revolves around extracting the data out of the original file and inserting it into a new file in order. I am using XBase and DBD:Xbase (or maybe they're the same, I have no idea any more :( ) for this. The main problem is that it is INCREDIBLY slow (running for about 5mins yields about 60 - 70 addresses in the new file). I can dump the data out quicker using only the XBase module to write a count and then sorting in Excel.

Is this perhaps the best way to go or am I missing something. Is it slow mainly because its a large file and not a relational database? Any information would be appreciated

Thanks

Tyndyll

 
#!/usr/bin/perl -w 
use strict; 
use XBase; 
use DBI; 

#function prototypes 
sub extMSCode($); 



print "Opening Original file...\n"; 
my $dbf = new XBase "input.dbf" or die XBase->errstr; 
print "Creating New Table...\n"; 
my $new_dbf = XBase->create("name" => "output.dbf", 
                            "field_names" => [$dbf->field_names, "COUN
+TED","SEQUENCE"], 
                            "field_types" => [$dbf->field_types, "C","
+N"], 
                            "field_lengths" => [$dbf->field_lengths,20
+,10], 
                            "field_decimals" => [$dbf->field_decimals,
+undef,0] ) or die $dbf->errstr; 

my $dbh = DBI->connect("DBI:XBase:") or die $DBI::errstr; 

print "Extracting Mailsort Codes...\n"; 
my %ms_code = extMSCode($dbh); 

print "Writing data...\n\n"; 

my ($select_sql, $insert_sql); 
my $loop; 
my @fields = $dbf->field_names; 
foreach my $key (sort{$ms_code{$a}<=>$ms_code{$b}} keys(%ms_code)){ 
    #Sort by Count 
    $select_sql = "select * from input.dbf where MS_SELCODE = $key ORD
+ER BY MS_SELCODE"; 
    $insert_sql = qq{INSERT INTO output.dbf (".join(",",@fields).") VA
+LUES ('?)}; 
    my $sth_select = $dbh->prepare($select_sql); 
    my $sth_insert = $dbh->prepare($insert_sql); 
    $sth_select->execute(); 
    while(my @data = $sth_select->fetchrow_array()){ 
        my $temp = join("\',\'", @data); 
        $sth_insert->bind_param(1,join("\',\'", @data)); 
        $sth_insert->execute(); 
        print $loop++." \n"; 
    } 
    
} 
$dbh->disconnect; 
# Function Declarations and Code 

sub extMSCode($){ 
    # Extracts Mailsort Codes and returns in a hash 
    # with a count per code 
    
    my $code_index = $dbh->prepare("select MS_SELCODE from input.dbf o
+rder by MS_SELCODE") or die $dbh->errstr(); 
    $code_index->execute() or die $code_index->errstr(); 
    
    my %ms_code = (); 
    while(my @data = $code_index->fetchrow_array()){ 
        if(defined($ms_code{$data["MS_SELCODE"]})){ 
            $ms_code{$data["MS_SELCODE"]}++; 
        }else{ 
            $ms_code{$data["MS_SELCODE"]} = 1; 
        } 
    } 
    return %ms_code; 
}
[download]

Comment on Searching and sorting a large DBF file Download Code

Replies are listed 'Best First'.
Re: Searching and sorting a large DBF file by bart (Canon) on May 13, 2004 at 18:26 UTC
First of all, do you have an index for your table? If not, that might explain why looking up a single record takes a lot of time. But anyway, a mere 25000 records is simple to hold into memory in Perl all at once. So my advice is: `select` all data from the table, and read it all into a Perl data structure, for example, HoH, where the keys are your key field, and the values are the records, as hashes. That will not take more than a few seconds. Next, go through this structure one at a time, and insert the data into the output database. I've noticed that outputting all data to a text file, like CSV, may be quite a bit swifter, which you can then import into your dB using an import tool.	[reply]
Re: Searching and sorting a large DBF file by tyndyll (Novice) on May 13, 2004 at 18:55 UTC
An index? I'm not sure what you mean. One thing I have thougth of is as well as doing a count of the sort codes building a hash structure with the row number and the sort code - is this what you mean? Out of curiosity what is "too large" a hash? (coming from a web background where i'm used to trying to make everything as small as possible) Thanks for your quick replay too tyndyll	[reply]
Re: Re: Searching and sorting a large DBF file by exussum0 (Vicar) on May 13, 2004 at 20:03 UTC
An index is a database object used for doing fast lookups on particular pieces of data. i.e. `select column1,colum2 from table1 where column1 = 'bob'` [download] would be slow if there were a million records. But, if there was a structure that was faster to do a lookup on, an index, existed for this table, lookups would be faster. Indexes are seperate from tables as they are optimized for specific pieces of data. To use an index that you create, look up your db manual, just use it in the where clause somehow. `select col1,col2 from table1,table2 where table1.col1='bob' and table2.col2=table1.col2` [download] This would join fast if there was an index on col2, and would also be fast on finding 'bob' if there was an index on col1. Further more, indexes don't work well when you transform data. Doing ... `select * from table where col1*2 = 50` [download] would never use an index, as you are comparing 50 to mutated data. If you did co1=50, it would be fine. There's an index on the values of the table, not the values that you mutate like above. Humma la beeba la seeba la booba la Humma la beeba la seeba la bop	[reply] [d/l] [select]