I have seen lots of references, both in books and on Perl Monks, to the fact that hashes are slow. I know hashes are slow, but in my opinion they provide an easier way to access complex data structures. I have seen code that uses constants with arrays, which helps make the code a little easier to read, but it isn't flexible enough for me. So I like to use hashes, and I try to write code that makes them as fast as possible when used in the same context as their array counterparts. (I hope :)

Slow Hash

    use strict;
    use warnings;
    use DBI;
    use Benchmark;

    my $dsn = "dbi:mysql:database";
    my $dbh = DBI->connect($dsn, 'user', 'password', { RaiseError => 1 });

    # selectall_hashref needs the key field in the select list,
    # so table_id is included alongside the data fields.
    my $statement = qq~SELECT table_id, field1, field2, field3 FROM table~;
    my $key_field = 'table_id';

    my ($ary_ref, $hash_ref);
    timethese( 1_000, {
        array_way => sub { $ary_ref  = $dbh->selectall_arrayref($statement) },
        hash_way  => sub { $hash_ref = $dbh->selectall_hashref($statement, $key_field) },
    });
For my test data source the array_way was twice as fast, which makes it the best choice if we are going to execute this code over and over again with drastically different SQL statements.
Array usage *might* lead to maintenance issues in some cases. For example, suppose we are building a dynamic data structure and want to remove one column of results for a particular type of viewer. We would rather not change the SQL statement, or the SQL is provided by a third party (an SQL query we can't control for some reason). So we need to pass which column to remove to our function. Let's create a function called show_all_except.
    # Takes two arguments:
    #   1 - the data results as either a hash ref or an array ref
    #   2 - the ignore list as an anonymous array or hash
    #
    # Print statements are commented out for benchmarking.
    sub show_all_except {
        my ($dr, $ignore) = @_;
        my $output;
        if (ref($ignore) eq 'ARRAY') {
            for my $row (@{$dr}) {
                COLUMN: for my $count (0 .. $#{$row}) {
                    for (@{$ignore}) {
                        next COLUMN if $count == $_;    # skip ignored column indexes
                    }
                    $output .= "$row->[$count] ";
                }
                $output .= "\n";
            }
        }
        elsif (ref($ignore) eq 'HASH') {
            for my $row (keys %{$dr}) {
                for (keys %{$dr->{$row}}) {
                    next if $ignore->{$_};    # skip ignored column names
                    $output .= "$_ = $dr->{$row}{$_} ";
                }
                $output .= "\n";
            }
        }
        return $output;
    }
Now let's say we want to remove 'field1'. With the array we would pass show_all_except($ary_ref, [0]). That tells the function to exclude the first column, but what if the SQL query is changed in the future and index 0 is no longer field1? This is where a hash (w|c)ould become more flexible, or if you will, durable. The function would now get:

    show_all_except($hash_ref, { 'field1' => 1 });

This doesn't really speed up the process of handling a hash, though.
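To see why the hash version is more durable, consider a hypothetical schema change (the new column name here is invented for illustration, not from the code above):

    # Suppose the query later becomes:
    #   SELECT new_field, field1, field2, field3 FROM table
    # The array call still hides column 0, which is now new_field:
    show_all_except($ary_ref,  [0]);                  # hides the wrong column
    # The hash call keys on the column name, so it still hides field1:
    show_all_except($hash_ref, { 'field1' => 1 });    # still correct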

Better Hash Handling (YMMV)

But this presents us with a problem: the columns aren't in the order we wanted them. So wouldn't it be nice if we just passed an array ref of the fields we want instead of the ones to ignore?
Our show_all_except is replaced with show_these:
    show_these($hash_ref, [ 'field3', 'field2', 'field1' ]);

    sub show_these {
        my ($dr, $show) = @_;
        my $output;
        for my $row (keys %{$dr}) {
            for (@{$show}) {
                next if !exists $dr->{$row}{$_};    # skip columns not in the results
                $output .= qq!$_ = $dr->{$row}{$_} !;
            }
            $output .= "\n";
        }
        return $output;
    }

The benchmark for show_all_except with an array averages ~5/s, and show_these averages around ~5.75/s; this is a far cry from our first results, which showed the hash being twice as slow. Now if we cache our results we can improve on this even more. Here is a rather simple cache system for this discussion.
    my ($ary_ref, $hash_ref);
    my ($ary_out, $hash_out);

    timethese( 1_000, {
        array_way => sub {
            if (!$ary_ref) {
                $ary_ref = $dbh->selectall_arrayref($statement);
                $ary_out = show_all_except($ary_ref, [0]);
            }
            print $ary_out;
        },
        hash_way => sub {
            if (!$hash_ref) {
                $hash_ref = $dbh->selectall_hashref($statement, $key_field);
                $hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ]);
            }
            print $hash_out;
        },
    });

This of course yields a huge time saving in running the code, since the data is already available in a string and no additional function calls, database work, or data structure creation are needed.
We still have at least one problem that someone is going to complain about: using the keys of the hash for display will not keep the order from the SQL query. So we can modify show_these to accept a callback for sorting.
    sub show_these {
        my ($dr, $show, $sort) = @_;
        my $output;

        # Default sort: numeric order on the hash keys.
        if (!$sort) {
            $sort = sub {
                my $rhash = shift;
                return sort { $a <=> $b } keys %{$rhash};
            };
        }

        my @order = $sort->($dr);
        for my $row (@order) {
            for (@{$show}) {
                next if !exists $dr->{$row}{$_};    # skip columns not in the results
                $output .= qq!$_ = $dr->{$row}{$_} !;
            }
            $output .= "\n";
        }
        return $output;
    }
Now when we access it we can send an additional sort to it, like this:
    my $sort = sub {
        my $rhash = shift;
        return sort {
            $rhash->{$a}{'field3'} cmp $rhash->{$b}{'field3'}
        } keys %{$rhash};
    };

    $hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ], $sort);

Now we don't need to worry about SQL statement changes in the future; even if a column no longer appears in the query, no harm done*, it is simply skipped.
How does this benchmark compare to the array method**?
array_way ~5/s
hash_way ~5.88/s
That is with no caching, and everything is sorted the way we want it.
What, other than a speed improvement, does this give us? For me: freedom, readability, and easier maintenance. We are no longer tied to the SQL or other means of organizing and representing our data (assuming we provide another callback for what to wrap/pass each row with, but that is getting way too deep into it for this meditation; see the sketch below). Per page (web app) or per application we can change how data is retrieved and accessed in a somewhat durable fashion. If the customer changes his needs for the display, you simply change the order of the array ref that is being passed into show_these; your SQL is not touched. This keeps the information needed at the display level accessible and allows the SQL to be abstracted/hidden/modularized in a cleaner fashion if so desired.
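As a sketch of that wrapping idea, here is one way a row-wrapping callback could look. The function and wrapper names are invented for this meditation, not part of the code above:

    # Hypothetical extension of show_these: a $wrap callback decides how
    # each row's column values are joined and wrapped.
    sub show_these_wrapped {
        my ($dr, $show, $sort, $wrap) = @_;
        $sort ||= sub { my $h = shift; return sort { $a <=> $b } keys %{$h} };
        $wrap ||= sub { return join(' ', @_) . "\n" };    # default: plain text
        my $output;
        for my $row ($sort->($dr)) {
            my @values = map  { $dr->{$row}{$_} }
                         grep { exists $dr->{$row}{$_} } @{$show};
            $output .= $wrap->(@values);
        }
        return $output;
    }

    # The same data as an HTML table row per record:
    my $html_out = show_these_wrapped(
        $hash_ref,
        [ 'field1', 'field2', 'field3' ],
        undef,
        sub { return '<tr><td>' . join('</td><td>', @_) . "</td></tr>\n" },
    );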

Caching

Caching can help us if we think the data is going to be consistent for at least some period of time. How long depends on the application's requirements: say database updates occur every 30 minutes, and in the 30 minutes between we get 1,000 requests for a particular set of data. If the data changes every second, then caching techniques are usually worthless. I am not going to go in depth into the different types of caching, but I do hope that others will share what has worked for them. Possible techniques include simple in-memory caching (as in the benchmark above, and sketched below), file-based caching, and shared-memory caching; there are also several caching modules on CPAN. Numerous other methods are discussed in various books and in flame wars on mailing lists, but it is my opinion that hashes make the maintainer's life easier in the long run. Since hashes are slow, do everything you can to optimize your code with that in mind.
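A minimal sketch of the in-memory flavor (the 30-minute TTL matches the example above; the function name and keying on the SQL text are my own choices for illustration):

    # Minimal time-expiry cache around the query, keyed on the SQL text.
    my %cache;
    my $ttl = 30 * 60;    # data is assumed to change every 30 minutes

    sub cached_results {
        my ($dbh, $statement, $key_field) = @_;
        my $entry = $cache{$statement};
        if (!$entry || time() - $entry->{stamp} > $ttl) {
            $entry = {
                stamp => time(),
                data  => $dbh->selectall_hashref($statement, $key_field),
            };
            $cache{$statement} = $entry;
        }
        return $entry->{data};
    }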

DISCLAIMER: I adopted this way of thinking because at a past place of employment the requirements were constantly being changed and it was difficult to keep the presentation work up to speed. If you are working in a more stable environment, I recommend using arrays for the most part, because they will generally be faster without as much tweaking.

* OK, there can be harm done if it affects your table layout, which has happened to me, but there are no errors, and hopefully we aren't making blind changes to live critical apps, right?

** I know the array method *could* be optimized, but I am selling hashes here so give me a break :^)


I see you coming, jeffa

UPDATE: Removed the PrintError from the connect statement; another monk pointed out that it is not needed with RaiseError.

Replies are listed 'Best First'.
Re (tilly) 1: Using hashes instead of arrays with database results
by tilly (Archbishop) on Jan 29, 2002 at 09:12 UTC
    Three observations.

    The first is that if you introduce a little latency between your database server and your client machine (eg a firewall between the two, some load-balancing software, etc), the effect of that can rapidly dwarf the overhead of hashes. (True story. A database loading script for Sybase I was working on slowed massively. The problem? The rows got too big for the packets, and passing data slowed massively.)

    Do you know where your bottlenecks are? Immediately leaping to use arrays rather than hashes is an excellent example of premature optimization. Depending on your queries, the simple act of organizing your tables well and adding the right indexes can make a far larger difference than the act of accessing data. If it is a user visible request, how many hash accesses are you really going to do? How much string manipulation are you going to do right afterwards that is just as inefficient?

    Which all comes back to the fundamental point that any kind of optimization doesn't matter unless you need it. It sounds silly when said that way, but it is true. Either you need to hit a performance mark, in which case you are really, really sad if you don't, or you don't have a problem, in which case it didn't matter. There isn't much ground in between.

    So yes. Arrays are faster than hashes. Wonderful. If I was going to work on a data warehouse, I might care. But I don't. So as long as I can get my measly 80,000 row tables calculated and loaded from scratch in a night, I don't care about efficiency. Hmm, I am nowhere close to having a problem now, how about later? Well in 5 years they will be 160,000 row tables, and hardware will be 10x as fast. I am failing to see a problem here.

    Now if you are pushing the envelope, then disregard this. There are people who work with data warehouses and need to really understand performance. (Though competent ones seem to make hard decisions about what is penny wise and pound foolish pretty ruthlessly. A friend of mine told me how he used closures pretty ruthlessly as an abstraction technique in a loading script. The overhead of calling all of those functions took an extra $30K of hardware, the project got delivered 6 weeks sooner, it was worth it.)

    Likewise high performance websites (which I thankfully have little experience with - nor do I want to) have their own needs and may care. But the average corporate intranet site gets how many hits per minute? Again, if it is an issue for you, then worry about it. It isn't for most of us though.

    PS Please don't mistake my attitude for an implicit admission that I am unable to get performance when I want it. Ask jeffa about that. :-)

      True story. A database loading script for Sybase I was working on slowed massively. The problem? The rows got too big for the packets, and passing data slowed massively.

      I'm sure you know this - but you can configure the Sybase database server to handle larger packets (up to 8k, I think), and then the client apps can use this as well (I think DBD::Sybase can do this - if not I'll have to add it in!).
      It's a potentially huge performance increase for certain types of applications...

      Michael

        Thanks, I did know that. The setting in DBD::Sybase is already there, it is called packetSize.
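        For the record, packetSize goes in the DBD::Sybase connection string; a sketch (the server name, credentials, and size here are placeholders - check your server's configured maximum):

            use DBI;
            my $dbh = DBI->connect(
                'dbi:Sybase:server=SYBSRV;packetSize=8192',
                'user', 'password',
                { RaiseError => 1 },
            );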

        Doesn't help if your OS settings don't support it, or if you don't control the database...

Re: Using hashes instead of arrays with database results
by gav^ (Curate) on Jan 29, 2002 at 10:01 UTC

    I tend to either use hashes or bound values more than arrays anyway.

    I use HTML::Template quite a bit and often use the hashref/arrayref calls to match up with a template. For example:

    my @nodes = ();
    my $sth = $dbh->prepare(q{
        SELECT node_id, title, body
        FROM node
        WHERE parent_id = ?
    });
    $sth->execute($parent_id);
    while (my $row = $sth->fetchrow_hashref) {
        push @nodes, $row;
    }
    $sth->finish;

    foreach my $node (@nodes) {
        $sth->execute($node->{node_id});
        $node->{children} = $sth->fetchall_arrayref({ node_id => 1, title => 1 });
        $sth->finish;
    }

    $t->param(NODES => \@nodes);
    This is from some code that I'm hacking about at the moment; it matches up to a nested <TMPL_LOOP> in the template.
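    For anyone unfamiliar with HTML::Template, a template fragment matching that structure might look something like this (the markup is my own illustration, not the actual template from this code):

        <TMPL_LOOP NAME="NODES">
          <h2><TMPL_VAR NAME="title"></h2>
          <p><TMPL_VAR NAME="body"></p>
          <ul>
          <TMPL_LOOP NAME="children">
            <li><TMPL_VAR NAME="title"></li>
          </TMPL_LOOP>
          </ul>
        </TMPL_LOOP>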

    It's not designed for performance. If it isn't fast enough, there are loads of things I'd do before changing hashes to arrays...

    gav^

      You hit the nail on the head! gav^++

      Having discussed this issue a bit further with trs80, i strongly feel that he should be using HTML::Template instead of jumping through hoops to suppress the display of one or more columns. {chuckle} And what data structure does HTML::Template want for its loops? A list of hashes. ;)

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
(jeffa) Re: Using hashes instead of arrays with database results
by jeffa (Bishop) on Jan 29, 2002 at 09:25 UTC
    :)

    Yes, i mentioned that i tend to avoid using hashes to store database result sets, but that doesn't mean that i avoid hashes in database code. One of my favorite tricks is to store the names of columns in an array and a hash:

    my $sth = $dbh->prepare('select stdout,stdin from stderr');
    $sth->execute;

    # Column names in query order, and a reverse map from name to index:
    my @field = @{$sth->{'NAME'}};
    my %field = map { $field[$_] => $_ } (0 .. $#field);

    my $rows = $sth->fetchall_arrayref();
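    To make the trick concrete, a usage sketch (the row index is invented for illustration):

        # Access a column by name through the %field index map:
        my $first_stdout = $rows->[0][ $field{'stdout'} ];

        # Or walk every row, printing name=value pairs in column order:
        for my $row (@{$rows}) {
            print join(' ', map { "$_=$row->[$field{$_}]" } @field), "\n";
        }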

    Now i know what belongs where. But i have other issues with your code, namely printing inside a sub like that - i mean storing the 'print' results in a string and returning it. How about transforming the data structure into another one and then having one printing subroutine? The obvious drawback is memory consumption, of course.

    My other problem is sorting outside the database. I am sure there is at least one situation 'out there' that calls for it, but most of the time you should let the database do it.
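    For instance, ordering the meditation's hypothetical query server-side is one clause (and note that selectall_hashref would throw that order away again, since hash keys are unordered - which is exactly why the sort callback was needed):

        # Let the database do the ordering; selectall_arrayref keeps it.
        my $statement = qq~SELECT field1, field2, field3 FROM table ORDER BY field3~;
        my $sorted_rows = $dbh->selectall_arrayref($statement);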

    But, good stuff none-the-less. I am not going to refute it. I'll leave that for princepawn, whom you didn't see coming. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)