I have seen lots of references, both in books and on Perl
Monks, to the fact that hashes are slow. I know hashes
are slow, but in my opinion they provide an easier method of
accessing complex data structures. I have seen code that
uses constants with arrays, and that helps make reading
the code a little easier for me, but it is not flexible enough.
So I like to use hashes and try to write code that makes
them as fast as possible when used in the same context as
their array counterpart. (I hope :)
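For instance, the constants-with-arrays style looks something
like this (a minimal sketch; the field names and variables are
made up for illustration):

use constant FIELD1 => 0;
use constant FIELD2 => 1;

# later, with rows fetched as array refs:
for my $row (@{$ary_ref}) {
    print $row->[FIELD1], " ", $row->[FIELD2], "\n";
}

The index is hidden behind a name, which reads better, but the
mapping still has to be kept in sync with the SQL by hand.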
For my test data source the array_way was twice as fast, which
makes it the best choice if we are going to be executing this
code over and over again with drastically different SQL statements.
Array usage *might* lead to maintenance issues in
some cases. For example, suppose we are building a dynamic
data structure and we want to remove one column of results for
a particular type of viewer. We would rather not change
the SQL statement, or perhaps the SQL is provided by a third party
(an SQL query we can't control for some reason).
So we need to pass which column to remove into our function.
Let's create a function called show_all_except:
# show_all_except is passed two items:
#   1 - the data results, as either an array ref or a hash ref
#   2 - the ignore list, as an anonymous array or hash
#
# Print statements are commented out for benchmarking
#
sub show_all_except {
    my ($dr, $ignore) = @_;
    my $output;
    if (ref($ignore) eq 'ARRAY') {
        for my $row (@{$dr}) {
            COLUMN: for my $count (0 .. $#{$row}) {
                # skip any column index listed in the ignore array
                for my $skip (@{$ignore}) {
                    next COLUMN if $count == $skip;
                }
                $output .= "$row->[$count] ";
            }
            $output .= "\n";
        }
    } elsif (ref($ignore) eq 'HASH') {
        for my $row (keys %{$dr}) {
            for (keys %{$dr->{$row}}) {
                # skip any field named in the ignore hash
                next if $ignore->{$_};
                $output .= "$_ = $dr->{$row}{$_} ";
            }
            $output .= "\n";
        }
    }
    return $output;
}
Now let's say we want to remove 'field1'. With an array
ref we would pass show_all_except($ary_ref,[0]). That tells
the function to exclude the first column, but what if the
SQL query is changed in the future and 0 is no longer equal
to field1? This is where a hash (w|c)ould become more flexible,
or, if you will, durable. The function would now get:
show_all_except($hash_ref,{ 'field1' => 1 });
This doesn't really speed up the process of handling a hash
though.
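To make that concrete, here is a quick demo against hand-built
data structures (the sample rows and field names are made up;
in practice they come from selectall_arrayref and
selectall_hashref):

# two rows, as selectall_arrayref would return them
my $ary_ref = [ [ 1, 'foo', 'x' ], [ 2, 'bar', 'y' ] ];

# the same rows keyed on field1, as selectall_hashref would return them
my $hash_ref = {
    1 => { field1 => 1, field2 => 'foo', field3 => 'x' },
    2 => { field1 => 2, field2 => 'bar', field3 => 'y' },
};

print show_all_except($ary_ref,  [0]);                 # drop column 0
print show_all_except($hash_ref, { 'field1' => 1 });   # drop field1 by name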
Better Hash Handling (YMMV)
But this presents us with a problem: the columns aren't in the order
we wanted them. So wouldn't it be
nice if we just passed an array ref of the fields
we want instead of the ones to ignore?
So our show_all_except is replaced with show_these:
show_these($hash_ref, [ 'field3', 'field2', 'field1' ]);
sub show_these {
    my ($dr, $show) = @_;
    my $output;
    for my $row (keys %{$dr}) {
        for (@{$show}) {
            # skip fields that aren't in the result set
            next if !exists $dr->{$row}{$_};
            $output .= qq!$_ = $dr->{$row}{$_} !;
        }
        $output .= "\n";
    }
    return $output;
}
In my benchmarks, show_all_except with an array averages
~5/s and show_these averages around ~5.75/s. This is
a far cry from our first results, which showed the hash
being twice as slow. Now, if we cache our results, we can
reduce the cost even more. Here is a rather simple cache
system for this discussion:
use Benchmark qw(timethese);   # $dbh, $statement, and $key_field as before

my $ary_ref;
my $hash_ref;
my $ary_out;
my $hash_out;
timethese( 1_000, {
    array_way => sub {
        if (!$ary_ref) {
            $ary_ref = $dbh->selectall_arrayref($statement);
            $ary_out = show_all_except($ary_ref, [0]);
        }
        print $ary_out;
    },
    hash_way => sub {
        if (!$hash_ref) {
            $hash_ref = $dbh->selectall_hashref($statement, $key_field);
            $hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ]);
        }
        print $hash_out;
    },
} );
This, of course, yields a huge time savings on repeated runs,
since the data is already available in a string and no additional
function calls, database access, or data structure creation are needed.
We still have at least one problem that someone is going to
complain about: using the hash keys for display will
not preserve the order from the SQL query. So we can modify our
show_these to accept a callback for sorting:
sub show_these {
    my ($dr, $show, $sort) = @_;
    my $output;
    # default sort: order rows numerically by their keys
    if (!$sort) {
        $sort = sub {
            my $rhash = shift;
            return sort { $a <=> $b } keys %{$rhash};
        };
    }
    my @order = $sort->($dr);
    for my $row (@order) {
        for (@{$show}) {
            # skip fields that aren't in the result set
            next if !exists $dr->{$row}{$_};
            $output .= qq!$_ = $dr->{$row}{$_} !;
        }
        $output .= "\n";
    }
    return $output;
}
Now when we call it we can pass in our own sort,
like this:
my $sort = sub {
    my $rhash = shift;
    return sort {
        $rhash->{$a}{'field3'} cmp $rhash->{$b}{'field3'}
    } keys %{$rhash};
};
$hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ], $sort);
Now we don't need to worry about SQL statement changes
in the future; even if a column no longer appears in the
query, no harm done*, it is simply skipped.
How does this benchmark compare to the array method**?
array_way ~5/s
hash_way ~5.88/s
No caching, and everything is sorted the way we want it.
What, other than a speed improvement, does this give us?
For me: freedom, readability, and easier maintenance. We are no
longer tied to the SQL or any other single means of organizing
and representing our data ( assuming we also provide another
callback for what to wrap/pass each row with; a sketch of that
follows below, though it is getting fairly deep for this
meditation ). Per page (web app) or per
application we can change how data is retrieved and accessed
in a somewhat durable fashion. If the customer changes his
needs for the display, you simply change the order of the
array ref that is being passed into show_these for the
order; your SQL is not touched. This keeps the information
needed at the display level accessible and allows the SQL
to be abstracted/hidden/modularized in a cleaner fashion if
so desired.
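As a sketch of that row-callback idea (this variant is my own
illustration, not part of the benchmarks above), show_these
could grow a fourth argument:

sub show_these {
    my ($dr, $show, $sort, $wrap) = @_;
    my $output;
    # defaults: order rows numerically by key, emit plain text lines
    $sort ||= sub { my $rhash = shift; sort { $a <=> $b } keys %{$rhash} };
    $wrap ||= sub { my $row_text = shift; "$row_text\n" };
    for my $row ($sort->($dr)) {
        my $row_text = '';
        for (@{$show}) {
            next if !exists $dr->{$row}{$_};
            $row_text .= qq!$_ = $dr->{$row}{$_} !;
        }
        # let the caller decide how each row is wrapped
        $output .= $wrap->($row_text);
    }
    return $output;
}

# e.g. wrap each row for an HTML table instead of plain text:
my $html_out = show_these($hash_ref, [ 'field1', 'field2' ], undef,
                          sub { "<tr><td>$_[0]</td></tr>\n" });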
Caching
Caching can
help us if we think the data is going to be consistent
for at least some period of time. The length of time depends on
application requirements; for example, if database updates occur
every 30 minutes and in the 30 minutes between we get 1000 requests
for a particular set of data, caching is a clear win. If the data
changes every second, then caching techniques are usually worthless.
I am not going to go in depth into the different types of caching;
however, I do hope that others will share what has worked for them.
Some possible caching techniques are:
- Matching SQL queries: if the same query comes in and we
know, by some flag or variable, that the data has not changed,
then we reuse the results of the previous request (see the
sketch after this list).
- If the query doesn't change, then we just test the
value the last query stored in our variable.
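Here is a minimal sketch of the first technique, keyed on the
SQL text (the %cache variable and the cached_select wrapper are
my own illustration):

my %cache;

sub cached_select {
    my ($dbh, $statement, $key_field) = @_;
    # reuse the previous result for an identical query;
    # a real system would also invalidate when the data changes
    $cache{$statement} ||= $dbh->selectall_hashref($statement, $key_field);
    return $cache{$statement};
}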
So what do we do if we aren't in a persistent environment?
Depending on the size of the results, the amount of access the
system receives, and the size/speed of the database source, we
can write the results to disk via Storable.
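Something along these lines (a minimal sketch; the file name
and any freshness check are up to your application):

use Storable qw(store retrieve);

my $cache_file = '/tmp/query_cache';   # hypothetical location

# write the structure to disk after fetching it...
store($hash_ref, $cache_file);

# ...and on a later request, read it back instead of hitting the database
my $cached = retrieve($cache_file);
print show_these($cached, [ 'field1', 'field2', 'field3' ]);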
There are numerous other methods discussed in various books
and flame wars on mailing lists about this, but it is my
opinion that hashes make the maintainer's life easier in the
long run. Since hashes are slow, do everything you can to
optimize your code with that in mind.
There are also several caching modules on CPAN.
DISCLAIMER: I adopted this way of thinking because at a past
place of employment the requirements were constantly changing
and it was difficult to keep the presentation work
up to speed. If you are working in a more stable environment,
I recommend using arrays for the most part because they will
generally be faster without as much tweaking.
* OK, there can be harm done if it affects your table layout,
which has happened to me, but there are no errors, and hopefully
we aren't making blind changes to live critical apps, right?
** I know the array method *could* be optimized, but I am
selling hashes here so give me a break :^)
I see you coming
jeffa
UPDATE: Removed the PrintError from the connect
statement; another monk pointed out that it is not needed
with RaiseError.