I have seen lots of references, both in books and on Perl
Monks, to the fact that hashes are slow. I know hashes
are slow, but in my opinion they provide an easier method of
accessing complex data structures. I have seen code that
uses constants with arrays, and that helps make reading
the code a little easier for me, but it is not flexible enough.
So I like to use hashes and try to write code that makes
them as fast as possible when used in the same context as
their array counterpart. (I hope :)
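For instance, the constants-with-arrays style looks something
like this (a minimal sketch; the field names and variables are
made up for illustration):

use constant FIELD1 => 0;
use constant FIELD2 => 1;

# later, with rows fetched as array refs:
for my $row (@{$ary_ref}) {
    print $row->[FIELD1], " ", $row->[FIELD2], "\n";
}

The index is hidden behind a name, which reads better, but the
mapping still has to be kept in sync with the SQL by hand.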
For my test data source the array_way was twice as fast, which
makes it the best choice if we are going to be executing this
code over and over again with drastically different SQL statements.
Array usage *might* lead to maintenance issues in
some cases. For example, suppose we are building a dynamic
data structure and we want to remove one column of results for
a particular type of viewer. We would rather not change
the SQL statement, or perhaps the SQL is provided by a third party
(an SQL query we can't control for some reason).
So we need to pass which column to remove into our function.
Let's create a function called show_all_except:
# show_all_except is passed two items:
#   1 - the data results, as either an array ref or a hash ref
#   2 - the ignore list, as an anonymous array or hash
#
# Print statements are commented out for benchmarking
#
sub show_all_except {
    my ($dr, $ignore) = @_;
    my $output;
    if (ref($ignore) eq 'ARRAY') {
        for my $row (@{$dr}) {
            COLUMN: for my $count (0 .. $#{$row}) {
                # skip any column index listed in the ignore array
                for my $skip (@{$ignore}) {
                    next COLUMN if $count == $skip;
                }
                $output .= "$row->[$count] ";
            }
            $output .= "\n";
        }
    } elsif (ref($ignore) eq 'HASH') {
        for my $row (keys %{$dr}) {
            for (keys %{$dr->{$row}}) {
                # skip any field named in the ignore hash
                next if $ignore->{$_};
                $output .= "$_ = $dr->{$row}{$_} ";
            }
            $output .= "\n";
        }
    }
    return $output;
}
Now let's say we want to remove 'field1'. With an array
ref we would pass show_all_except($ary_ref,[0]). That tells
the function to exclude the first column, but what if the
SQL query is changed in the future and 0 is no longer equal
to field1? This is where a hash (w|c)ould become more flexible,
or, if you will, durable. The function would now get:
show_all_except($hash_ref,{ 'field1' => 1 });
This doesn't really speed up the process of handling a hash
though.
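To make that concrete, here is a quick demo against hand-built
data structures (the sample rows and field names are made up;
in practice they come from selectall_arrayref and
selectall_hashref):

# two rows, as selectall_arrayref would return them
my $ary_ref = [ [ 1, 'foo', 'x' ], [ 2, 'bar', 'y' ] ];

# the same rows keyed on field1, as selectall_hashref would return them
my $hash_ref = {
    1 => { field1 => 1, field2 => 'foo', field3 => 'x' },
    2 => { field1 => 2, field2 => 'bar', field3 => 'y' },
};

print show_all_except($ary_ref,  [0]);                 # drop column 0
print show_all_except($hash_ref, { 'field1' => 1 });   # drop field1 by name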
Better Hash Handling (YMMV)
But this presents us with a problem: the columns aren't in the order
we wanted them. So wouldn't it be
nice if we just passed an array ref of the fields
we want instead of the ones to ignore?
So our show_all_except is replaced with show_these:
show_these($hash_ref, [ 'field3', 'field2', 'field1' ]);
sub show_these {
    my ($dr, $show) = @_;
    my $output;
    for my $row (keys %{$dr}) {
        for (@{$show}) {
            # skip fields that aren't in the result set
            next if !exists $dr->{$row}{$_};
            $output .= qq!$_ = $dr->{$row}{$_} !;
        }
        $output .= "\n";
    }
    return $output;
}
In my benchmarks, show_all_except with an array averages
~5/s and show_these averages around ~5.75/s. This is
a far cry from our first results, which showed the hash
being twice as slow. Now, if we cache our results, we can
reduce the cost even more. Here is a rather simple cache
system for this discussion:
use Benchmark qw(timethese);   # $dbh, $statement, and $key_field as before

my $ary_ref;
my $hash_ref;
my $ary_out;
my $hash_out;
timethese( 1_000, {
    array_way => sub {
        if (!$ary_ref) {
            $ary_ref = $dbh->selectall_arrayref($statement);
            $ary_out = show_all_except($ary_ref, [0]);
        }
        print $ary_out;
    },
    hash_way => sub {
        if (!$hash_ref) {
            $hash_ref = $dbh->selectall_hashref($statement, $key_field);
            $hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ]);
        }
        print $hash_out;
    },
} );
This, of course, yields a huge time savings on repeated runs,
since the data is already available in a string and no additional
function calls, database access, or data structure creation are needed.
We still have at least one problem that someone is going to
complain about: using the hash keys for display will
not preserve the order from the SQL query. So we can modify our
show_these to accept a callback for sorting:
sub show_these {
    my ($dr, $show, $sort) = @_;
    my $output;
    # default sort: order rows numerically by their keys
    if (!$sort) {
        $sort = sub {
            my $rhash = shift;
            return sort { $a <=> $b } keys %{$rhash};
        };
    }
    my @order = $sort->($dr);
    for my $row (@order) {
        for (@{$show}) {
            # skip fields that aren't in the result set
            next if !exists $dr->{$row}{$_};
            $output .= qq!$_ = $dr->{$row}{$_} !;
        }
        $output .= "\n";
    }
    return $output;
}
Now when we call it we can pass in our own sort,
like this:
my $sort = sub {
    my $rhash = shift;
    return sort {
        $rhash->{$a}{'field3'} cmp $rhash->{$b}{'field3'}
    } keys %{$rhash};
};
$hash_out = show_these($hash_ref, [ 'field1', 'field2', 'field3' ], $sort);
Now we don't need to worry about SQL statement changes
in the future; even if a column no longer appears in the
query, no harm done*, it is simply skipped.
How does this benchmark compare to the array method**?
array_way ~5/s
hash_way ~5.88/s
No caching, and everything is sorted the way we want it.
What, other than a speed improvement, does this give us?
For me: freedom, readability, and easier maintenance. We are no
longer tied to the SQL or any other single means of organizing
and representing our data ( assuming we also provide another
callback for what to wrap/pass each row with; a sketch of that
follows below, though it is getting fairly deep for this
meditation ). Per page (web app) or per
application we can change how data is retrieved and accessed
in a somewhat durable fashion. If the customer changes his
needs for the display, you simply change the order of the
array ref that is being passed into show_these for the
order; your SQL is not touched. This keeps the information
needed at the display level accessible and allows the SQL
to be abstracted/hidden/modularized in a cleaner fashion if
so desired.
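As a sketch of that row-callback idea (this variant is my own
illustration, not part of the benchmarks above), show_these
could grow a fourth argument:

sub show_these {
    my ($dr, $show, $sort, $wrap) = @_;
    my $output;
    # defaults: order rows numerically by key, emit plain text lines
    $sort ||= sub { my $rhash = shift; sort { $a <=> $b } keys %{$rhash} };
    $wrap ||= sub { my $row_text = shift; "$row_text\n" };
    for my $row ($sort->($dr)) {
        my $row_text = '';
        for (@{$show}) {
            next if !exists $dr->{$row}{$_};
            $row_text .= qq!$_ = $dr->{$row}{$_} !;
        }
        # let the caller decide how each row is wrapped
        $output .= $wrap->($row_text);
    }
    return $output;
}

# e.g. wrap each row for an HTML table instead of plain text:
my $html_out = show_these($hash_ref, [ 'field1', 'field2' ], undef,
                          sub { "<tr><td>$_[0]</td></tr>\n" });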
Caching
Caching can
help us if we think the data is going to be consistent
for at least some period of time. The length of time depends on
application requirements; for example, if database updates occur
every 30 minutes and in the 30 minutes between we get 1000 requests
for a particular set of data, caching is a clear win. If the data
changes every second, then caching techniques are usually worthless.
I am not going to go in depth into the different types of caching;
however, I do hope that others will share what has worked for them.
Some possible caching techniques are:
- Matching SQL queries: if the same query comes in and we
know, by some flag or variable, that the data has not changed,
then we reuse the results of the previous request (see the
sketch after this list).
- If the query doesn't change, then we just test the
value the last query stored in our variable.
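Here is a minimal sketch of the first technique, keyed on the
SQL text (the %cache variable and the cached_select wrapper are
my own illustration):

my %cache;

sub cached_select {
    my ($dbh, $statement, $key_field) = @_;
    # reuse the previous result for an identical query;
    # a real system would also invalidate when the data changes
    $cache{$statement} ||= $dbh->selectall_hashref($statement, $key_field);
    return $cache{$statement};
}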
So what do we do if we aren't in a persistent environment?
Depending on the size of the results, the amount of access the
system receives, and the size/speed of the database source, we
can write the results to disk via Storable.
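Something along these lines (a minimal sketch; the file name
and any freshness check are up to your application):

use Storable qw(store retrieve);

my $cache_file = '/tmp/query_cache';   # hypothetical location

# write the structure to disk after fetching it...
store($hash_ref, $cache_file);

# ...and on a later request, read it back instead of hitting the database
my $cached = retrieve($cache_file);
print show_these($cached, [ 'field1', 'field2', 'field3' ]);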
There are numerous other methods discussed in various books
and flame wars on mailing lists about this, but it is my
opinion that hashes make the maintainer's life easier in the
long run. Since hashes are slow, do everything you can to
optimize your code with that in mind.
There are also several caching modules on CPAN.
DISCLAIMER: I adopted this way of thinking because at a past
place of employment the requirements were constantly changing
and it was difficult to keep the presentation work
up to speed. If you are working in a more stable environment,
I recommend using arrays for the most part because they will
generally be faster without as much tweaking.
* OK, there can be harm done if it affects your table layout,
which has happened to me, but there are no errors, and hopefully
we aren't making blind changes to live critical apps, right?
** I know the array method *could* be optimized, but I am
selling hashes here so give me a break :^)
I see you coming
jeffa
UPDATE: Removed the PrintError from the connect
statement; another monk pointed out that it is not needed
with RaiseError.