comment on

Hello,

I've started to hit a wall with SQL and sort-of had this idea of moving a certain piece of functionality out of the SQL query and into something which operates more quickly. Speed is vital here, as I will explain.

First of all, what I am doing involves running a query against a data table and joining/grouping that result by what is stored in a category table. Thus giving me a count of all the items in the data table which fall into a given category.

Something like:

category|count
--------|-----
a       |22
a.b     |10
a.b.a   |2
a.b.a.a |1
a.b.a.b |1
a.b.b   |8
a.b.b.a |6
a.b.b.b |2
a.c     |12
a.c.a   |4
a.c.a.a |2
a.c.a.b |2
a.c.b   |8
a.c.b.a |6
a.c.b.b |2
[download]

.. I just typed this, it is not real data.. Hopefully I got all the math right and things add up correctly.

While doing this sort of query works well for smaller tables, the data has grown dramatically over the past several months and I now have close to 100,000 rows in this table. On my fast production systems, performance is "adequate" (2 or 3 seconds). On my incredibly slow development system it is unacceptable (30+ seconds). What I am trying to achieve is "adequate" performance on my development system.

One way I thought about doing this was to have a separate process/server running that when fed a list of IDs would look up each of them in its cache, requerying the database for any which have grown stale/etc.

The first order of business was to determine if this mountain can be moved. My test script is:

use strict;
use Data::Dumper;
use Time::HiRes qw(time);
print STDERR "Building cats\n";
my %cats = ();
foreach my $one ('a'..'d')
{
  $cats{"a.$one"} = "a.$one";
  foreach my $two ('a'..'z')
  {
    $cats{"a.$one.$two"} = "a.$one.$two";
    foreach my $three ('a'..'z')
    {
      $cats{"a.$one.$two.$three"} = "a.$one.$two.$three";
    }
  }
}
my @keys = sort keys %cats;
my $cnt = scalar @keys;
print STDERR "Building array from $cnt possibilities\n";
my $n = 100000;
my %ids = ();
foreach my $id (0..$n)
{
  my $rand = int(rand($cnt));
  $ids{$id} = $cats{$keys[$rand]};
}
@keys = keys %ids;
$cnt = scalar @keys;
my %ret = ();
my $x = 100000;
my $start_time = time();
print STDERR "Returning mapped categories for values from a pool of $c
+nt\n";
foreach my $id (0..$x)
{
  my $cat = $ids{$id};
  if(!$ret{$ids{$id}}{'count'})
  {
    $ret{$ids{$id}}{'count'} = 0;
    $ret{$ids{$id}}{'head_count'} = 'head_count';
    $ret{$ids{$id}}{'cat_count'} = 'cat_count';
    $ret{$ids{$id}}{'subcat_count'} = 'subcat_count';
  }
  $ret{$ids{$id}}{'count'}++;
}
@keys = keys %ret;
$cnt = scalar @keys;
my $end_time = time();
my $total_time = $end_time - $start_time;
print STDERR "Built $cnt entries to start with, $total_time seconds\n"
+;
while(my($key,$val) = each(%ret))
{
  my @parts = split(/\./, $key);
  my $id = '';
  foreach my $add (@parts)
  {
    $id .= $add;
    if($id eq $key)
    {
      next;
    }
    if(!$ret{$id}{'count'})
    {
      $ret{$id}{'count'} = 0;
      $ret{$id}{'head_count'} = 'head_count';
      $ret{$id}{'cat_count'} = 'cat_count';
      $ret{$id}{'subcat_count'} = 'subcat_count';
    }
    $ret{$id}{'count'} += $ret{$key}{'count'};
    $id .= '.';
  }

}
$end_time = time();
$total_time = $end_time - $start_time;
my @sorted_keys = sort keys(%ret);
my $tot = $ret{'a'}{'count'};
print STDERR "Completed in $total_time seconds ($tot)\n";
foreach my $key (@sorted_keys)
{
  my $cnt = $ret{$key}{'count'};
    #print "KEY: $key $cnt\n";
}
print STDERR "Sleeping...\n";
sleep(1000);
[download]

Most of this script is just being used to setup the data/cache for later use. The real work I am trying to measure begins with the "Returning mapped categories for values..." and ends at the "Completed in $total_time" message.

As you can probably tell from the code I am setting up 2812 categories and testing how quickly I can map 100,001 of them.

The last sleep is just so I can look at the process and guage how much memory it is using.

So with all this being said, I get a run time of about 2.1 seconds on my development system. Do you see any obvious ways that this section of the code could be improved, considering the importance of raw performance here?

My second question is... The recipient of what this server would output would actually be a PHP script. I wrote a PHP version of this same test script, but the performance was much worse (3.6 seconds). What might be a quick and efficient means to connect this up to PHP/Apache?

.... Of course, I am very open to any other possible solutions to this problem, this is just one I thought of tonight, but I'm sure there must be a better way...

Thank you for your help and consideration!

- dEvNuL

In reply to Architectural question... by devnul

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.