muertetg has asked for the wisdom of the Perl Monks concerning the following question:

NOTE: I asked this question before signing up a bit ago and now I can't find it, heh. So here it is again from my brand spankin new perl monks account :) Sorry it's long.

Heya, monks :)

I'm running Perl 5.8.7 on Windows Server 2003 and connecting to SQL Server 2005 via DBI.

I have been rebuilding a perl report and came across an interesting discrepancy while approaching some calculations from two angles. To give some context, this is for a system that keeps track of agencies that scan products. I need to find the number of unique products scanned by each user for an agency within a certain time frame, by querying the scan events table.

At this point, I already know which users belong to which agencies, but I am trying to calculate them all together, rather than one agency at a time like the old report.

The old report used the following query over and over for each agency:

select username, count(distinct product) from events where username in $user_string and gmtdatetime >= '$min_date' and gmtdatetime < '$max_date' group by username;

and I receive a result. However, running this query repeatedly for all users in all agencies takes a heavy toll on SQL Server. So, to get system-wide calculations, I opted to pull the records without grouping or counting and sort out the unique ones directly in Perl, which seems to take much less time. Unfortunately, the numbers calculated this way are much higher than those from querying for the counts directly, and I don't know why. I use the following query:

select username, product from events where username in $user_string and gmtdatetime >= '$min_date' and gmtdatetime < '$max_date'

and then I create two hashes, one that holds the unique products per agency, and one that holds the unique products per user, using the product id as the hash key in both instances. I do a check to see if the product exists in the agency hash and if it does not, I add it to the agency hash and also to the user hash. At the end, I count the keys on each element in the user hash to get the unique products scanned for that user:

while (my @row = $res->fetchrow_array()) {
    my ($username, $product) = @row;
    # mark product as used
    if (!exists($agency_products{$user_agency{$username}}{$product})) {
        $agency_products{$user_agency{$username}}{$product} = 1;
        $user_products{$username}{$product} = 1;
    }
}

Then I use the following to get the count for each user:

foreach my $i_user (keys(%stats)) {
    $stats{$i_user}{'UNIQUE_PRODUCTS'} = keys(%{$user_products{$i_user}});
}

But this gives me a higher result than the SQL-only method. Does anyone have any inkling as to why this might be?

Thanks in advance for your help!

Ryan

Re: Perl/SQL Server Count Discrepancy
by ELISHEVA (Prior) on Aug 14, 2009 at 06:18 UTC

    Your first stop should be your DBA or a schema dump. If the product and user name columns are properly indexed, the query shouldn't take a heavy toll on the server. If they are indexed, I'd run the query through EXPLAIN (or the equivalent command if you are using a DBMS other than MySQL) to see whether it is using the index. Some DBMSs don't do a good job of optimizing queries run on views, especially if temporary tables are involved (the GROUP BY clause often creates such a table). In that case, run the query on the original tables rather than the views. Check with your DBA if you need special permissions set up for this.

    Using Perl to optimize a database query should usually be a last resort.

    If you really do insist on using Perl, two separate hashes while reading in rows is not the way to do it. The net effect of the original query is to select all unique pairings of user_name/product and print out the count of products per user. If you are intent on emulating the effect of GROUP BY and count(distinct), you really only need one hash:

    my %hUsers;

    # find all unique combinations of user and product
    while (my @row = $res->fetchrow_array()) {
        my ($username, $product) = @row;
        $hUsers{$username}{$product} = 1;
    }

    # print product counts per user
    foreach my $user_name (sort keys %hUsers) {
        print $user_name, ' ', scalar(keys %{ $hUsers{$user_name} }), "\n";
    }
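    To try the single-hash idea without a database, here is a minimal sketch that uses a hard-coded list of rows in place of the fetchrow_array() results. The sample usernames and product ids, and the helper name count_unique_products, are made up for illustration. A product scanned by two users counts once for each of them, which is what count(distinct product) ... group by username does.

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for fetchrow_array() results:
# each row is [username, product].
my @rows = (
    [ 'alice', 'P1' ],
    [ 'alice', 'P1' ],   # duplicate scan, counted once
    [ 'alice', 'P2' ],
    [ 'bob',   'P1' ],   # same product as alice, still counts for bob
);

# Count distinct products per user with a single hash of hashes.
sub count_unique_products {
    my @rows = @_;
    my %seen;
    for my $row (@rows) {
        my ($username, $product) = @$row;
        $seen{$username}{$product} = 1;
    }
    my %count;
    $count{$_} = scalar keys %{ $seen{$_} } for keys %seen;
    return %count;
}

my %count = count_unique_products(@rows);
print "$_ $count{$_}\n" for sort keys %count;
```

    Keying on username first and product second means no cross-user bookkeeping is needed; the number of inner keys is the per-user distinct count.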

    A final note. If you decide to handle the product count per user in Perl, you might want to use DISTINCT in your query: SELECT DISTINCT username, product ... will reduce the number of rows returned by the server without the optimization problems sometimes caused by GROUP BY clauses.
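    As a rough sketch of the DISTINCT variant, the query text could be built like this. The table and column names come from the original post; the helper name build_distinct_query, the sample values, and the switch from an interpolated $user_string to one placeholder per username are assumptions for illustration, not something the original report does.

```perl
use strict;
use warnings;

# Build a SELECT DISTINCT query with one placeholder per username.
# Returns the SQL text followed by the bind values, in order.
sub build_distinct_query {
    my ($users, $min_date, $max_date) = @_;
    die "need at least one username\n" unless @$users;
    my $in_list = join ', ', ('?') x @$users;
    my $sql = "SELECT DISTINCT username, product FROM events "
            . "WHERE username IN ($in_list) "
            . "AND gmtdatetime >= ? AND gmtdatetime < ?";
    return ($sql, @$users, $min_date, $max_date);
}

my ($sql, @bind) = build_distinct_query(
    [ 'alice', 'bob' ], '2009-08-01', '2009-08-08',
);
print "$sql\n";

# With a connected DBI handle $dbh (not shown here), this would be run as:
#   my $res = $dbh->prepare($sql);
#   $res->execute(@bind);
```

    Each row that comes back is then already a unique username/product pair, so the Perl side only has to count rows per user.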

    Best, beth

    Update: added code to emulate query.

    Update: suggested use of DISTINCT.