zwon has asked for the wisdom of the Perl Monks concerning the following question:

Evening!

I'm trying to solve a problem that looks rather standard, yet I couldn't find a trivial solution, so I'm asking for your help. Here's a simple DBIx::Class schema:

package MyDatabase;
use base qw/DBIx::Class::Schema/;
__PACKAGE__->load_namespaces;
1;
package MyDatabase::Result::Artist;
use strict;
use warnings;
use base qw/DBIx::Class::Core/;
__PACKAGE__->table('artist');
__PACKAGE__->add_columns(
    artistid => { data_type => 'integer', is_auto_increment => 1, },
    name     => { data_type => 'text',    is_nullable       => 0, },
);
__PACKAGE__->set_primary_key('artistid');
__PACKAGE__->add_unique_constraint(['name']);
__PACKAGE__->has_many( 'cds' => 'MyDatabase::Result::Cd' );
1;
package MyDatabase::Result::Cd;
use strict;
use warnings;
use base qw/DBIx::Class::Core/;
__PACKAGE__->table('cd');
__PACKAGE__->add_columns(
    cdid   => { data_type => 'integer', is_auto_increment => 1, },
    artist => { data_type => 'integer', is_nullable       => 0, },
    name   => { data_type => 'text',    is_nullable       => 0, },
);
__PACKAGE__->set_primary_key('cdid');
__PACKAGE__->belongs_to( 'artist' => 'MyDatabase::Result::Artist' );
1;

In my case the "artist" table is relatively small, so I can keep it in memory without any problems; records in this table are never deleted or changed, but new records may be added. I need to insert a lot of records into the "cd" table, and I'm trying to find a method that is both elegant and efficient, but so far I have failed. The problem is that I can't make DBIx::Class cache the "artist" table in memory, so every time I insert a new record into "cd" it performs a SELECT from "artist". Here's a script that demonstrates the problem:

use strict;
use warnings;
use Benchmark;
use MyDatabase;

my $NUM = 10000;
my $schema = MyDatabase->connect( "dbi:SQLite:dbname=cd.sqlite", '', '',
    { AutoCommit => 0 } );
my @artists = map { "Artist$_" } 1 .. 20;

# recreate database
sub resetdb {
    $schema->deploy( { add_drop_table => 1 } );
    $schema->resultset('Artist')->populate( [ map { { name => $_ } } @artists ] );
}

#######
# This looks good and correct, but slow
sub test1 {
    resetdb();
    my $rs_cd = $schema->resultset('Cd');
    for ( 1 .. $NUM ) {
        $rs_cd->create(
            {
                artist => { name => $artists[ $_ % @artists ] },
                name   => "Cd$_",
            }
        );
    }
}

#######
# This one I found in Catalyst Advent Calendar:
# http://www.catalystframework.org/calendar/2009/17
# Looks good, but the same problem -- it's slow
sub populate {
    resetdb();
    my @data = map {
        { artist => { name => $artists[ $_ % @artists ] }, name => "Cd$_" }
    } 1 .. $NUM;
    $schema->resultset('Cd')->populate( \@data );
}

#######
# This one works fast, but looks not so good, especially taking into account
# that I have more than one related table. Also it doesn't insert new artists.
sub test2 {
    resetdb();
    my $rs_cd = $schema->resultset('Cd');

    # save artists table into hash
    my %artists = map { ( $_->name, $_->artistid ) }
        $schema->resultset('Artist')->all;
    for ( 1 .. $NUM ) {
        $rs_cd->create(
            {
                artist => $artists{ $artists[ $_ % @artists ] },
                name   => "Cd$_",
            }
        );
    }
}

#######
# This one works fast and correct, but looks awful.
sub test3 {
    resetdb();
    my $rs_cd = $schema->resultset('Cd');

    # save artists table into hash
    my %artists = map { ( $_->name, $_->artistid ) }
        $schema->resultset('Artist')->all;
    for ( 1 .. $NUM ) {
        my $aname = $artists[ $_ % @artists ];

        # create record for artist if it doesn't exist
        unless ( $artists{$aname} ) {
            $artists{$aname}
                = $schema->resultset('Artist')->create( { name => $aname } )->artistid;
        }
        $rs_cd->create(
            {
                artist => $artists{$aname},
                name   => "Cd$_",
            }
        );
    }
}

timethese(
    1,
    {
        'test1'    => \&test1,
        'populate' => \&populate,
        'test2'    => \&test2,
        'test3'    => \&test3,
    }
);

__END__
Benchmark: timing 1 iterations of populate, test1, test2, test3...
  populate: 21 wallclock secs (20.38 usr +  0.05 sys = 20.43 CPU) @  0.05/s (n=1)
            (warning: too few iterations for a reliable count)
     test1: 20 wallclock secs (20.44 usr +  0.02 sys = 20.46 CPU) @  0.05/s (n=1)
            (warning: too few iterations for a reliable count)
     test2:  6 wallclock secs ( 5.39 usr +  0.01 sys =  5.40 CPU) @  0.19/s (n=1)
            (warning: too few iterations for a reliable count)
     test3:  5 wallclock secs ( 5.43 usr +  0.01 sys =  5.44 CPU) @  0.18/s (n=1)
            (warning: too few iterations for a reliable count)
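The closest I've come to something that is both fast and tidy is hiding the cache inside a custom ResultSet class, which load_namespaces picks up automatically. This is only an untested sketch: the artistid_for method and the _artistid_cache key are my own invention, and stashing data inside the resultset hashref is not official DBIx::Class API.

package MyDatabase::ResultSet::Artist;
use strict;
use warnings;
use base qw/DBIx::Class::ResultSet/;

# Map an artist name to its artistid, loading the whole (small)
# table on first use. Safe only because artist rows are never
# deleted or changed; unknown names fall back to the database.
sub artistid_for {
    my ( $self, $name ) = @_;
    my $cache = $self->{_artistid_cache}
        ||= { map { $_->name => $_->artistid } $self->all };
    unless ( exists $cache->{$name} ) {
        $cache->{$name}
            = $self->find_or_create( { name => $name } )->artistid;
    }
    return $cache->{$name};
}

1;

Since the cache lives in one resultset object, the caller has to hold on to that object:

my $rs_artist = $schema->resultset('Artist');    # keep this instance around
my $rs_cd     = $schema->resultset('Cd');
for ( 1 .. $NUM ) {
    $rs_cd->create(
        {
            artist => $rs_artist->artistid_for( $artists[ $_ % @artists ] ),
            name   => "Cd$_",
        }
    );
}

But this still feels like working around DBIx::Class rather than with it.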

Did I miss something simple? Do you have any ideas on how to solve this problem?

Re: DBIx::Class. Cache table in memory
by jethro (Monsignor) on Jan 13, 2010 at 12:37 UTC

    Two comments:

    1) If SQLite isn't caching already, the operating system is. So you are not reading the artists from hard disk but from memory already.

    2) Is this an optimization that makes sense to look for? Even if you eventually want to store one million records, that will take half an hour in total. This half hour (presumably) is a one-time cost, not something you have to redo every day or every month.

    It is a hard lesson to accept (even though I preach it, I don't usually follow this advice myself), but optimization should be done on a finished program, and only after finding the hot spot, i.e. the lines of code where most time is spent during a typical program execution. Usually such a hot spot exists, so that 95% of the time is spent in only 5% of your code. It doesn't make much sense to optimize the other 95% of your code.
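    For finding that 5%, Devel::NYTProf works well; in its simplest form (yourscript.pl is a placeholder):

        perl -d:NYTProf yourscript.pl   # collects a profile into ./nytprof.out
        nytprofhtml                     # turns it into an HTML report in ./nytprof/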

      If SQLite isn't caching already, the operating system is.

      In the real application I'm using Postgres, but yes, you're absolutely right: 10000 additional queries can't take 15 seconds; I should have thought about that, thanks. I ran it under Devel::NYTProf and found that it spends a lot of time inside Class::Accessor::Grouped and various DBIx::Class modules, but not in DBI or DBD::SQLite. So I guess this overhead is caused by creating resultsets for 'Artist'. Pity.
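      If the docs are right that populate() called in void context skips creating row objects and takes the bulk-insert path, then combining it with test2's id hash should avoid most of that overhead. An untested sketch (the test4 name is mine), reusing the variables from my benchmark script above; like test2 it won't create new artists on the fly:

          # test2's id cache plus a void-context populate(): resolve
          # artist names to ids once, then bulk-insert plain column data
          sub test4 {
              resetdb();
              my %artists = map { ( $_->name, $_->artistid ) }
                  $schema->resultset('Artist')->all;
              my @data = map {
                  { artist => $artists{ $artists[ $_ % @artists ] }, name => "Cd$_" }
              } 1 .. $NUM;
              $schema->resultset('Cd')->populate( \@data );   # void context is the fast path
          }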

      Is this an optimization that makes sense to look for?

      Sure. I'm reworking a log-processing application that has serious performance problems, and I hoped to replace the current DB abstraction layer with DBIx::Class, as our custom module is not very flexible and would require a lot of work once I fix the DB schema.

      Concerning your last statement, generally I would agree with you, but in this case I know where the hot spots will be, and I prefer to avoid them by spending some time on benchmarking and experimenting before I'm constrained by the existing program structure. I think it will save me some of that 95% in the end, so I can afford to spend it here :)