in reply to Re: threads::shared seems to kill performance
in thread threads::shared seems to kill performance

Hello BrowserUK, threading master of masters from what I hear! Thank you for your response.

I'm aware I'm probably breaking several laws and killing small kittens in the process by allocating a hash this big.

Originally the data comes from a SQLite database. There's one huge table that's keyed by 2 levels of parameters - say: owner, date, some_data (with <owner,date> being unique and the set of owners relatively small) - and by loading this into those hashes, I'm trying to introduce some structure to the data so that I can later access it from my program in a way I can easily understand and work with ($data{user}{date}[]).
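For what it's worth, the two-level structure described above can be sketched in miniature. This is a hypothetical illustration, written in Python only because it gives a compact runnable sketch (the thread itself is about Perl, where the equivalent would be $data{$owner}{$date} = [...]); the rows are made up stand-ins for the SELECT results:

```python
from collections import defaultdict

# Made-up rows as they might come back from
# "SELECT owner, date, ... FROM big_table", with <owner,date> unique.
rows = [
    ("alice", "2013-07-01", 10, 20),
    ("bob",   "2013-07-02", 30, 40),
]

# Two-level structure: data[owner][date] -> list of the remaining fields,
# mirroring the Perl $data{user}{date}[] shape.
data = defaultdict(lambda: defaultdict(list))
for owner, date, *some_data in rows:
    data[owner][date] = list(some_data)

print(data["alice"]["2013-07-01"])   # [10, 20]
print(data["bob"]["2013-07-02"])     # [30, 40]
```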

Strangely, loading the data from the database doesn't have as big an impact on performance as the sharing does. In my real-life tests - where I in fact do initialize the hash and populate it in one pass as you suggest - loading from the DB and populating the hash (with a significantly reduced set of data) took about 2s. Once I added the sharing (in a way similar to my example above), it took about 26s.


Re^3: threads::shared seems to kill performance
by BrowserUk (Patriarch) on Jul 18, 2013 at 06:04 UTC
    Originally the data comes from a SQLite database....

    Then I very strongly advise against taking the data out of the db and putting it into a hash.

    Not only will doing so take considerable time and substantial space; although for read-only use you won't need user-level locking, there is no way to turn off the locking Perl uses to protect its internals, and that will bring your application to a crawl.

    Instead, share the db handle and create statement handles for your queries. Whilst I haven't done this personally (yet), according to this, the default 'serialized' mode of operation means that you don't even need to do user locking as the DB will take care of that for you.

    If you create/clone your DB as an in-memory DB after you've spawned your threads, then you will avoid duplicating that DB, and the performance should be on a par with, and potentially faster than, a shared hash.
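The idea of several independent connections all seeing one in-memory database can be demonstrated outside Perl too. Here is a small sketch using Python's bundled sqlite3 module (chosen only because it ships a verifiable SQLite binding; the database name "memdb_demo" and the table are made up). Two separate connections opened with cache=shared see the same in-memory data:

```python
import sqlite3

# SQLite URI for a named in-memory DB with a shared cache; every
# connection in this process that uses the same URI sees the same DB.
URI = "file:memdb_demo?mode=memory&cache=shared"

writer = sqlite3.connect(URI, uri=True)
writer.execute("CREATE TABLE t (owner TEXT, day TEXT, val INTEGER)")
writer.execute("INSERT INTO t VALUES ('alice', '2013-07-18', 42)")
writer.commit()

# A second, independent connection, as each thread would open its own.
reader = sqlite3.connect(URI, uri=True)
row = reader.execute("SELECT val FROM t WHERE owner = 'alice'").fetchone()
print(row[0])   # 42

reader.close()
writer.close()
```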

    When I get time, which may not be soon, I intend to test this scenario for myself as I think it might be a good solution to sharing large amounts of data between threads. Something Perl definitely needs.

    It may even be possible to wrap it in a tied hash, to simplify the programmer's view of the DB without incurring the high overheads of threads::shared. (That's very speculative!)

    In any case, as your data is already in a DB, don't take it out and put it into shared hashes; that just doesn't make sense. Just load it into memory after your threads are spawned, and then set the dbh into a shared variable where the threads can get access to it.

    At least, that is what I would (and will) try.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      BrowserUk:

      Since I can only ++ a post once, you'll have to settle for a few more virtuals: ++ ++ ++

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      I actually originally started by only loading data from the DB in smaller chunks - the way I'm using that DB allows me to split the whole table into about 5000 smaller ones. So instead of doing one SELECT and creating one huge hash (which always felt a bit wrong to me), I did 5000 smaller SELECTs and worked through those in sequence.

      Performance of this, however, was terrible - roughly speaking, about 20 seconds for loading that 1 big SELECT from the DB vs. 220 seconds for 5000 smaller SELECTs, all via one DB handle. And this was on an SSD.

      As for the in-memory DB, this is the first time I've heard about it and admittedly it looks very promising. Thanks for the hint - let me give that a try...

      @BrowserUK, what exactly do you mean by this?

          Instead, share the db handle and create statement handles for your queries.

      I'm using DBI, and it doesn't seem to be very thread-friendly. Also, DBD::SQLite doesn't mention threading anywhere in its documentation.

      When I try to pass the db handle to my thread as a parameter, I get this:

          $g_dbh = DBI->connect("dbi:SQLite:dbname=:memory:");
          threads->create(\&my_thread, $g_dbh);

      Thread 1 terminated abnormally: DBD::SQLite::db prepare failed: handle 2 is owned by thread 7f7f64003200 not current thread 7f7f6455fc00 (handles can't be shared between threads and your driver may need a CLONE method added)

      If I try to share the db handle, I get:

          our $g_dbh :shared;
          $g_dbh = DBI->connect("dbi:SQLite:dbname=:memory:");

      Invalid value for shared scalar

        Okay. I found an obscure detail (the section entitled "In-memory Databases And Shared Cache") in the SQLite documentation that allows me to work around DBI's brokenness.

        This demonstrates it:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use threads::Q;
        use Time::HiRes qw[ time ];
        use DBI;

        use constant {
            CONNECT => 'dbi:SQLite:dbname=file:memdb?mode=memory&cache=shared',
            CREATE  => 'create table if not exists DB ( ID integer(8),'
                     . join( ',', map "F$_ text(15)", 1..9 ) . ')',
            INSERT  => 'insert into DB ( ID , ' . join( ',', map "F$_", 1..9 )
                     . ') values (' . '?,' x 9 . '?)',
            INDEX   => 'create index if not exists I1 on DB ( ID )',
            QUERY   => 'select * from DB where ID = ?',
        };

        sub thread {
            my $tid = threads->tid;
            my( $Q ) = @_;
            $Q->dq;    ## Wait for DB
            my $dbh = DBI->connect( CONNECT, '', '' ) or die DBI::errstr;
            my $sth = $dbh->prepare( QUERY ) or die DBI->errstr;
            while( my $id = $Q->dq ) {
                $sth->execute( $id ) or die DBI::errstr;
                my $r = $sth->fetch or warn( "No data for $id" ) and next;
                ## do something with record.
                printf "[$tid] %5u %s %s %s %s %s %s %s %s %s\n", @{ $r };
            }
            $sth->finish;
            $dbh->disconnect;
        }

        my @chars = ( 'a'..'z' );
        sub dummy {
            my $n = shift;
            join '', @chars[ map int( rand @chars ), 1 .. $n ];
        }

        our $T //= 4;
        our $N //= 100;

        my $dbh = DBI->connect( CONNECT, '', '', { AutoCommit => 0 } ) or die DBI::errstr;
        $dbh->do( 'PRAGMA synchronous = off' );
        $dbh->do( 'PRAGMA cache_size = 800000' );
        $dbh->do( CREATE ) or die DBI::errstr;

        my $ins = $dbh->prepare( INSERT ) or die DBI->errstr;
        for my $n ( 1 .. $N ) {
            my @fields = ( $n, map dummy( 15 ), 1 .. 9 );
            $ins->execute( @fields ) or die $ins->errstr;
            $n % 100 or $dbh->commit;
        }
        $ins->finish;
        $dbh->commit;

        #$dbh->do( INDEX ) or die DBI::errstr;

        my $sth = $dbh->prepare( QUERY ) or die DBI->errstr;
        for my $id ( 1 .. 5 ) {
            $sth->execute( $id ) or die DBI::errstr;
            my $r = $sth->fetch() or warn( "No data for $id" ) and next;
            ## do something with record.
            printf "[main] %5u %s %s %s %s %s %s %s %s %s\n", @{ $r };
        }
        $sth->finish;

        print $dbh->selectrow_array( 'SELECT count(*) from DB' );

        my $Q = threads::Q->new( 10 );
        my @threads = map threads->create( \&thread, $Q, shared_clone( $dbh ) ), 1 .. $T;

        $Q->nq( ('Wakey wakey') x $T );
        $Q->nq( $_ ) for 1 .. 20; ## $N;
        $Q->nq( (undef) x $T );
        $_->join for @threads;

        $dbh->disconnect;

        unlink 'file'; ## without this, the memory db will persist in an apparently empty file????
                       ## That might be turned into an advantage????

        There is, however, some weirdness associated with this. (See the comments on the last two lines above.)

        Basically, the trick to making multiple DBI handles (in different threads) refer to the same in-memory database is using a dbname of the form:

        file:memdb?mode=memory&cache=shared

        The weirdness is that although this is a "memory db", a file (in this example called 'file', in the local directory) is created. It is always 0 bytes, and running wc on it confirms it has no content.

        But, run the above program a second time and it will add to the number of records in the table. IT WILL DOUBLE IN SIZE!

        But if you delete that empty file, the records from the first run disappear!

        Not sure what to do about that other than just delete the file.
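        For what it's worth, the lifetime side of this can be checked against SQLite itself. The sketch below uses Python's bundled sqlite3 binding purely as a convenient test bed (the name "memdb_lifetime" is made up): a cache=shared in-memory DB survives only while at least one connection holds it open, and once the last connection closes, a fresh connection gets an empty database.

```python
import sqlite3

# Made-up URI for this sketch; same shared-cache in-memory form as above.
URI = "file:memdb_lifetime?mode=memory&cache=shared"

first = sqlite3.connect(URI, uri=True)
first.execute("CREATE TABLE t (n INTEGER)")
first.execute("INSERT INTO t VALUES (1)")
first.commit()

# While 'first' is still open, a second connection sees the same table.
second = sqlite3.connect(URI, uri=True)
count_while_open = second.execute("SELECT count(*) FROM t").fetchone()[0]
print(count_while_open)          # 1

# Close every connection: the shared in-memory DB is discarded.
second.close()
first.close()

fresh = sqlite3.connect(URI, uri=True)
try:
    fresh.execute("SELECT count(*) FROM t")
    table_survived = True
except sqlite3.OperationalError:   # "no such table: t"
    table_survived = False
fresh.close()
print(table_survived)            # False
```

        So whatever keeps records alive between runs in the DBD::SQLite case seems tied to that mysterious on-disk file, not to SQLite's own in-memory semantics.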



        Sorry. It looks like we'll have to try to locate some alternative to DBI for using SQLite from Perl.

        DBI is broken beyond understanding and the documentation is so crap that working out how to fix it is beyond my abilities.

