mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to get started with the concept of multiple threads. Basically what I'm trying to do is run two system commands simultaneously. I'm not sure if it matters, but the command in question is mysqldump. I have no idea how to do this. Can someone point me in the right direction? I would like to run 2 of the following subroutines simultaneously until all databases have been dumped (please note that this is a snippet):
for my $db (keys %{$db_ds}) {
    if (!@{$db_ds->{$db}} && !$opts{database}) {
        ### Our data structure says there are no native MySQL tables,
        ### so we skip this database.
        next;
    }
    $dbh->do(qq{use $db});

    ### The table list in our data structure is empty. The database
    ### option has been passed, so look them up.
    if (!@{$db_ds->{$db}}) {
        $queries{get_table_names}->execute();
        while (my $rec = $queries{get_table_names}->fetchrow_hashref()) {
            push @{$db_ds->{$db}}, $rec->{Name};
        }
        debug(qq{Got table list for $db\n});
    }

    ### This block makes sure we can actually access the tables.
    my @valid_tables;
    for my $table (@{$db_ds->{$db}}) {
        $queries{verify_table_name}->execute($table);
        if ($queries{verify_table_name}->rows()) {
            my ($rec) = $queries{verify_table_name}->fetchrow_array();
            push @valid_tables, $rec;
        }
        else {
            debug(qq{There was a problem finding database/table $db $table\n});
        }
    }

    ### Make sure we aren't running out of filesystem space.
    chk_space($max_disk_space);

    my $current_date = strftime("%Y-%m-%d", localtime);
    my $current_time = strftime("%H:%M:%S", localtime);
    my $dumpfile = '/backups/table_dumps/' . $db . '.'
                 . $current_date . '.' . $current_time;
    my $valid_table_list = join ' ', @valid_tables;
    my $cmd = 'mysqldump -q --single-transaction --complete-insert -e'
            . ' ' . $db . ' ' . $valid_table_list . $bar_cmd . ' |';

    debug(qq{Beginning dump of database: $db\n});
    debug(qq{Running command: $cmd\n});
    debug(qq{Logging to: $dumpfile\n});

    open MYSQLDUMP, "$cmd"      or die $!;
    open DUMPFILE, ">$dumpfile" or die $!;
    my $total_bytes = 0;
    while (my $bytes_read = read(MYSQLDUMP, my $buffer, 4096)) {
        $total_bytes += $bytes_read;
        print DUMPFILE $buffer;
    }
    close DUMPFILE;
    debug(qq{Finished dumping $db\n\n});
}

Replies are listed 'Best First'.
Re: Using simultaneous threads
by BrowserUk (Patriarch) on May 13, 2008 at 03:16 UTC

    Running two threads performing concurrent accesses to a db via DBI is problematic. It might work for you, or it might not depending upon the design and implementation of the DBD::* driver and the vendor supplied API libraries/DLLs that it runs on top of. If they are not reentrant, or use (for example) the process-id of the calling app to coordinate access, then concurrent access from 2 or more threads of the same process can cause problems. I'm not sure about the reentrancy of the MySQL DBI/DBD/API chain.

    However, if you are prepared to split the function above into two, then you will be able to do most of what you want. The split is to run the first half of the code, which queries the databases and their tables from the DBMS, in the main thread, and once you have a complete set of information, pass it into the threads (via a queue) and have them do the second half: checking filesystem space and actually running the dump. The basic structure of the app would be as shown in Re: Question: Fast way to validate 600K websites. Substitute

    1. Reading from the DB for reading from a file.
    2. Instead of pushing the url to the queue, push the DBname and table names preformatted as a single string for direct inclusion into the command.
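
    The queue-based split described above might be sketched like this (a minimal, untested-against-MySQL outline; the `dump_worker` name and the placeholder job string are my own, not from the linked post):

    ```perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $Q       = Thread::Queue->new;
    my $THREADS = 2;

    # Worker: pull "dbname table1 table2 ..." strings until an undef sentinel.
    sub dump_worker {
        while ( defined( my $job = $Q->dequeue ) ) {
            # chk_space(...) and the actual mysqldump command would go here.
            print "would dump: $job\n";
        }
    }

    my @pool = map { threads->create( \&dump_worker ) } 1 .. $THREADS;

    # The main thread does all the DBI work, then feeds the queue.
    $Q->enqueue('sakila actor film');    # placeholder for a real DB/table list
    $Q->enqueue( (undef) x $THREADS );   # one sentinel per worker
    $_->join for @pool;
    ```

    Keeping all DBI calls in the main thread sidesteps the driver-reentrancy question entirely: the workers only ever shell out to mysqldump.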

    If you only want two threads, only start two, but if you're looking to maximise throughput, then with IO-bound tasks like these it is often worthwhile to have at least two threads per CPU.

    A question: Why are you reading the data in from the dump command just to write it straight out again having done nothing to it?

    It would be quicker and simpler to just have mysqldump write it directly to a file. If it is just to count the bytes, then it would be easier to just query the size of the file once it has completed.
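
    Concretely, that simplification might look like this (hedged sketch with made-up database and path names; mysqldump's `-r`/`--result-file` option writes the dump itself, so the Perl read loop disappears and `-s` recovers the byte count afterwards):

    ```perl
    use strict;
    use warnings;

    # Hypothetical names for illustration only.
    my $db       = 'mydb';
    my $dumpfile = "/tmp/$db.dump";
    my $cmd      = "mysqldump -q --single-transaction $db -r $dumpfile";

    my $rc = system($cmd);
    warn "mysqldump failed (or is not installed here): $rc\n" if $rc != 0;

    # The byte count the original read loop accumulated is just the file size:
    my $total_bytes = -s $dumpfile;
    $total_bytes = 0 unless defined $total_bytes;
    print "dumped $total_bytes bytes\n";
    ```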


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks. The post you referenced gets me started. Right off, I have one question. How do I install signal handlers for each of the threads? A situation may arise where I have to kill everything. Hopefully not the hard way.

      I'm under the impression (possibly falsely) that I need to kill the $pid for mysqldump to ensure that the thread is joinable. I'm not sure how to do that when dealing with multiple threads.

      my @signals = qw(TERM ALRM INT HUP);
      for (@signals) {
          $SIG{$_} = sub {
              $rt->debug(qq{Caught signal: $_.\n});
              kill 9, $pid;
              $thread->join();
              exit;
          };
      }
        Note that calling join on a thread waits for it to complete, so to prevent blocking you'll have to find a way to ensure that your thread terminates. Also, you might have to do some experimentation to determine which thread will get the signal.
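
        One way to avoid blocking in join is to poll instead (this assumes a threads.pm recent enough to provide `is_joinable()`; a trivial worker stands in for the dump job):

        ```perl
        use strict;
        use warnings;
        use threads;

        my $thr = threads->create( sub { return 'done' } );

        # is_joinable() is true once the thread has finished, so join()
        # below is guaranteed not to block.
        until ( $thr->is_joinable ) {
            sleep 1;    # do other work here, e.g. watch for signals
        }
        my $result = $thr->join;
        print "worker returned: $result\n";
        ```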

        I think you are better off using fork in this situation; it's going to be a lot simpler. Just keep track of the pids you create:

        our @pids;
        ...
        for my $db (keys %{$db_ds}) {
            ...
            my $pid = fork();
            if ($pid == 0) {
                exec(...);
            }
            else {
                push(@pids, $pid);
            }
        }
        and then call kill 9, @pids in your signal handler.
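
        A sketch of that signal handler, with a `sleep` child standing in for mysqldump so the cleanup can be demonstrated (the handler body is my own wording of the suggestion above):

        ```perl
        use strict;
        use warnings;

        our @pids;

        # Install the handlers once, before any children exist.
        for my $sig (qw(TERM INT HUP)) {
            $SIG{$sig} = sub {
                kill 'KILL', @pids;   # SIGKILL every child
                1 while wait > 0;     # reap them so nothing lingers as a zombie
                exit 1;
            };
        }

        # A long-running child standing in for mysqldump:
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) { exec 'sleep', '60'; die "exec: $!" }
        push @pids, $pid;

        # Simulate what the handler would do on, say, SIGTERM:
        kill 'KILL', @pids;
        waitpid $pid, 0;
        print "child $pid reaped\n";
        ```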

        Finally, be sure to read perlthrtut -- it mentions some caveats about using signals and threads.

        With recent versions of threads, you can install per-thread signal handlers. Using these in conjunction with a signal handler in your main thread, you can forward process signals to your threads and have each one deal with it appropriately to its context. In this case, killing the current child process.

        This is necessarily untested code, but it should serve to demonstrate the idea:

        #! perl -slw
        use strict;
        use threads;
        use Thread::Queue;

        our $N ||= 2;
        my $Q = new Thread::Queue;

        sub dbDump {
            my $pid;
            ## Add per-thread signal handlers closing over the $pid
            @SIG{ qw[TERM ALRM INT HUP] } = ( sub{ kill 9, $pid } ) x 4;
            my $opts = '-q --single-transaction --complete-insert';
            while( my $dbinfo = $Q->dequeue ) {
                my( $dbname ) = split ' ', $dbinfo, 2;
                my $outfile = $dbname . localtime() . '.dmp';
                ## Assign to the outer $pid so the handlers above can see it
                $pid = open my $cmd,
                    "mysqldump $opts $dbinfo --result-file=$outfile |" or die $!;
                waitpid $pid, 0;  ## wait for this dump before taking the next job
            }
        }

        my @pool = map { async \&dbDump } 1 .. $N;

        ## Add main thread sig handlers to relay process signals to threads
        @SIG{ qw[TERM ALRM INT HUP] } = ( sub{ $_->kill( 'TERM' ) for @pool } ) x 4;

        for my $db (keys %{$db_ds}) {
            if (!@{$db_ds->{$db}} && !$opts{database}) {
                ### Our data structure says there are no native MySQL tables,
                ### so we skip this database.
                next;
            }
            $dbh->do(qq{use $db});

            ### The table list in our data structure is empty. The database
            ### option has been passed, so look them up.
            if (!@{$db_ds->{$db}}) {
                $queries{get_table_names}->execute();
                while (my $rec = $queries{get_table_names}->fetchrow_hashref()) {
                    push @{$db_ds->{$db}}, $rec->{Name};
                }
                debug(qq{Got table list for $db\n});
            }

            ### This block makes sure we can actually access the tables.
            my @valid_tables;
            for my $table (@{$db_ds->{$db}}) {
                $queries{verify_table_name}->execute($table);
                if ($queries{verify_table_name}->rows()) {
                    my ($rec) = $queries{verify_table_name}->fetchrow_array();
                    push @valid_tables, $rec;
                }
                else {
                    debug(qq{There was a problem finding database/table $db $table\n});
                }
            }

            $Q->enqueue( join ' ', $db, @valid_tables );
        }
        $Q->enqueue( (undef) x $N );
        $_->join for @pool;

Re: Using simultaneous threads
by pc88mxer (Vicar) on May 13, 2008 at 03:10 UTC
    Here's a simple solution that you can use if:
    1. your perl program doesn't need to process the output of mysqldump as it is being generated, and
    2. you don't need to wait for the mysqldump commands to finish
    If both of these are true, then structure your code like this:
    for my $db (keys %{$db_ds}) {
        ...figure out which table to dump, etc...
        system("$cmd > $dumpfile &");
    }
    The only difference will be that $cmd won't end with a pipe |. Moreover, since you are running mysqldump, you can also use the -r option to direct output to the dump file instead of redirecting standard output. Indeed, this is better since then you can use the safer, multi-argument version of system.
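
    For instance (illustrative names only), the list form of system() bypasses the shell entirely, so `$db` and `$dumpfile` never need quoting:

    ```perl
    use strict;
    use warnings;

    # Hypothetical database and path, for illustration.
    my $db       = 'mydb';
    my $dumpfile = "/tmp/$db.dump";

    # Each argument is passed to mysqldump directly -- no shell involved.
    my @cmd = ( 'mysqldump', '-q', '--single-transaction',
                '--complete-insert', '-e', '-r', $dumpfile, $db );

    my $rc = system @cmd;
    warn "mysqldump exited with status $rc\n" if $rc != 0;
    ```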

    If you need to wait for the mysqldump commands to finish, then it is just a little bit trickier:

    for my $db (keys %{$db_ds}) {
        ...figure out which table to dump, etc...
        my $pid = fork;
        die "unable to fork: $!\n" unless defined $pid;
        if ($pid == 0) {
            exec("$cmd > $dumpfile");
            die "unable to exec for table $table: $!\n";
        }
    }
    # now wait for all the children to finish
    1 while (wait > 0);
Re: Using simultaneous threads
by jethro (Monsignor) on May 13, 2008 at 03:25 UTC
    Is there any reason why you shuffle the output of the mysqldump through perl? You don't seem to do anything else with the data. Why not let a shell pipe do the work?

    system("$cmd > $dumpfile")==0 or die ...;
    Now, to start two of them simultaneously, you could do a fork (which is clean and easy). Small caveat: You can easily check whether the child process finished, but to get a status/success/failure message, you would need a file or some IPC. But don't worry, some helpful monk will probably tell you about a module that already does most of this.
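
    Actually, for a plain exit status no extra IPC is needed: the parent gets each child's status through $? when it reaps it. A runnable sketch (the `exit N` shell commands are stand-ins for `"$cmd > $dumpfile"`):

    ```perl
    use strict;
    use warnings;

    # Two children run in parallel; the parent maps pid back to job.
    my @jobs = ( 'exit 0', 'exit 3' );
    my ( %pid2job, %status );

    for my $job (@jobs) {
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ( $pid == 0 ) { exec 'sh', '-c', $job; die "exec: $!" }
        $pid2job{$pid} = $job;
    }

    while ( ( my $pid = wait ) > 0 ) {
        $status{ $pid2job{$pid} } = $? >> 8;   # high byte is the exit code
    }
    print "'$_' exited with $status{$_}\n" for @jobs;
    ```

    Anything richer than an exit code (error text, progress) does still need a file or a pipe, as noted above.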

    Just for completeness sake: There is also the possibility of letting the shell fork:

    system("$cmd > $dumpfile &")==0 or die ...;
    The '&' makes sure the system call returns immediately, so that you can start the second dump directly afterwards. To find out whether the child finished, you could check the output of ps, but that is a dirty and unsafe hack in my view.

      Small caveat: You can easily check whether the child process finished, but to get a status/success/failure message, you would need a file or some IPC. But don't worry, some helpful monk will probably tell you about a module that already does most of this.

      I remember using Proc::SafeExec some weeks ago and finding its interface quite intuitive. I was solving a simpler problem, but it will probably work here just as well.

Re: Using simultaneous threads
by GrandFather (Saint) on May 13, 2008 at 03:37 UTC

    What OS? Does the code run in a GUI?


    Perl is environmentally friendly - it saves trees
Re: Using simultaneous threads
by mhearse (Chaplain) on May 13, 2008 at 10:24 UTC
    Just wanted to address some of the questions. The OS is a current version of Red Hat. X11 is not installed, so no GUI. For a program like this I usually run it from cron on a detached screen. I'm reading the output because I had code which was calculating and printing the bytes/second read and the wall time the dump had been running. I decided to use a freeware progress bar instead, so I can probably remove that block. The machine in question has 8 CPUs and plenty of IO bandwidth, so I believe dumping two different databases at once would speed up the backup process (which involves dumping around 3 GB of data).