Re: Multithreaded Script CPU Usage
by Illuminatus (Curate) on Aug 25, 2008 at 15:36 UTC
|
I think I need a little more info about your program structure. It looks like new threads invoke 'do_handler', but it is not clear how this, in turn, invokes and manages FileList. I would not worry so much about Perl's memory usage -- 125M is not all that much. Much more troubling is the 75% CPU. If FileList is supposed to be doing most of the work, and the Perl part is simply 'glue' taking output from FileList and inserting it into MySQL, then it should not be doing much. If it is not too large, could you post do_handler and what you are doing to put the data into MySQL?
|
Illuminatus, yes, that is in fact what I am mostly worried about. In fact, I think the memory usage here is due to one of the threads doing a regex replace, adding backslashes to prepare a file to be imported into MySQL.
Edit: Just to clarify, I know that CPU use should also be high when doing a regex on a large file with many matches; however, the CPU usage of the perl.exe process is always high, even when no regex operations are being performed.
My do_handler function just routes the work to a function in the separate Worker module (I couldn't figure out the syntax to call it directly).
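(I assume the direct form would look something like the sketch below, with the worker sub passed as a code ref in the pool's 'do' option, but I haven't verified it -- the other pool options and variables are just placeholders.)
use Thread::Pool::Simple;
use File::FileListerWorker;

# Untested sketch: hand the worker sub to the pool directly as a code ref.
my $pool = Thread::Pool::Simple->new(
    min => 1,
    max => $total_threads,
    do  => [ \&File::FileListerWorker::main_run ],
);
$pool->add( $jobid, $scanpaths_line, $fl_path, $arg_tbl_name );    # queue one scan
$pool->join();                                                     # wait for everything to finish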
Here's the relevant code from the Worker module:
package File::FileListerWorker;

sub main_run {
    # arguments: 1) job id from Pooler  2) path to scan  3) path to FileList.exe  4) table name
    # returns:   1
    #### - START: MAIN EXECUTABLE CODEBLOCK - ####
    my $jobid          = shift;    # read scanpaths from the ARGs
    my $scanpaths_line = shift;    # get path to scan
    my $fl_path        = shift;    # define path to FileList.exe
    my $arg_tbl_name   = shift;    # table to load data into

    my $dbh             = &mysql_conn();                               # connect to mysql
    my $mysql_tablename = &mysql_create_table( $arg_tbl_name, $dbh );  # create table (IF NOT EXISTS)

    my @ta = localtime(time);                                          # get time
    my $time_string =
        $ta[4] . $ta[3] . ( $ta[5] + 1900 ) . "_" . $ta[2] . $ta[1] . $ta[0];    # time string
    my $temp_filename =                                                # make unique filename
        "\\\\directoryToPlaceTempFileIn\\" . $jobid . "_" . $time_string . ".csv";

    &do_scan( $fl_path, $scanpaths_line, $temp_filename );             # call to sub that does the scan
    my $mysql_temp_filename = &fix_file_4mysql($temp_filename);        # add backslashes
    &mysql_load_data( $mysql_temp_filename, $mysql_tablename, $dbh );  # mysql import file
    &cleanup($temp_filename);                                          # delete temp file to preserve space
    &mysql_disc($dbh);                                                 # disconnect mysql
    #### - END: MAIN EXECUTABLE CODEBLOCK - ####
    return 1;
}

sub do_scan {
    # arguments: 1) location of FileList.exe  2) path to scan  3) output file path
    # returns: none
    my $fl_path = shift;
    my $pth     = shift;
    chomp($pth);
    my $temp_filename = shift;
    system(qq{$fl_path $pth >$temp_filename});
}
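(One thing I know I should tighten up at some point: the interpolated system call in do_scan will break if the scan path or the temp filename ever contains spaces. A quoted variant would look something like this untested sketch:)
# Hypothetical variant of the do_scan system call, with the executable,
# scan path and output file quoted for cmd.exe in case they contain spaces.
system(qq{"$fl_path" "$pth" > "$temp_filename"});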
|
|
Just an idea... what if everything is perfectly right that way?
I am not sure how costly fix_file_4mysql() and mysql_load_data() are, but if the do_scan() execution is much faster than loading the DB, perl.exe is still digesting the output of previous runs of FileList.exe.
So you might observe only four FileList.exe jobs currently running while the output of several dozen previous runs is still being processed.
How big is max => $total_threads?
The number of $temp_filename's currently existing might give a rough estimate of how many workers are currently busy.
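A quick way to check (sketch only; it assumes the temp files all land in the same placeholder directory used in main_run):
# Rough estimate of busy workers: count the temp CSVs currently on disk.
opendir my $dh, '\\\\directoryToPlaceTempFileIn' or die "opendir: $!";
my @pending = grep { /\.csv$/i } readdir $dh;
closedir $dh;
printf "%d temp files currently on disk\n", scalar @pending;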
|
Re: Multithreaded Script CPU Usage
by renodino (Curate) on Aug 25, 2008 at 15:01 UTC
|
|
|
Updated OP with the version info.
As for your #3, I think the answer is no, but I can't seem to find where to set that in Thread::Pool::Simple.pm.
Would changing 'use threads;' to
use threads ( 'stack_size' => 64*4096 );
in Thread::Pool::Simple.pm be the way to set the stack size?
If so, what's a 'small' value?
|
|
Your "use threads" should do the trick, or any of the alternatives described in the "THREAD STACK SIZE" section of the threads POD...just do it as early as possible. I'd suggest that 64 * 4096 is probably a lot more than you need, as Perl has its own notion of a runtime stack, and in reality doesn't use that much process/thread stack (unless you're using very large/complex regexes, and even that issue goes away in 5.10). Note that once stack size is set, its set for all the threads/modules, so you don't need to do anything in Thread::Pool.
Also keep in mind that each thread has its own Perl interpreter, so memory size may be an unavoidable issue, especially if the threads are spawned from another thread that already has a lot of context (due to the interpreter cloning).
Update:
I just realized your issue has 2 parts: memory and CPU. Note that my comments thus far have been entirely about the memory issue. However, given the size of the Perl process, might the CPU issue be related to paging/swapping?
Perl Contrarian & SQL fanboy
|
|
Re: Multithreaded Script CPU Usage
by BrowserUk (Patriarch) on Aug 26, 2008 at 00:44 UTC
|
I suspect that your CPU usage problem is related to this failure when running the Thread::Pool::Simple test suite:
c:\Perl\packages\Thread-Pool-Simple-0.23>nmake test
Microsoft (R) Program Maintenance Utility Version 7.00.9466
Copyright (C) Microsoft Corporation. All rights reserved.
C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t\1.t t\2.t t\3.t t\4.t
t\1....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
# Looks like your test died just after 1.
t\1....dubious
Test returned status 255 (wstat 65280, 0xff00)
after all the subtests completed successfully
t\2....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
# Looks like your test died just after 1.
t\2....dubious
Test returned status 255 (wstat 65280, 0xff00)
after all the subtests completed successfully
t\3....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
# Looks like your test died just after 1.
t\3....dubious
Test returned status 255 (wstat 65280, 0xff00)
after all the subtests completed successfully
t\4....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
# Looks like your test died just after 1.
t\4....dubious
Test returned status 255 (wstat 65280, 0xff00)
after all the subtests completed successfully
Failed Test Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t\1.t        255 65280     1    0   0.00%  ??
t\2.t        255 65280     1    0   0.00%  ??
t\3.t        255 65280     1    0   0.00%  ??
t\4.t        255 65280     1    0   0.00%  ??
Failed 4/4 test scripts, 0.00% okay. 0/4 subtests failed, 100.00% okay.
NMAKE : fatal error U1077: 'C:\Perl\bin\perl.exe' : return code '0x2'
Stop.
This failure can be traced to this line from the T::P::S code:
my $self = &share({});
which makes the T::P::S object a shared hash. The problem is that, a few lines later, it creates an instance of Thread::Semaphore and stores the returned handle into that hash:
$self->{shutdown_lock} = Thread::Semaphore->new();
Because the hash is shared, the blessedness of the Thread::Semaphore object is lost. I.e., it stops being a blessed scalar and becomes a plain ordinary scalar. And when the module attempts to call a method on that scalar:
$self->{shutdown_lock}->down(); ## line 39
it fails. (Note: the line number below is different because I added some debug output.)
???SCALAR(0x19a690c)
Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 43.
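You can reproduce the effect outside T::P::S with a few lines (a sketch; whether the class name survives the assignment depends on which threads/threads::shared versions you have installed):
use threads;
use threads::shared;
use Thread::Semaphore;

my %h : shared;
$h{shutdown_lock} = Thread::Semaphore->new();
# Prints 'Thread::Semaphore' when blessing is preserved;
# prints 'SCALAR' on the broken combination described above.
print ref( $h{shutdown_lock} ), "\n";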
This is a newly introduced bug that didn't exist when I tried T::P::Simple a few weeks ago. Now, you can "fix" this test failure by upgrading to the latest CPAN versions of threads and threads::shared.
Unfortunately, with that upgrade, as with so many of the releases ever since they became dual-lived, comes a different set of problems. T::P::Simple now completes all its tests, but leaks memory like a sieve:
c:\Perl\packages\Thread-Pool-Simple-0.23>nmake test
Microsoft (R) Program Maintenance Utility Version 7.00.9466
Copyright (C) Microsoft Corporation. All rights reserved.
C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t\1.t t\2.t t\3.t t\4.t
t\1....ok 1/0Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
t\1....ok
t\2....ok 901/0Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
t\2....ok
t\3....ok 901/0Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
Scalars leaked: 1
t\3....ok
t\4....ok 2/0Scalars leaked: 1
t\4....ok
All tests successful.
Files=4, Tests=1805, 20 wallclock secs ( 0.00 cusr + 0.00 csys = 0.00 CPU)
Which wouldn't be so bad if T::P::Simple were the only module affected, but with the latest versions of threads & threads::shared, programs that ran completely clean with 5.8.6 and its distributed versions of threads/threads::shared now leak like a sieve as well.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|
|
I'm not sure what's different in our test environments, but I seem to be having no problems in any of the tests for T::P::S.
This is what I get:
D:\Perl\MSWin32-x86-multi-thread-5.8\Thread-Pool-Simple-0.23>nmake test
Microsoft (R) Program Maintenance Utility Version 1.50
Copyright (c) Microsoft Corp 1988-94. All rights reserved.
D:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
t/1....ok
t/2....ok
t/3....ok
t/4....ok
All tests successful.
Files=4, Tests=1805, 16 wallclock secs ( 0.00 cusr + 0.00 csys = 0.00 CPU)
|
Re: Multithreaded Script CPU Usage
by NiJo (Friar) on Aug 26, 2008 at 19:53 UTC
|
I see that your approach comes from automating interactive tools. I'm sure that Perl can do this a lot simpler, better, and faster.
A disk (with one disk head) is basically a single-threaded device. Trying to keep 10 threads happy results in a lot of seeks and very slow performance. Sometimes it is OK to have 2 threads: one waiting for a disk seek and the other parsing a different file.
I'd redesign the application to be pure Perl and single-threaded. If you need to scan several disks, I'd start one scanner per disk and feed directly into the database using DBI.
The following "code" is just to show the Perl way to do it with prebuilt modules. It is mostly copied from perldoc, won't run out of the box, and has no error handling. I could not find a good way to get author information from MS Office files; Win32::OLE and some digging into the MS Explorer OLE interfaces should help.
use strict;
use warnings;
use File::Find;
use DBI;

# placeholder connection settings and search roots -- adjust for your environment
my ( $database, $hostname, $port, $user, $password ) = ( 'files', 'dbhost', 3306, 'user', 'secret' );
my @directories_to_search = ('C:/');

my $dsn = "DBI:mysql:database=$database;host=$hostname;port=$port";
my $dbh = DBI->connect( $dsn, $user, $password );
# hypothetical table/columns; store path, size and mtime per file
my $sth = $dbh->prepare('INSERT INTO files(name, size, mtime) VALUES (?,?,?)');

sub wanted {
    my @stats = stat $_;
    $sth->execute( $File::Find::name, $stats[7], $stats[9] );    # path, size, mtime
}

find( \&wanted, @directories_to_search );
P.S.: Basically you have reinvented the Unix 'locate' tool and ignored Microsoft's indexing files, which would allow a faster search.
|
I initially started off exactly the way you're saying, but File::Find and stat() were just too slow. FileList is a lot faster (no clue why). I need the entire scan to finish in one weekend, and there can be up to 80 million files to scan. That's also why I went for multithreading, etc. I'm trying to squeeze any juice I can out of the resources I have to get this script to run in its allotted window.
As for your one scanner per disk suggestion, I don't think starting 300 scanners is the answer here. :)
I know it's ugly, and part of me weeps somewhere deep inside while I write this stuff, but requirements are making me do it. :(
|
|
It took me some days off and some rethinking to guess at your real bottleneck. Did you forget to tell us about the low-bandwidth remote Windows network shares you are scanning? 300 local disks is too much to believe. In that case, collect the data locally into files and transfer them to the server. The parallel scanning should take about a minute for file systems just above desktop size.
But after solving all the scanning performance issues, you are finally bound by the insert limit of your database. 80 million records is a huge number. 1000 inserts per second would be my estimate for MySQL on a desktop; that works out to about 22 hours in your case (80,000,000 rows / 1,000 rows per second = 80,000 seconds). But do your own benchmark for multiple and faster CPUs and disks.
Going back to the drawing board, you need to eliminate the overhead of the database. We don't know anything about the use cases for your database output. If it is just finding a file once per week, I'd cat the local output files together and grep the 800 MB text file at disk speed in less than 10 seconds. In more advanced cases, Perl regexes should have lower overhead than SQL for queries, and there is no insert-limit bottleneck.
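For the grep case, something like this would do (sketch only; the 'scan_*.csv' glob and the search term are placeholders):
# Search the raw FileList output directly instead of querying MySQL.
my $pattern = qr/quarterly_report/i;
for my $csv ( glob 'scan_*.csv' ) {
    open my $fh, '<', $csv or die "open $csv: $!";
    while ( my $line = <$fh> ) {
        print "$csv: $line" if $line =~ $pattern;
    }
    close $fh;
}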
|
|
Re: Multithreaded Script CPU Usage
by Zenshai (Sexton) on Aug 26, 2008 at 13:52 UTC
|
Thanks, everyone.
I think what I am going to do is split this script in 2: the threaded script will only launch the scans, and I'll have another script on another box monitor the directory where the temp files are placed, process them, import them, and delete them.
Hopefully the memory leak BrowserUk describes won't get bad enough to kill my Perl process. Otherwise, expect another long-winded post from me about forking or some such. :)
|
Just a few comments:
1) You should not need 2 scripts. If your initial script did not create threads, but instead began scans using 'system' and put each process in the background, then the same script could monitor the output files and start new FileList executions as previous ones end (see the sketch after this list).
2) There is still the issue of what the 2 MySQL functions really do. You might post the 'fix-the-file' function and the 'load data' function code if they are not too large. Either of these could also be taking lots of CPU. Does your mysql_load_data create bulk inserts or individual inserts? Bulk inserts are much faster, both for the client and the server.
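Something along these lines is what I mean for 1) -- a rough, untested sketch only. It relies on the Win32 behaviour where system(1, $cmd) starts the command and returns immediately with its PID; the paths, the job list, and load_into_mysql() are placeholders:
use strict;
use warnings;
use POSIX ':sys_wait_h';    # WNOHANG

my $fl_path  = 'C:\\tools\\FileList.exe';                        # placeholder
my @to_scan  = ( '\\\\server\\share1', '\\\\server\\share2' );   # placeholder scan paths
my $max_jobs = 4;
my $n        = 0;
my %running;                                                     # pid => output file

while ( @to_scan or %running ) {
    while ( @to_scan and keys %running < $max_jobs ) {           # start scans while there is room
        my $path = shift @to_scan;
        my $out  = 'scan_' . $n++ . '.csv';
        my $pid  = system( 1, qq{"$fl_path" "$path" > "$out"} ); # async spawn on Win32
        $running{$pid} = $out;
    }
    for my $pid ( keys %running ) {                              # reap finished scans
        if ( waitpid( $pid, WNOHANG ) > 0 ) {
            load_into_mysql( delete $running{$pid} );            # hypothetical loader sub
        }
    }
    sleep 5;
}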
|
|
From what I can see, the functions you listed should be fine. My point about the single script went back to the initial problem: the main Perl script was taking 75% CPU. As was pointed out, this was likely caused by the threads module itself. If you modified your script as I suggested, the Perl script should end up using very little CPU, and everything could therefore run fine together on a single system. Of course, your mileage may vary...
|
|
|
The reason I am going to try splitting the script across 2 machines is that I was thinking the same thing as Perlbotics above. There is just too much load on the resources if I try to do everything at the same time. By splitting the global task into 2 parts I can either run both concurrently on 2 machines or in sequence on 1, depending on my time constraints.
But just in case, here is the rest of the code you were asking for. Am I doing something horribly wrong there that is causing needless load on the CPU?
The load-data function just executes this SQL:
my $sql = qq{ LOAD DATA LOCAL INFILE \"$mysql_tmp_filename\"
    INTO TABLE `$tblname`
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '\"'
    ESCAPED BY '\\\\'
    LINES TERMINATED BY '\\r\\n'
    IGNORE 3 LINES
    (file_name, \@file_p, file_size, \@file_la, \@file_lc, \@file_c, file_extension)
    SET file_last_access = STR_TO_DATE(\@file_la, '%c/%e/%Y %l:%i %p'),
        file_path        = \@file_p,
        file_share_name  = SUBSTRING_INDEX(SUBSTRING_INDEX(\@file_p, '\\\\', 3), '\\\\', -1),
        file_last_change = STR_TO_DATE(\@file_lc, '%c/%e/%Y %l:%i %p'),
        file_creation    = STR_TO_DATE(\@file_c, '%c/%e/%Y %l:%i %p')
};
my $sth = $dbh->prepare($sql);
$sth->execute();
and here is fix_file_4mysql:
sub fix_file_4mysql {
    # arguments: 1) path to file
    # returns:   1) the filename fixed for use in a mysql query

    # fix the contents of the temp file for use in mysql
    my $tmp_filename = shift;
    open( IN, "+<$tmp_filename" );
    my @file = <IN>;
    seek IN, 0, 0;
    foreach my $file (@file) {
        $file =~ s/\\/\\\\/g;
        print IN $file;
    }
    close IN;

    # fix temp_filename for use in mysql
    my $mysql_tmp_filename = $tmp_filename;
    $mysql_tmp_filename =~ s/\\/\\\\/g;
    return $mysql_tmp_filename;
}
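(Since this slurps the whole temp file into @file, it's almost certainly where the big memory numbers come from. A line-by-line variant -- untested sketch, not what I'm running -- would avoid holding the whole file in memory:)
sub fix_file_4mysql_streaming {
    my $tmp_filename = shift;
    open my $in,  '<', $tmp_filename         or die "open $tmp_filename: $!";
    open my $out, '>', "$tmp_filename.fixed" or die "open $tmp_filename.fixed: $!";
    while ( my $line = <$in> ) {
        $line =~ s/\\/\\\\/g;                # same backslash doubling as above
        print {$out} $line;
    }
    close $in;
    close $out;
    unlink $tmp_filename                        or die "unlink $tmp_filename: $!";
    rename "$tmp_filename.fixed", $tmp_filename or die "rename: $!";

    ( my $mysql_tmp_filename = $tmp_filename ) =~ s/\\/\\\\/g;    # escape the name for the SQL
    return $mysql_tmp_filename;
}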
|
|
Well, there goes that theory.
I remade the script as I specified above, and while it's working exactly as intended, the original problem still remains: Perl CPU usage is still too high when it should be idle.
I've found that I can give the FileList processes some more CPU room by setting the Perl process's priority to Below Normal or Low, but that's not really solving the problem. It's still needlessly using resources that could be better spent doing other things, or even just being idle.
I'm going to think about some other way to do this, without threads.
|