Module uses loads of CPU.. or is it me

by hsinclai (Deacon)
on Dec 10, 2007 at 02:20 UTC

hsinclai has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm using Net::Amazon::S3 to get a listing of a bunch of files (under 30K files right now, but soon to be around 150K). I only want to grab the number of files/dirs and add up the bytes these take up..

It works fine but takes several minutes to run, during which time it appears to take up about 46MB of RAM (I have 4GB on this box). But the CPU gets slammed at 100% the whole time (actually, one core gets pegged).

Here's the only loop I have, the one that adds up the bytes. The module builds an array as a value inside a hash (I believe), and it also uses LWP and XML modules, among others, behind the scenes (I believe):

    my $bytes_used = 0;
    foreach my $key ( @{ $response->{keys} } ) {
        $bytes_used += $key->{size};
    }
    # scalar @{...} gives the key count ($#{...} would be the last index)
    my $num_keys = commify( scalar @{ $response->{keys} } );
    $bytes_used = commify($bytes_used);
    ...

The answer looks like this:

    29,118 keys in bucket bla. 98,524,002,052 total bytes used in bucket bla.
Do you think this process can be made less CPU-intensive somehow?

It seems as if the module builds the entire answer list in an array before you get the chance to do anything with it, such as keeping only a running total in memory or writing it out to a file as it arrives.
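
For reference, here's roughly what I think the chunked alternative to list_all would look like (a sketch only -- I'm guessing at the marker/'max-keys' parameters to list() and at is_truncated from the S3 docs):

    my $marker = '';
    my ( $count, $bytes ) = ( 0, 0 );
    while (1) {
        # fetch up to 1000 keys per request and tally them right away
        my $chunk = $bucket_now->list( { marker => $marker, 'max-keys' => 1000 } )
            or die $s3->err . ": " . $s3->errstr;
        for my $key ( @{ $chunk->{keys} } ) {
            $count++;
            $bytes += $key->{size};
        }
        last unless $chunk->{is_truncated};
        $marker = $chunk->{keys}[-1]{key};    # resume after the last key seen
    }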

Oh yeah one thing - the files are in pretty deep directory structures - perhaps xml parsing is the culprit for CPU usage due to the many nested levels - how would I confirm this?

Many thanks,

Harold


Update:

Thanks kyle for the suggestion to profile, which I did; it looks like my suspicion that XML-related activities take most of the cycles here might be confirmed:
    > dprofpp
    Total Elapsed Time = 2111.4 Seconds
      User+System Time = 2029.76 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c Name
     48.6   987.9 987.94 469469   0.0021 0.0021 XML::LibXML::NodeList::new
     48.2   979.3 979.37 469402   0.0021 0.0021 XML::LibXML::Literal::new
     0.75   15.27 15.270 469402   0.0000 0.0000 XML::LibXML::XPathContext::_find
     0.50   10.25 29.390 469536   0.0000 0.0001 XML::LibXML::XPathContext::_guarded_find_call
     0.42   8.430 992.83 469402   0.0000 0.0021 XML::LibXML::NodeList::to_literal
     0.35   7.130 1024.0 469402   0.0000 0.0022 XML::LibXML::XPathContext::find
     0.34   6.820 2026.5 469402   0.0000 0.0043 XML::LibXML::XPathContext::findvalue
     0.25   5.030  5.030 335335   0.0000 0.0000 XML::LibXML::Node::string_value
     0.24   4.840 2034.3     68   0.0712 29.916 Net::Amazon::S3::list_bucket
     0.14   2.820  2.820 469402   0.0000 0.0000 XML::LibXML::Literal::value
     0.07   1.420  1.420 469000   0.0000 0.0000 XML::LibXML::XPathContext::getContextNode
     0.06   1.310  1.310 938000   0.0000 0.0000 XML::LibXML::XPathContext::setContextNode
     0.05   0.930  0.930 402402   0.0000 0.0000 XML::LibXML::Node::DESTROY
     0.04   0.890  0.890 469536   0.0000 0.0000 XML::LibXML::XPathContext::_free_node_pool
     0.02   0.360  0.360     68   0.0053 0.0053 XML::LibXML::_parse_string
So, this ran for about 35 minutes, and unfortunately crapped out with a parser error:

    :2: parser error : xmlParseCharRef: invalid xmlChar value 8
I'm going to assume this is because the script ran while a file upload was taking place, and some of the returned records might not have been complete.

Profiling the script pointed at another, much smaller Amazon bucket, however, yields the same proportion of results -- that is -- XML::LibXML::NodeList::new and XML::LibXML::Literal::new each take 48% or more of the runtime...

So this brings me back to the original question :) -- can any kind soul suggest a way to improve performance? Using threads wouldn't enable me to put the idle CPU core to use, would it? Or...

Thanks once again -Harold

Update 2: Someone suggested changing from a foreach to a while in my function, but as can be seen from the profiling run above, most of the time (and, I'd guess, most of the CPU) is being spent within the XML modules that Net::Amazon::S3 uses to build the data structure.

The function with the foreach loop isn't even called until Net::Amazon::S3 finishes fetching and building its data, and it doesn't even appear in the top 15 functions.

-H

Replies are listed 'Best First'.
Re: Module uses loads of CPU.. or is it me
by kyle (Abbot) on Dec 10, 2007 at 03:11 UTC

    perhaps xml parsing is the culprit for CPU usage due to the many nested levels - how would I confirm this?

    You could confirm that through profiling. See Profiling your code.
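
    For instance, with the stock Devel::DProf (dprofpp summarizes its output), it's just the two commands below -- s3_count.pl stands in for whatever your script is called:

        perl -d:DProf s3_count.pl    # runs the script, writes tmon.out
        dprofpp                      # summarizes tmon.out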

    P. S. — I'd be interested to hear what the results say.

      Hi hsinclai,

      Have you considered using a while loop as opposed to your foreach?

      If I remember correctly, a foreach loop is more intensive as it requires an underlying structure to be created to iterate. A while loop does not need to do that and is generally a little faster.

      Maybe give that a shot.
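
      For instance, the counting loop rewritten with an index and a while might look like this (just a sketch -- I haven't benchmarked it):

          my $keys = $response->{keys};
          my ( $i, $bytes_used ) = ( 0, 0 );
          while ( $i < @$keys ) {
              $bytes_used += $keys->[$i]{size};
              $i++;
          }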

        Please see my Update 2 in the OP.

        -Harold

Re: Module uses loads of CPU.. or is it me
by redhotpenguin (Deacon) on Dec 11, 2007 at 01:04 UTC
    Can you show us a bit more of the code that calls LibXML? I don't know if you are using a streaming or DOM parser. It's obvious where the bottleneck is but I think seeing some more of the code would help diagnose the problem. If there is anything proprietary you can't show us then leave those parts out.
      My apologies, of course -- here's the script. It's my first test, just to see how well the module worked. Any XML-related stuff is being called by Net::Amazon::S3 behind the scenes.

      Maybe I'm missing something obvious (hope not)?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Net::Amazon::S3;

      my $aws_access_key_id     = 'XXXXXXXXXXXXXXXXXXXX';
      my $aws_secret_access_key = 'xxxxxxxxxxxxxxxxxxxx';
      my $chosen_bucket         = $ARGV[0] || 'default_bucketname';
      my $bytes_used            = 0;

      my $s3 = Net::Amazon::S3->new(
          {   aws_access_key_id     => $aws_access_key_id,
              aws_secret_access_key => $aws_secret_access_key,
          }
      );

      my $bucket_now = $s3->bucket($chosen_bucket);
      my $response   = $bucket_now->list_all
          or die $s3->err . ": " . $s3->errstr;

      byte_counter();

      # scalar @{...} gives the key count ($#{...} would be the last index)
      my $num_keys = commify( scalar @{ $response->{keys} } );
      print $num_keys . " keys in bucket $chosen_bucket." . $/;

      $bytes_used = commify($bytes_used);
      print $bytes_used . " total bytes used in bucket $chosen_bucket." . $/;

      #---
      sub byte_counter {
          foreach my $key ( @{ $response->{keys} } ) {
              $bytes_used += $key->{size};
          }
      }

      sub commify {
          my $text = reverse $_[0];
          $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
          return scalar reverse $text;
      }


      Note that Amazon's answers come back in XML, which is why the XML stuff is needed...

      -Harold

        Ah, I see -- the bottleneck is in Net::Amazon::S3, which appears to be using an XPath approach to get at the needed information. It looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work.

        If I were in your shoes, I might try to write an alternative sub to list_bucket which uses an approach other than XPath. If you look in:

        http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

        sub find calls new for each node it needs to find:

        sub find {
            my ($self, $xpath, $node) = @_;
            my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
            if ($type) {
                return $type->new(@params);
            }
            return undef;
        }

        This is where the OO interface of XML::LibXML::XPathContext becomes your bottleneck: a new object is constructed for every value found. You could probably get a faster result with a streaming parser, though how much faster I don't know. Sorry I can't be of much more help.
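
        As a rough illustration, a streaming pass over the bucket listing could look like the sketch below. This is only a guess at the shape of the thing: it assumes the <Contents>/<Size> element names of Amazon's ListBucket response, and it uses XML::LibXML::Reader, the pull-parser interface that ships with recent XML::LibXML:

            use XML::LibXML::Reader;

            # Pull-parse the raw ListBucket XML and tally key count and bytes
            # without constructing NodeList/Literal objects for every match.
            sub sum_sizes_streaming {
                my ($xml) = @_;
                my $reader = XML::LibXML::Reader->new( string => $xml );
                my ( $num_keys, $bytes ) = ( 0, 0 );
                while ( $reader->read ) {
                    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
                    if ( $reader->localName eq 'Contents' ) {
                        $num_keys++;
                    }
                    elsif ( $reader->localName eq 'Size' ) {
                        $bytes += $reader->readInnerXml;
                    }
                }
                return ( $num_keys, $bytes );
            }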

        UPDATE - you could also use some of the module's lower-level functions and attempt to parallelize the operation by having each CPU count x percent of the buckets. That's probably easier than speeding up the XML parser, and I think it's your best bet for a two-times-or-more speedup. You could fork off processes that each write their result to a temp file, then add up all the results at the end. That may be your shortest course to victory.
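
        A very rough sketch of that fork-and-merge idea, assuming the work can be partitioned (by bucket, or by key prefix); count_bytes() below is a stand-in for whatever each child actually does:

            use strict;
            use warnings;

            my @partitions = @ARGV;    # e.g. bucket names or key prefixes
            my @result_files;
            my $i = 0;

            for my $part (@partitions) {
                my $file = "/tmp/s3_count.$$." . $i++;   # one result file per child
                push @result_files, $file;
                defined( my $pid = fork() ) or die "fork failed: $!";
                if ( $pid == 0 ) {
                    # child: do its share of the counting, record the total, exit
                    open my $out, '>', $file or die "can't write $file: $!";
                    print {$out} count_bytes($part), "\n";
                    close $out;
                    exit 0;
                }
            }
            wait() for @partitions;    # reap all the children

            my $total = 0;
            for my $file (@result_files) {
                open my $in, '<', $file or die "can't read $file: $!";
                chomp( my $n = <$in> );
                $total += $n;
                unlink $file;
            }
            print "$total total bytes\n";

            # stand-in for the real per-partition work (e.g. a list_bucket loop)
            sub count_bytes { my ($part) = @_; return 0 }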
