Module uses loads of CPU.. or is it me

by hsinclai (Deacon)
on Dec 10, 2007 at 02:20 UTC

hsinclai has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm using Net::Amazon::S3 to get a listing of a bunch of files (under 30K files right now, but soon to be around 150K). I only want to grab the number of files/dirs and add up the bytes these take up..

It works fine but takes several minutes to run, during which time it appears to take up about 46MB of RAM (I have 4GB on this box). But the CPU gets slammed at 100% the whole time (actually, one core gets pegged).

Here's the only loop I have, the one that adds up the bytes. The module builds an array as a value inside a hash (I believe), and it also uses LWP and XML modules, among others, behind the scenes (I believe):

    my $bytes_used = 0;
    foreach my $key ( @{ $response->{keys} } ) {
        $bytes_used += $key->{size};
    }
    # scalar @{...} gives the key count ($#{...} would be the last index)
    my $num_keys = commify( scalar @{ $response->{keys} } );
    $bytes_used = commify($bytes_used);
    ...

The answer looks like this:

    29,118 keys in bucket bla. 98,524,002,052 total bytes used in bucket bla.
Do you think this process can be made less CPU-intensive somehow?

It seems as if the module builds the entire answer list in an array before you get the chance to do anything with it, such as keeping only a running total in memory or writing it out to a file as it arrives.
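
For reference, here's roughly what I think the chunked alternative to list_all would look like (a sketch only -- I'm guessing at the marker/'max-keys' parameters to list() and at is_truncated from the S3 docs):

    my $marker = '';
    my ( $count, $bytes ) = ( 0, 0 );
    while (1) {
        # fetch up to 1000 keys per request and tally them right away
        my $chunk = $bucket_now->list( { marker => $marker, 'max-keys' => 1000 } )
            or die $s3->err . ": " . $s3->errstr;
        for my $key ( @{ $chunk->{keys} } ) {
            $count++;
            $bytes += $key->{size};
        }
        last unless $chunk->{is_truncated};
        $marker = $chunk->{keys}[-1]{key};    # resume after the last key seen
    }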

Oh yeah one thing - the files are in pretty deep directory structures - perhaps xml parsing is the culprit for CPU usage due to the many nested levels - how would I confirm this?

Many thanks,

Harold


Update:

Thanks kyle for the suggestion to profile, which I did; it looks like my suspicion that XML-related activities take most of the cycles here might be confirmed:
    > dprofpp
    Total Elapsed Time = 2111.4 Seconds
      User+System Time = 2029.76 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c Name
     48.6   987.9 987.94 469469   0.0021 0.0021 XML::LibXML::NodeList::new
     48.2   979.3 979.37 469402   0.0021 0.0021 XML::LibXML::Literal::new
     0.75   15.27 15.270 469402   0.0000 0.0000 XML::LibXML::XPathContext::_find
     0.50   10.25 29.390 469536   0.0000 0.0001 XML::LibXML::XPathContext::_guarded_find_call
     0.42   8.430 992.83 469402   0.0000 0.0021 XML::LibXML::NodeList::to_literal
     0.35   7.130 1024.0 469402   0.0000 0.0022 XML::LibXML::XPathContext::find
     0.34   6.820 2026.5 469402   0.0000 0.0043 XML::LibXML::XPathContext::findvalue
     0.25   5.030  5.030 335335   0.0000 0.0000 XML::LibXML::Node::string_value
     0.24   4.840 2034.3     68   0.0712 29.916 Net::Amazon::S3::list_bucket
     0.14   2.820  2.820 469402   0.0000 0.0000 XML::LibXML::Literal::value
     0.07   1.420  1.420 469000   0.0000 0.0000 XML::LibXML::XPathContext::getContextNode
     0.06   1.310  1.310 938000   0.0000 0.0000 XML::LibXML::XPathContext::setContextNode
     0.05   0.930  0.930 402402   0.0000 0.0000 XML::LibXML::Node::DESTROY
     0.04   0.890  0.890 469536   0.0000 0.0000 XML::LibXML::XPathContext::_free_node_pool
     0.02   0.360  0.360     68   0.0053 0.0053 XML::LibXML::_parse_string
So, this ran for about 35 minutes, and unfortunately crapped out with a parser error:

    :2: parser error : xmlParseCharRef: invalid xmlChar value 8
I'm going to assume this is because the script ran while a file upload was taking place, and some of the returned records might not have been complete.

Profiling the script pointed at another, much smaller Amazon bucket, however, yields the same proportion of results -- that is -- XML::LibXML::NodeList::new and XML::LibXML::Literal::new each take 48% or more of the runtime...

So this brings me back to the original question :) -- can any kind soul suggest a way to improve performance? Using threads wouldn't enable me to put the idle CPU core to use, would it? Or...

Thanks once again -Harold

Update 2: Someone suggested changing from a foreach to a while in my function, but as can be seen from the profiling run above, most of the time (and, I'd guess, most of the CPU) is being spent within the XML modules that Net::Amazon::S3 uses to build the data structure.

The function with the foreach loop isn't even called until Net::Amazon::S3 finishes fetching and building its data, and it doesn't even appear in the top 15 functions.

-H

Replies are listed 'Best First'.
Re: Module uses loads of CPU.. or is it me
by kyle (Abbot) on Dec 10, 2007 at 03:11 UTC

    perhaps xml parsing is the culprit for CPU usage due to the many nested levels - how would I confirm this?

    You could confirm that through profiling. See Profiling your code.
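
    For instance, with the stock Devel::DProf (dprofpp summarizes its output), it's just the two commands below -- s3_count.pl stands in for whatever your script is called:

        perl -d:DProf s3_count.pl    # runs the script, writes tmon.out
        dprofpp                      # summarizes tmon.out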

    P. S. — I'd be interested to hear what the results say.

      Hi hsinclai,

      Have you considered using a while loop as opposed to your foreach?

      If I remember correctly, a foreach loop is more intensive as it requires an underlying structure to be created to iterate. A while loop does not need to do that and is generally a little faster.

      Maybe give that a shot.
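
      For instance, the counting loop rewritten with an index and a while might look like this (just a sketch -- I haven't benchmarked it):

          my $keys = $response->{keys};
          my ( $i, $bytes_used ) = ( 0, 0 );
          while ( $i < @$keys ) {
              $bytes_used += $keys->[$i]{size};
              $i++;
          }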

        Please see my Update 2 in the OP.

        -Harold

Re: Module uses loads of CPU.. or is it me
by redhotpenguin (Deacon) on Dec 11, 2007 at 01:04 UTC
    Can you show us a bit more of the code that calls LibXML? I don't know if you are using a streaming or DOM parser. It's obvious where the bottleneck is but I think seeing some more of the code would help diagnose the problem. If there is anything proprietary you can't show us then leave those parts out.
      My apologies, of course -- here's the script. It's my first test, just to see how well the module worked. Any XML-related stuff is being called by Net::Amazon::S3 behind the scenes.

      Maybe I'm missing something obvious (hope not)?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Net::Amazon::S3;

      my $aws_access_key_id     = 'XXXXXXXXXXXXXXXXXXXX';
      my $aws_secret_access_key = 'xxxxxxxxxxxxxxxxxxxx';
      my $chosen_bucket         = $ARGV[0] || 'default_bucketname';
      my $bytes_used            = 0;

      my $s3 = Net::Amazon::S3->new(
          {   aws_access_key_id     => $aws_access_key_id,
              aws_secret_access_key => $aws_secret_access_key,
          }
      );

      my $bucket_now = $s3->bucket($chosen_bucket);
      my $response   = $bucket_now->list_all
          or die $s3->err . ": " . $s3->errstr;

      byte_counter();

      # scalar @{...} gives the key count ($#{...} would be the last index)
      my $num_keys = commify( scalar @{ $response->{keys} } );
      print $num_keys . " keys in bucket $chosen_bucket." . $/;

      $bytes_used = commify($bytes_used);
      print $bytes_used . " total bytes used in bucket $chosen_bucket." . $/;

      #---
      sub byte_counter {
          foreach my $key ( @{ $response->{keys} } ) {
              $bytes_used += $key->{size};
          }
      }

      sub commify {
          my $text = reverse $_[0];
          $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
          return scalar reverse $text;
      }


      Note that Amazon's answers come back in XML, which is why the XML stuff is needed...

      -Harold

        Ah, I see -- the bottleneck is in Net::Amazon::S3, which appears to be using an XPath approach to get at the needed information. It looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work.

        If I were in your shoes, I might try to write an alternative sub to list_bucket which uses an approach other than XPath. If you look in:

        http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

        sub find calls new for each node it needs to find:

        sub find {
            my ($self, $xpath, $node) = @_;
            my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
            if ($type) {
                return $type->new(@params);
            }
            return undef;
        }

        This is where the OO interface of XML::LibXML::XPathContext becomes your bottleneck: a new object is constructed for every value found. You could probably get a faster result with a streaming parser, though how much faster I don't know. Sorry I can't be of much more help.
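
        As a rough illustration, a streaming pass over the bucket listing could look like the sketch below. This is only a guess at the shape of the thing: it assumes the <Contents>/<Size> element names of Amazon's ListBucket response, and it uses XML::LibXML::Reader, the pull-parser interface that ships with recent XML::LibXML:

            use XML::LibXML::Reader;

            # Pull-parse the raw ListBucket XML and tally key count and bytes
            # without constructing NodeList/Literal objects for every match.
            sub sum_sizes_streaming {
                my ($xml) = @_;
                my $reader = XML::LibXML::Reader->new( string => $xml );
                my ( $num_keys, $bytes ) = ( 0, 0 );
                while ( $reader->read ) {
                    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
                    if ( $reader->localName eq 'Contents' ) {
                        $num_keys++;
                    }
                    elsif ( $reader->localName eq 'Size' ) {
                        $bytes += $reader->readInnerXml;
                    }
                }
                return ( $num_keys, $bytes );
            }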

        UPDATE - you could also use some of the module's lower-level functions and attempt to parallelize the operation by having each CPU count x percent of the buckets. That's probably easier than speeding up the XML parser, and I think it's your best bet for a two-times-or-more speedup. You could fork off processes that each write their result to a temp file, then add up all the results at the end. That may be your shortest course to victory.
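
        A very rough sketch of that fork-and-merge idea, assuming the work can be partitioned (by bucket, or by key prefix); count_bytes() below is a stand-in for whatever each child actually does:

            use strict;
            use warnings;

            my @partitions = @ARGV;    # e.g. bucket names or key prefixes
            my @result_files;
            my $i = 0;

            for my $part (@partitions) {
                my $file = "/tmp/s3_count.$$." . $i++;   # one result file per child
                push @result_files, $file;
                defined( my $pid = fork() ) or die "fork failed: $!";
                if ( $pid == 0 ) {
                    # child: do its share of the counting, record the total, exit
                    open my $out, '>', $file or die "can't write $file: $!";
                    print {$out} count_bytes($part), "\n";
                    close $out;
                    exit 0;
                }
            }
            wait() for @partitions;    # reap all the children

            my $total = 0;
            for my $file (@result_files) {
                open my $in, '<', $file or die "can't read $file: $!";
                chomp( my $n = <$in> );
                $total += $n;
                unlink $file;
            }
            print "$total total bytes\n";

            # stand-in for the real per-partition work (e.g. a list_bucket loop)
            sub count_bytes { my ($part) = @_; return 0 }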
