
Hi Monks,

I'm using Net::Amazon::S3 to list a large number of files (under 30K right now, but soon to be around 150K). I only want the number of files/dirs and the total bytes they take up.

It works fine but takes several minutes to run, during which time it appears to take up about 46MB of RAM (I have 4GB on this box). But the CPU gets slammed at 100% the whole time (actually, one core gets pegged).

Here's the only loop I have, which adds up the bytes. I believe the module builds an array as a value inside a hash, and that it uses LWP and XML modules, among others, behind the scenes.

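my $bytes_used = 0;

# Sum the size of every key in the listing.
foreach my $key ( @{ $response->{keys} } ) {
  $bytes_used += $key->{size};
}

# scalar @{...} is the element count; $#{...} is the last index (count - 1).
my $num_keys = commify(scalar @{ $response->{keys} });
$bytes_used  = commify($bytes_used);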
...
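commify() is not shown in the post. A minimal sketch of what it might look like, assuming it is the regex-loop idiom from perlfaq5 (that is a guess, the post does not confirm it):

```perl
use strict;
use warnings;

# Hypothetical helper: the post never shows its own commify().
# This is the regex-loop idiom from perlfaq5: keep inserting a comma
# before the last three digits of the leading digit run until the
# substitution no longer matches.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}

print commify(98524002052), "\n";
```

Note that the original variable is overwritten with the formatted string, so any further arithmetic on it has to happen before the call.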

The answer looks like this:

29,118            keys in bucket bla.
98,524,002,052    total bytes used in bucket bla.

Do you think this process can be made less CPU-intensive somehow?

It seems the module builds the whole answer list in an array before you get a chance to do anything with it, such as keeping it in memory or writing it out to a file.
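For comparison, here is a sketch of tallying one page at a time instead of waiting for the full list. fetch_page() and the fake bucket are hypothetical stand-ins so the example runs on its own; they are not the Net::Amazon::S3 API, and whether the module lets you drive the paging yourself is an assumption to check against its documentation.

```perl
use strict;
use warnings;

# Hypothetical data: seven keys with sizes 10, 20, ... 70.
my @fake_bucket = map { +{ key => "file$_", size => $_ * 10 } } 1 .. 7;

# Hypothetical stand-in for one listing request. Returns one page of keys
# plus a marker for the next page (undef when nothing is left).
sub fetch_page {
    my ($marker, $page_size) = @_;
    my $start = defined $marker ? $marker : 0;
    my $end   = $start + $page_size - 1;
    $end = $#fake_bucket if $end > $#fake_bucket;
    my @page = @fake_bucket[ $start .. $end ];
    my $next = $end < $#fake_bucket ? $end + 1 : undef;
    return { keys => \@page, next_marker => $next };
}

# Keep only running totals; each page can be freed once it is counted.
my ($num_keys, $bytes_used, $marker) = (0, 0, undef);
while (1) {
    my $page = fetch_page($marker, 3);
    for my $key (@{ $page->{keys} }) {
        $num_keys++;
        $bytes_used += $key->{size};
    }
    $marker = $page->{next_marker};
    last unless defined $marker;
}

print "$num_keys keys, $bytes_used bytes\n";
```

This keeps memory flat and lets you report progress, but it would not by itself reduce whatever CPU time is spent parsing each page.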

Oh yeah, one thing: the files are in pretty deep directory structures. Perhaps XML parsing is the culprit for the CPU usage because of the many nested levels. How would I confirm this?

Many thanks,

Harold

Update:

Thanks kyle for the suggestion to profile, which I did. It looks like my suspicion that XML-related activities take most of the cycles here might be confirmed:

 > dprofpp
Total Elapsed Time =   2111.4 Seconds
  User+System Time =  2029.76 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 48.6   987.9 987.94 469469   0.0021 0.0021  XML::LibXML::NodeList::new
 48.2   979.3 979.37 469402   0.0021 0.0021  XML::LibXML::Literal::new
 0.75   15.27 15.270 469402   0.0000 0.0000  XML::LibXML::XPathContext::_find
 0.50   10.25 29.390 469536   0.0000 0.0001  XML::LibXML::XPathContext::_guarded_find_call
 0.42   8.430 992.83 469402   0.0000 0.0021  XML::LibXML::NodeList::to_literal
 0.35   7.130 1024.0 469402   0.0000 0.0022  XML::LibXML::XPathContext::find
 0.34   6.820 2026.5 469402   0.0000 0.0043  XML::LibXML::XPathContext::findvalue
 0.25   5.030  5.030 335335   0.0000 0.0000  XML::LibXML::Node::string_value
 0.24   4.840 2034.3     68   0.0712 29.916  Net::Amazon::S3::list_bucket
 0.14   2.820  2.820 469402   0.0000 0.0000  XML::LibXML::Literal::value
 0.07   1.420  1.420 469000   0.0000 0.0000  XML::LibXML::XPathContext::getContextNode
 0.06   1.310  1.310 938000   0.0000 0.0000  XML::LibXML::XPathContext::setContextNode
 0.05   0.930  0.930 402402   0.0000 0.0000  XML::LibXML::Node::DESTROY
 0.04   0.890  0.890 469536   0.0000 0.0000  XML::LibXML::XPathContext::_free_node_pool
 0.02   0.360  0.360     68   0.0053 0.0053  XML::LibXML::_parse_string

So this ran for about 35 minutes and unfortunately crapped out with a parser error:
:2: parser error : xmlParseCharRef: invalid xmlChar value 8
I'm going to assume this is because the script ran while a file upload was taking place, and some of the returned records might not have been complete.

Profiling the script against another, much smaller Amazon bucket, however, yields the same proportions: XML::LibXML::NodeList::new and XML::LibXML::Literal::new each take 48% or more of the runtime...

So this brings me back to the original question :) Can any kind soul suggest a way to improve performance? Using threads would not let me put the idle CPU core to use, would it? Or...
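One direction that might be worth testing, given that nearly all the time sits under findvalue: pull the sizes straight out of the raw listing XML with a single regex pass instead of one XPath lookup per field. This is only a sketch. The XML below is a hypothetical, trimmed-down response, it assumes you can get at the raw XML at all (the module may not expose it), and regexes over XML are fragile compared with a real parser.

```perl
use strict;
use warnings;

# Hypothetical, trimmed-down listing response for illustration only.
my $xml = <<'XML';
<ListBucketResult>
  <Contents><Key>a/b/one.txt</Key><Size>1200</Size></Contents>
  <Contents><Key>a/b/c/two.txt</Key><Size>3400</Size></Contents>
  <Contents><Key>three.txt</Key><Size>5</Size></Contents>
</ListBucketResult>
XML

# One pass over the text: count each <Size> element and add up its value.
my ($num_keys, $bytes_used) = (0, 0);
while ($xml =~ m{<Size>(\d+)</Size>}g) {
    $num_keys++;
    $bytes_used += $1;
}

print "$num_keys keys, $bytes_used bytes\n";
```

Whether this is actually faster on a real bucket would need the same dprofpp comparison as above.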

Thanks once again -Harold

Update 2: Someone suggested changing from a foreach to a while in my function, but as can be seen from the profiling run above, most of the time (and also CPU, I would guess) is being spent within the XML modules used by Net::Amazon::S3 to build the data structure.

The function with the foreach loop isn't even called until Net::Amazon::S3 finishes getting and building its data, and it doesn't even appear in the top 15 functions.
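A quick way to check that the summing loop is not the problem is to time it on its own against synthetic data. A minimal sketch, assuming a result shaped like the one in the post (a hash with a keys array of hashes that each have a size); the 150,000 entries are made up to match the expected bucket size:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Synthetic stand-in for the module's result: 150_000 key hashes.
my $response = { keys => [ map { +{ size => $_ } } 1 .. 150_000 ] };

# Time only the summing loop, nothing else.
my $start      = time;
my $bytes_used = 0;
$bytes_used += $_->{size} for @{ $response->{keys} };
my $elapsed = time - $start;

printf "summed %d keys in %.4f seconds\n",
    scalar @{ $response->{keys} }, $elapsed;
```

If this finishes in a small fraction of a second, as one would expect, rewriting the loop cannot account for minutes of CPU time.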

-H

In reply to Module uses loads of CPU.. or is it me by hsinclai
