Hi Monks,
I'm using Net::Amazon::S3 to get a listing of a bunch of files (under 30K files right now, but soon to be around 150K). I only want to count the files/dirs and add up the bytes they take up.
It works fine but takes several minutes to run, during which time it appears to take up about 46MB of RAM (I have 4GB on this box). But the CPU gets slammed at 100% the whole time (actually, one core gets pegged).
Here's the only loop I have, which adds up the bytes. The module builds an array as a value inside a hash (I believe), and it also uses LWP and XML modules, among others, behind the scenes (I believe).
my $bytes_used = 0;
foreach my $key ( @{ $response->{keys} } ) {
    $bytes_used += $key->{size};
}
# scalar @{...} is the element count; $#{...} would be the last index (count - 1)
my $num_keys = commify( scalar @{ $response->{keys} } );
$bytes_used = commify($bytes_used);
...
the answer looks like this:
29,118 keys in bucket bla.
98,524,002,052 total bytes used in bucket bla.
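For anyone who wants to run the tally without touching S3, here is a minimal standalone sketch. The `$response` structure is mocked to match what is described above (a hashref whose `keys` entry is an array of hashrefs with a `size` field), and `commify` is the version from perlfaq5, which is an assumption since the post does not show its own definition. The helper name `tally` is made up for illustration.

```perl
use strict;
use warnings;

# commify as given in perlfaq5 (assumed; the original definition is not shown).
sub commify {
    local $_ = shift;
    1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $_;
}

# Mocked stand-in for the listing: three keys with made-up sizes.
my $response = {
    keys => [
        { key => 'a/one.txt',   size => 1_200 },
        { key => 'a/two.txt',   size => 34_567 },
        { key => 'b/three.bin', size => 1_000_000 },
    ],
};

# Hypothetical helper: returns (number of keys, total bytes).
sub tally {
    my ($resp) = @_;
    my $bytes = 0;
    $bytes += $_->{size} for @{ $resp->{keys} };
    my $count = scalar @{ $resp->{keys} };    # element count, not the last index
    return ( $count, $bytes );
}

my ( $num_keys, $bytes_used ) = tally($response);
printf "%s keys in bucket bla.\n",             commify($num_keys);
printf "%s total bytes used in bucket bla.\n", commify($bytes_used);
```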
Do you think this process can be made less CPU-intensive somehow?
It seems as if the module is going to build the answer list in an array before you get the chance to do anything else like either keep it in memory or write it out to a file.
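One way around building the whole list first might be to page through the listing and tally each page as it arrives. The sketch below assumes each page is a hashref shaped like a single `list_bucket` response (`keys`, `is_truncated`, and optionally `next_marker`). The names `tally_paged` and `fake_list_bucket` are hypothetical, and the fake fetcher exists only so the example runs offline. This would cut memory use, not necessarily the CPU spent parsing.

```perl
use strict;
use warnings;

# Tally keys and bytes one page at a time. $fetch_page is a code ref that
# takes a marker (or undef for the first page) and returns one page hashref.
sub tally_paged {
    my ($fetch_page) = @_;
    my ( $count, $bytes, $marker ) = ( 0, 0, undef );
    while (1) {
        my $page = $fetch_page->($marker);
        my @keys = @{ $page->{keys} || [] };
        $count += @keys;
        $bytes += $_->{size} for @keys;
        last unless $page->{is_truncated} && @keys;
        # Fall back to the last key seen when no next_marker is supplied.
        $marker = $page->{next_marker} || $keys[-1]{key};
    }
    return ( $count, $bytes );
}

# Hypothetical in-memory stand-in for S3: 25 keys, served in sorted order.
# Against a real bucket the code ref might look something like:
#   sub { my %a = ( bucket => 'bla' );
#         $a{marker} = $_[0] if defined $_[0];
#         $s3->list_bucket( \%a ) }
my @fake = map { +{ key => sprintf( 'dir/%04d', $_ ), size => $_ * 10 } } 1 .. 25;

sub fake_list_bucket {
    my ( $marker, $page_size ) = @_;
    my @rest = grep { !defined $marker || $_->{key} gt $marker } @fake;
    my @page = splice @rest, 0, $page_size;
    return { keys => \@page, is_truncated => ( @rest ? 1 : 0 ) };
}

my ( $count, $bytes ) = tally_paged( sub { fake_list_bucket( $_[0], 10 ) } );
print "$count keys, $bytes bytes\n";
```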
Oh yeah one thing - the files are in pretty deep directory structures - perhaps xml parsing is the culprit for CPU usage due to the many nested levels - how would I confirm this?
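Short of a full profiler, a coarse first check is to measure CPU time around each stage with Perl's built-in `times`. The sketch below is only an illustration: `cpu_timed` is a made-up helper, and the two stages are stand-ins so it runs offline. In the real script the first code ref would wrap the Net::Amazon::S3 listing call and the second the summing loop. CPU time, unlike wall-clock time, leaves out time spent waiting on the network.

```perl
use strict;
use warnings;

# Run a code ref, report user and system CPU seconds on STDERR,
# and pass the code ref's result through unchanged.
sub cpu_timed {
    my ( $label, $code ) = @_;
    my ( $u0, $s0 ) = times;
    my $result = $code->();
    my ( $u1, $s1 ) = times;
    warn sprintf( "%-12s user %.2fs  sys %.2fs\n", $label, $u1 - $u0, $s1 - $s0 );
    return $result;
}

# Stand-in for the fetch-and-parse stage (hypothetical data).
my $response = cpu_timed( 'fetch+parse', sub {
    return { keys => [ map { +{ size => $_ } } 1 .. 1000 ] };
} );

# Stand-in for the summing loop from the script.
my $bytes = cpu_timed( 'sum loop', sub {
    my $total = 0;
    $total += $_->{size} for @{ $response->{keys} };
    return $total;
} );

print "$bytes\n";
```

If nearly all the CPU lands in the first stage, that points at the module's parsing rather than the loop.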
Many thanks,
Harold
Update: Thanks kyle for the suggestion to profile, which I did. It looks like my suspicion that XML-related activities take most of the cycles here might be confirmed:
> dprofpp
Total Elapsed Time = 2111.4 Seconds
  User+System Time = 2029.76 Seconds
Exclusive Times
%Time ExclSec CumulS  #Calls sec/call Csec/c  Name
 48.6   987.9 987.94  469469   0.0021 0.0021  XML::LibXML::NodeList::new
 48.2   979.3 979.37  469402   0.0021 0.0021  XML::LibXML::Literal::new
 0.75   15.27 15.270  469402   0.0000 0.0000  XML::LibXML::XPathContext::_find
 0.50   10.25 29.390  469536   0.0000 0.0001  XML::LibXML::XPathContext::_guarded_find_call
 0.42   8.430 992.83  469402   0.0000 0.0021  XML::LibXML::NodeList::to_literal
 0.35   7.130 1024.0  469402   0.0000 0.0022  XML::LibXML::XPathContext::find
 0.34   6.820 2026.5  469402   0.0000 0.0043  XML::LibXML::XPathContext::findvalue
 0.25   5.030  5.030  335335   0.0000 0.0000  XML::LibXML::Node::string_value
 0.24   4.840 2034.3      68   0.0712 29.916  Net::Amazon::S3::list_bucket
 0.14   2.820  2.820  469402   0.0000 0.0000  XML::LibXML::Literal::value
 0.07   1.420  1.420  469000   0.0000 0.0000  XML::LibXML::XPathContext::getContextNode
 0.06   1.310  1.310  938000   0.0000 0.0000  XML::LibXML::XPathContext::setContextNode
 0.05   0.930  0.930  402402   0.0000 0.0000  XML::LibXML::Node::DESTROY
 0.04   0.890  0.890  469536   0.0000 0.0000  XML::LibXML::XPathContext::_free_node_pool
 0.02   0.360  0.360      68   0.0053 0.0053  XML::LibXML::_parse_string
So, this ran for about 35 minutes, and unfortunately crapped out with a parser error
:2: parser error : xmlParseCharRef: invalid xmlChar value 8
I'm going to assume this is because the script ran while a file upload was taking place, and some of the returned records might not have been complete.
Profiling the script pointed at another, much smaller Amazon bucket, however, yields the same proportions: XML::LibXML::NodeList::new and XML::LibXML::Literal::new each take 48% or more of the runtime.
So this brings me back to the original question :) Can any kind soul suggest a way to improve performance? Using threads would not let me put the idle CPU core to use, would it? Or...
Thanks once again
-Harold
Update 2: Someone suggested changing from a foreach to a while in my function, but as can be seen from the profiling run in the OP, most of the time (and also CPU, I would guess) is being spent within the XML modules that Net::Amazon::S3 uses to build the data structure.
The function with the foreach loop isn't even called until Net::Amazon::S3 finishes fetching and building its data, and it doesn't even appear in the top 15 functions.
-H