in reply to Re^2: Module uses loads of CPU.. or is it me
in thread Module uses loads of CPU.. or is it me

Ah, I see: the bottleneck is in Net::Amazon::S3, which uses an XPath approach to get at the needed information. It looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work.

If I were in your shoes, I might try to write an alternative to the list_bucket sub that uses an approach other than XPath. If you look in:

http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

you'll see that sub find calls new for each node it needs to find:

    sub find {
        my ($self, $xpath, $node) = @_;
        my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
        if ($type) {
            return $type->new(@params);
        }
        return undef;
    }

This is where the OO interface of XML::LibXML::XPathContext becomes your bottleneck: a new result object is constructed for every match. You could probably build a faster path using a streaming parser, though I don't know how much faster it would be; you'll need some kind of optimization along those lines to get a better result. Sorry I can't be of more help.
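To illustrate the streaming-parser idea, here is a minimal sketch using XML::Twig (a non-core CPAN module): it counts <Key> elements from an S3 ListBucketResult as they are parsed, without building a full DOM or going through the XPath OO layer. The inline XML here is a made-up two-key response, just for demonstration.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $count = 0;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # fires once per <Contents><Key> as it is parsed
            'Contents/Key' => sub {
                my ($t, $elt) = @_;
                $count++;
                $t->purge;    # free already-parsed parts of the tree
            },
        },
    );

    # $xml stands in for the raw ListBucketResult response body from S3
    my $xml = <<'XML';
    <ListBucketResult>
      <Contents><Key>foo.txt</Key></Contents>
      <Contents><Key>bar.txt</Key></Contents>
    </ListBucketResult>
    XML

    $twig->parse($xml);
    print "keys: $count\n";    # prints "keys: 2"

Because the handler purges the tree as it goes, memory stays flat no matter how large the listing is, and no per-node XPath result objects are constructed.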

UPDATE - you could also use some of the module's lower-level functions and parallelize the operation by having each CPU count some fraction of the buckets. That's probably easier than speeding up the XML parser, and I think it's your best bet for a 2x-or-greater speedup. You could fork off processes that each write their results to a temp file, then add up all the results at the end. I think that may be your shortest course to victory.
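The fork-and-temp-file scheme above can be sketched with core modules only. The @chunks here are hypothetical stand-ins for batches of keys; in real code each child would fetch and count its own slice of the bucket listing instead of just taking the array size.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Temp qw(tempdir);

    # stand-ins for three batches of keys a worker would fetch and count
    my @chunks = ([1 .. 100], [101 .. 200], [201 .. 300]);
    my $dir = tempdir(CLEANUP => 1);

    my @pids;
    for my $i (0 .. $#chunks) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child: count this chunk, write result, exit
            my $count = scalar @{ $chunks[$i] };
            open my $fh, '>', "$dir/$i.count" or die $!;
            print {$fh} $count;
            close $fh;
            exit 0;
        }
        push @pids, $pid;   # parent: remember the child
    }
    waitpid $_, 0 for @pids;

    # add up the per-worker results from the temp files
    my $total = 0;
    for my $i (0 .. $#chunks) {
        open my $fh, '<', "$dir/$i.count" or die $!;
        $total += <$fh>;
        close $fh;
    }
    print "total: $total\n";    # prints "total: 300"

Each worker runs on its own CPU, and the parent only does trivial arithmetic at the end, so the wall-clock time approaches the slowest single chunk.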

Re^4: Module uses loads of CPU.. or is it me
by hsinclai (Deacon) on Dec 11, 2007 at 13:34 UTC
    Wow thanks for digging this deep to find this problem! That is awesome.

    And thanks also for the suggestions, though writing a new XML parsing tool seems a little much (if not a bit daunting too :), and not knowing what the actual payoff would be makes me wonder whether it's worth it in this case.

    Now I wonder whether this exact issue has been encountered in other XML applications, and if so, how it was solved.

    Thanks again,

    -H

      Well, I think you should take a serious look at some of the lower-level methods in Net::Amazon::S3 and try to develop a parallelized application. You're unlikely to double the efficiency of the parser itself, but I think with a few hours of hacking you could get a parallelized version of your program that achieves an Nx speedup.

        Without hacking too deeply, some parallelization might be achieved by using the "marker" method already in the module itself and doing parallel fetches in batches. Looking further into the module, I think I found an error or bug (maybe it's just a documentation bug): http://rt.cpan.org/Public/Bug/Display.html?id=31381.
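        For reference, the marker-based paging loop looks roughly like this. fetch_page() here is a hypothetical stub standing in for a real list_bucket call (which, as I read the module, takes a marker and a max-keys limit and reports whether the listing was truncated); the 25 fake keys are made up for the example. In a batched-parallel version, each worker would run a loop like this over its own range of markers.

            #!/usr/bin/perl
            use strict;
            use warnings;

            my @all_keys = map { "key$_" } 1 .. 25;    # fake bucket contents

            # Stub imitating a paged list call: return up to $max keys
            # after $marker, plus truncation info for the next request.
            sub fetch_page {
                my ($marker, $max) = @_;
                my $start = 0;
                if (defined $marker) {
                    ($start) = grep { $all_keys[$_] eq $marker } 0 .. $#all_keys;
                    $start++;
                }
                my $end = $start + $max - 1;
                $end = $#all_keys if $end > $#all_keys;
                return {
                    keys         => [ @all_keys[$start .. $end] ],
                    is_truncated => ($end < $#all_keys) ? 1 : 0,
                    next_marker  => $all_keys[$end],
                };
            }

            # Page through the listing 10 keys at a time, counting as we go.
            my ($marker, $count) = (undef, 0);
            while (1) {
                my $res = fetch_page($marker, 10);
                $count += @{ $res->{keys} };
                last unless $res->{is_truncated};
                $marker = $res->{next_marker};
            }
            print "counted $count keys\n";    # prints "counted 25 keys"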

        Going much further is going to be a question of (my) available time right now -- but I really do appreciate the push!

        -Harold