PerlMonks  

Re: Module uses loads of CPU.. or is it me

by redhotpenguin (Deacon)
on Dec 11, 2007 at 01:04 UTC


in reply to Module uses loads of CPU.. or is it me

Can you show us a bit more of the code that calls LibXML? I can't tell whether you are using a streaming or a DOM parser. It's obvious where the bottleneck is, but I think seeing more of the code would help diagnose the problem. If anything is proprietary and you can't show it to us, just leave those parts out.

Re^2: Module uses loads of CPU.. or is it me
by hsinclai (Deacon) on Dec 11, 2007 at 01:36 UTC
    My apologies, of course, here's the script. It's my first test, just to see how well the module worked. Any XML-related work is being done by Net::Amazon::S3 behind the scenes.

    Maybe I'm missing something obvious (hope not)?

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::Amazon::S3;

    my $aws_access_key_id     = 'XXXXXXXXXXXXXXXXXXXX';
    my $aws_secret_access_key = 'xxxxxxxxxxxxxxxxxxxx';
    my $chosen_bucket         = $ARGV[0] || 'default_bucketname';
    my $bytes_used            = 0;

    my $s3 = Net::Amazon::S3->new(
        {
            aws_access_key_id     => $aws_access_key_id,
            aws_secret_access_key => $aws_secret_access_key
        }
    );

    my $bucket_now = $s3->bucket($chosen_bucket);
    my $response   = $bucket_now->list_all
        or die $s3->err . ": " . $s3->errstr;

    byte_counter();

    # scalar @{...} is the number of keys; $#{...} would be the last index (count - 1).
    my $num_keys = commify( scalar @{ $response->{keys} } );
    print $num_keys . " keys in bucket $chosen_bucket." . $/;
    $bytes_used = commify($bytes_used);
    print $bytes_used . " total bytes used in bucket $chosen_bucket." . $/;

    #---
    # Add up the size of every key in the listing.
    sub byte_counter {
        foreach my $key ( @{ $response->{keys} } ) {
            $bytes_used += $key->{size};
        }
    }

    # Insert thousands separators into a number.
    sub commify {
        my $text = reverse $_[0];
        $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
        return scalar reverse $text;
    }


    Note that Amazon's answers come back as XML, which is why the XML stuff is needed...

    -Harold

      Ah I see, the bottleneck is in Net::Amazon::S3, which appears to be using an XPath approach to get at the needed information. It looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work.

      If I were in your shoes, I might try to write an alternative to the list_bucket sub that uses an approach other than XPath. If you look in:

      http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

      sub find calls new for each node it needs to find:

      sub find {
          my ($self, $xpath, $node) = @_;
          my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
          if ($type) {
              return $type->new(@params);
          }
          return undef;
      }

      This is where the OO interface of XML::LibXML::XPathContext is your bottleneck. You could probably develop a faster interface using a streaming parser, but how much faster I don't know. You'll need some sort of optimization in there to get a faster result. Sorry I can't be of much more help.
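
      To give a rough idea of what the streaming approach could look like, here is a minimal sketch using XML::LibXML::Reader, the pull-parser interface included in recent XML::LibXML releases. It assumes the listing looks like Amazon's ListBucketResult document with one <Size> element per key. The sample XML and the sum_sizes name are made up for illustration, and it has not been benchmarked against the XPath code, so treat it as a starting point only.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: walk the document node by node and add up every
# <Size> element, without building a DOM or evaluating XPath expressions.
sub sum_sizes {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new( string => $xml )
        or die "cannot create reader";

    my ( $count, $bytes ) = ( 0, 0 );
    while ( $reader->read == 1 ) {
        # Only element start nodes are interesting.
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        # localName ignores the S3 namespace on the element.
        next unless $reader->localName eq 'Size';
        $bytes += $reader->readInnerXml;    # text content of <Size>
        $count++;
    }
    return ( $count, $bytes );
}

# Made-up sample in the shape of an S3 bucket listing.
my $sample = <<'XML';
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example</Name>
  <Contents><Key>a.txt</Key><Size>1200</Size></Contents>
  <Contents><Key>b.txt</Key><Size>34</Size></Contents>
</ListBucketResult>
XML

my ( $keys, $total ) = sum_sizes($sample);
print "$keys keys, $total bytes\n";
```

      To use something like this with Net::Amazon::S3 you would still need to get at the raw response body yourself, which the script above does not do.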

      UPDATE - you could also use some of the module's lower-level functions and try to parallelize the operation by having each CPU count x percent of the keys in the bucket. That's probably easier than speeding up the XML parser, and I think it is probably your best bet for a speedup of two times or more. You could fork off processes that each write their result to a temp file, and then add up all the results at the end. I think that may be your shortest course to victory.
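
      The fork-and-temp-file idea could be sketched roughly like this, using only core modules. The parallel_sum name is hypothetical, and the list of sizes is an in-memory stand-in. In a real run each child would have to fetch and parse its own part of the listing, otherwise the XML parsing still happens in a single process and nothing is gained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical helper: split @sizes across $workers forked children, let each
# child write its subtotal to a temp file, then add the subtotals in the parent.
sub parallel_sum {
    my ( $workers, @sizes ) = @_;
    my $dir = tempdir( CLEANUP => 1 );
    my @pids;

    for my $w ( 0 .. $workers - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ( $pid == 0 ) {
            # Child: take every $workers-th item, starting at offset $w.
            my $subtotal = 0;
            for ( my $i = $w ; $i < @sizes ; $i += $workers ) {
                $subtotal += $sizes[$i];
            }
            open my $out, '>', "$dir/$w.txt" or die "open: $!";
            print {$out} $subtotal;
            close $out or die "close: $!";
            POSIX::_exit(0);    # leave without running the parent's cleanup code
        }
        push @pids, $pid;
    }

    # Parent: wait for every child and make sure it succeeded.
    for my $pid (@pids) {
        waitpid( $pid, 0 );
        die "worker $pid failed" if $? != 0;
    }

    # Add up the subtotals written by the children.
    my $total = 0;
    for my $w ( 0 .. $workers - 1 ) {
        open my $in, '<', "$dir/$w.txt" or die "open: $!";
        my $subtotal = <$in>;
        close $in;
        $total += $subtotal;
    }
    return $total;
}

print parallel_sum( 4, 10, 20, 30, 40, 50 ), "\n";
```

      Whether this pays off depends on how the listing can be divided up, for example by key prefix, so that each child does its own request and its own parsing.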

        Wow thanks for digging this deep to find this problem! That is awesome.

        And thanks also for the suggestions, though writing a new XML parsing tool seems a little much (if not a little bit daunting too:), and not knowing what the actual return will be makes me wonder if it's worth it in this case.

        Now I wonder if this exact issue has not been encountered in other XML applications, and if so, how it was improved.

        Thanks again,

        -H
