Re^2: Module uses loads of CPU.. or is it me

My apologies; of course, here's the script. It's my first test, just to see how well the module works. Any XML-related stuff is being called by Net::Amazon::S3 behind the scenes.

Maybe I'm missing something obvious (hope not)?

#!/usr/bin/perl

use strict;
use warnings;
use Net::Amazon::S3;


my $aws_access_key_id      =  'XXXXXXXXXXXXXXXXXXXX';
my $aws_secret_access_key  =  'xxxxxxxxxxxxxxxxxxxx';
my $chosen_bucket          =  $ARGV[0] || 'default_bucketname';
my $bytes_used             =  0;





my $s3  =  Net::Amazon::S3->new(
       {
         aws_access_key_id     => $aws_access_key_id,
         aws_secret_access_key => $aws_secret_access_key
       }
);
my $bucket_now    = $s3->bucket($chosen_bucket);
my $response      = $bucket_now->list_all or die $s3->err . ": " . $s3->errstr;
byte_counter();

# scalar @array is the element count; $#array is the last index (count - 1)
my $num_keys = commify( scalar @{ $response->{keys} } );
print $num_keys . " keys in bucket $chosen_bucket." . $/;

$bytes_used  = commify($bytes_used);
print $bytes_used . " total bytes used in bucket $chosen_bucket." . $/;


#---
sub byte_counter
{
  foreach my $key ( @{ $response->{keys} } ) {
    $bytes_used += $key->{size};
  }
}

sub commify {
   my $text = reverse $_[0];
   $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
   return scalar reverse $text;
}

Note that Amazon's answers come back in XML which is why the XML stuff is needed...

-Harold


Re^3: Module uses loads of CPU.. or is it me by redhotpenguin (Deacon) on Dec 11, 2007 at 05:48 UTC
Ah, I see: the bottleneck is in Net::Amazon::S3, which appears to use an XPath approach to get at the needed information. It looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work. If I were in your shoes, I might try writing an alternative to list_bucket that uses an approach other than XPath.

If you look in:

    http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

sub find calls new for each node it needs to find:

    sub find {
        my ($self, $xpath, $node) = @_;
        my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
        if ($type) {
            return $type->new(@params);
        }
        return undef;
    }

This is where the OO interface of XML::LibXML::XPathContext is your bottleneck. You could probably develop a faster interface using a streaming parser, but how much faster I don't know. You'll need some sort of optimization in there to get a faster result. Sorry I can't be of much more help.

UPDATE - you could also use some of the module's lower-level functions and attempt to parallelize the operation by having each CPU count x percent of the buckets. That's probably easier than speeding up the XML parser, and I think it is probably your best bet for a speedup of two times or more. You could fork off processes that each write their result to a temp file, and then add up all the results at the end. I think that may be your shortest course to victory.
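A minimal sketch of the fork-and-temp-file idea from the update above, using only core modules. The helper names partition and parallel_sum are hypothetical, and the key sizes are made up. In a real run each child would fetch and parse its own share of the listing (for example by key prefix, assuming your keys can be split that way), since adding up numbers is cheap and the XML parsing is where the CPU goes. The sketch only shows the fork, write, wait, and add-up pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical helper: deal the items round-robin into $n chunks.
sub partition {
    my ($n, @items) = @_;
    my @chunks = map { [] } 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $items[$_] for 0 .. $#items;
    return @chunks;
}

# Hypothetical helper: fork one child per chunk. Each child sums its
# chunk and writes the subtotal to its own temp file; the parent waits
# for every child and then adds the subtotals together.
sub parallel_sum {
    my ($workers, @sizes) = @_;
    my $dir    = tempdir( CLEANUP => 1 );
    my @chunks = partition($workers, @sizes);
    my @pids;

    for my $i (0 .. $#chunks) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: in a real program the S3 listing for this share
            # of the keys would happen here (assumption).
            my $subtotal = 0;
            $subtotal += $_ for @{ $chunks[$i] };
            open my $fh, '>', "$dir/part.$i" or POSIX::_exit(1);
            print {$fh} $subtotal;
            close $fh or POSIX::_exit(1);
            POSIX::_exit(0);    # skip END blocks and temp dir cleanup
        }
        push @pids, $pid;
    }

    my $total = 0;
    for my $i (0 .. $#pids) {
        waitpid($pids[$i], 0);
        die "worker $i failed\n" if $? != 0;
        open my $fh, '<', "$dir/part.$i" or die "open: $!";
        my $subtotal = <$fh>;
        close $fh;
        $total += $subtotal;
    }
    return $total;
}

my @sizes = (100, 250, 4096, 12, 1_000_000);    # made-up key sizes
print parallel_sum(2, @sizes), " total bytes\n";
```

The children leave through POSIX::_exit so that they neither run the parent's cleanup code nor flush a copy of the parent's output buffer. Whether this actually gets you an Nx speedup depends on how evenly the listing work can be divided.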
Re^4: Module uses loads of CPU.. or is it me by hsinclai (Deacon) on Dec 11, 2007 at 13:34 UTC
Wow, thanks for digging this deep to find the problem! That is awesome. And thanks also for the suggestions, though writing a new XML parsing tool seems a little much (if not a little bit daunting too :), and not knowing what the actual payoff will be makes me wonder if it's worth it in this case. Now I wonder whether this exact issue has been encountered in other XML applications, and if so, how it was improved.

Thanks again,
-H
Re^5: Module uses loads of CPU.. or is it me by redhotpenguin (Deacon) on Dec 11, 2007 at 17:22 UTC
Well, I think you should take a serious look at some of the lower-level methods in Net::Amazon::S3 and try to develop a parallelized application. It isn't likely that you will be able to double the efficiency of the parser, but I think with a few hours of hacking you could get a parallelized version of your program that gives you Nx speedups.
Re^6: Module uses loads of CPU.. or is it me by hsinclai (Deacon) on Dec 11, 2007 at 18:58 UTC

