
Re^2: Module uses loads of CPU.. or is it me

by hsinclai (Deacon)
on Dec 11, 2007 at 01:36 UTC ( [id://656292]=note )


in reply to Re: Module uses loads of CPU.. or is it me
in thread Module uses loads of CPU.. or is it me

My apologies, of course; here's the script. It's my first test, just to see how well the module worked. Any XML-related work is being done by Net::Amazon::S3 behind the scenes.

Maybe I'm missing something obvious (hope not)?

#!/usr/bin/perl
use strict;
use warnings;
use Net::Amazon::S3;

my $aws_access_key_id     = 'XXXXXXXXXXXXXXXXXXXX';
my $aws_secret_access_key = 'xxxxxxxxxxxxxxxxxxxx';
my $chosen_bucket         = $ARGV[0] || 'default_bucketname';
my $bytes_used            = 0;

my $s3 = Net::Amazon::S3->new(
    {
        aws_access_key_id     => $aws_access_key_id,
        aws_secret_access_key => $aws_secret_access_key
    }
);

my $bucket_now = $s3->bucket($chosen_bucket);

# list_all fetches the complete key listing for the bucket
my $response = $bucket_now->list_all
    or die $s3->err . ": " . $s3->errstr;

byte_counter();

# scalar @{...} is the number of keys; $#{...} would be one less
my $num_keys = commify( scalar @{ $response->{keys} } );
print $num_keys . " keys in bucket $chosen_bucket." . $/;

$bytes_used = commify($bytes_used);
print $bytes_used . " total bytes used in bucket $chosen_bucket." . $/;

#---

# Add up the size of every key in the listing
sub byte_counter {
    foreach my $key ( @{ $response->{keys} } ) {
        $bytes_used += $key->{size};
    }
}

# Insert thousands separators into a number
sub commify {
    my $text = reverse $_[0];
    $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
    return scalar reverse $text;
}


Note that Amazon's answers come back in XML which is why the XML stuff is needed...

-Harold

Replies are listed 'Best First'.
Re^3: Module uses loads of CPU.. or is it me
by redhotpenguin (Deacon) on Dec 11, 2007 at 05:48 UTC

    Ah I see, the bottleneck is in Net::Amazon::S3, which appears to be using an XPath approach to get at the needed information. Looks like list_all calls list_bucket_all, which calls list_bucket, which does the XPath dirty work.

    If I were in your shoes, I might try to write an alternative sub to list_bucket which uses an approach other than XPath. If you look in:

    http://search.cpan.org/src/PAJAS/XML-LibXML-1.65/lib/XML/LibXML/XPathContext.pm

    sub find calls new for each node it needs to find:

    sub find {
        my ($self, $xpath, $node) = @_;
        my ($type, @params) = $self->_guarded_find_call('_find', $xpath, $node);
        if ($type) {
            return $type->new(@params);
        }
        return undef;
    }

    This is where the OO interface of XML::LibXML::XPathContext is your bottleneck. You could probably develop a faster interface using a streaming parser, but how much faster I don't know. You'll need some sort of optimization in there to get a faster result. Sorry I can't be of much more help.
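    As a rough illustration of the streaming idea, here is a minimal sketch that sums sizes with XML::LibXML::Reader (a pull parser) instead of XPath. It is not how Net::Amazon::S3 works internally. The sample XML is made up to resemble a bucket listing, sum_sizes is a hypothetical helper name, and whether this is actually faster on a real bucket is untested.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants

# Made-up sample shaped like a bucket listing (an assumption, not real output)
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>default_bucketname</Name>
  <Contents><Key>a.txt</Key><Size>1200</Size></Contents>
  <Contents><Key>b.txt</Key><Size>34</Size></Contents>
  <Contents><Key>c.txt</Key><Size>5</Size></Contents>
</ListBucketResult>
XML

# Walk the document node by node and add up every <Size> element,
# without building a tree or creating one object per match.
sub sum_sizes {
    my ($xml_string) = @_;
    my $reader = XML::LibXML::Reader->new( string => $xml_string )
        or die "cannot create reader";

    my ( $keys, $bytes ) = ( 0, 0 );
    while ( $reader->read ) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        next unless $reader->localName eq 'Size';

        # Step onto the text node inside <Size>; this assumes the
        # element is never empty.
        $reader->read;
        $bytes += $reader->value;
        $keys++;
    }
    return ( $keys, $bytes );
}

my ( $keys, $bytes ) = sum_sizes($xml);
print "$keys keys, $bytes bytes\n";
```

    To use something like this against S3 you would still need the raw XML of each listing response, which means going below list_bucket.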

    UPDATE - you could also use some of the module's lower-level functions and attempt to parallelize the operation by having each CPU count x percent of the buckets. That's probably easier than speeding up the XML parser, and I think it is your best bet for a speedup of two times or more. You could fork off processes that each write their result to a temp file, and then add up all the results at the end. I think that may be your shortest course to victory.
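    The fork-and-temp-file pattern described above could be sketched roughly like this, using only core modules. The key list is a hypothetical stand-in and parallel_byte_count is a made-up name. In a real version each child would fetch and parse its own part of the listing (for instance by key prefix, which assumes the keys split up that way), since that is where the CPU time goes; summing sizes in parallel is only the skeleton.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical stand-in for the key list returned by the module
my @keys = map { +{ key => "file$_", size => $_ * 100 } } 1 .. 1000;

# Split the keys into slices, total each slice in a forked child that
# writes its result to a temp file, then add the partial totals.
sub parallel_byte_count {
    my ( $keys, $workers ) = @_;
    my $dir = tempdir( CLEANUP => 1 );
    my $per = int( ( @$keys + $workers - 1 ) / $workers ) || 1;

    my ( @pids, @files );
    for my $w ( 0 .. $workers - 1 ) {
        my $start = $w * $per;
        last if $start > $#$keys;
        my $end = $start + $per - 1;
        $end = $#$keys if $end > $#$keys;

        my $file = "$dir/$w.out";
        my $pid  = fork();
        die "fork failed: $!" unless defined $pid;

        if ( $pid == 0 ) {
            # Child: total this slice and write the result
            my $total = 0;
            $total += $_->{size} for @{$keys}[ $start .. $end ];
            open my $fh, '>', $file or die "open $file: $!";
            print {$fh} "$total\n";
            close $fh or die "close $file: $!";
            POSIX::_exit(0);    # leave without running parent cleanup code
        }
        push @pids,  $pid;
        push @files, $file;
    }

    # Parent: wait for every child, then add up the partial totals
    waitpid $_, 0 for @pids;
    my $grand = 0;
    for my $file (@files) {
        open my $fh, '<', $file or die "open $file: $!";
        chomp( my $n = <$fh> );
        $grand += $n;
    }
    return $grand;
}

print parallel_byte_count( \@keys, 4 ), "\n";
```

    The worker count of 4 is arbitrary; it would normally be matched to the number of CPUs available.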

      Wow thanks for digging this deep to find this problem! That is awesome.

      And thanks also for the suggestions, though writing a new XML parsing tool seems a little much (if not a little bit daunting too:), and not knowing what the actual return will be makes me wonder if it's worth it in this case.

      Now I wonder if this exact issue has not been encountered in other XML applications, and if so, how it was improved.

      Thanks again,

      -H

        Well, I think you should take a serious look at some of the lower-level methods in Net::Amazon::S3 and try to develop a parallelized application. It isn't likely that you will be able to double the efficiency of the parser, but I think with a few hours of hacking you could get a parallelized version of your program that gives you Nx speedups.
