Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Spreadsheet::XLSX memory and speed

by runrig (Abbot)
on Jun 08, 2012 at 19:24 UTC ( [id://975226]=perlquestion: print w/replies, xml ) Need Help??

runrig has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to see if Spreadsheet::XLSX can be made to not consume so much memory, and I've implemented the CellHandler and NotSetCell attributes from Spreadsheet::ParseExcel, and this helps somewhat, but part of the problem is the extraction of each file in the zip archive to an in memory variable (an xlsx file is a zip archive of many xml files). E.g., one of the worksheet xml files is about 8MB, and the code that parses it like so:
foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) { ...

$member_sheet is an Archive::Zip object for the worksheet xml file and contents() returns the entire contents of the file. Then the entire file is parsed into an array of tags and text that the foreach loop processes. In trying to save memory, I first was just trying to see if I could process the contents with a while loop like so:

my $buffer = $member_sheet -> contents(); #while ($buffer =~ /\G(<[^<]*>)/scg or $buffer =~ /\G([^<]*(?=<))/scg) + { while ($buffer =~ /(<[^>]+>|[^<]+(?=<))/sg) { $_ = $1; ...

While this seems to work, it's about 100 times slower. I don't know for sure why it's 100 times slower, but I've tried to make a benchmark that shows it should only be about 50% slower:

use Benchmark qw(cmpthese); open(my $fh, "<", 'sheet3.xml') or die "Err: $!"; my $str = do { local $/; <$fh>}; cmpthese(-10, { FOR => sub { pos($str) = 0; for ( $str =~ /(\<.*?\/?\>|.*?(?=\<))/g ) { #print "$_\n"; } }, WHILE_1 => sub { pos($str) = 0; while ( $str =~ /\G(<.*?>|.*?(?=<))/scg ) { $_ = $1; #print "$_\n"; } }, WHILE_2 => sub { pos($str) = 0; while ( $str =~ /\G(<.*?>)/scg or $str =~ /\G(.*?(?=<))/scg ) { $_ = $1; #print "$_\n"; } }, WHILE_3 => sub { pos($str) = 0; while ( $str =~ /\G(<[^>]*>)/scg or $str =~ /\G([^<]*(?=<))/scg ) +{ $_ = $1; #print "$_\n"; } }, }); # Results: s/iter FOR WHILE_3 WHILE_2 WHILE_1 FOR 1.66 -- -27% -31% -35% WHILE_3 1.21 37% -- -6% -11% WHILE_2 1.14 46% 6% -- -5% WHILE_1 1.08 53% 12% 5% --

Is there something wrong with my benchmark (or just something wrong with trying to benchmark this)? Something else going on in Spreadsheet::XLSX? Anyone with enough tuits to look at or comment on this?

TIA for any insights

Replies are listed 'Best First'.
Re: Spreadsheet::XLSX memory and speed
by jmcnamara (Monsignor) on Jun 08, 2012 at 22:49 UTC

    I'm working on a module called Excel::Reader::XLSX that aims to parse XLSX files without the memory overhead of Spreadsheet::XLSX (which is inherited from the design of Spreadsheet::ParseExcel).

    It isn't CPAN quality yet but it reads large XLSX files with a small and constant memory usage (it holds only one row of data in memory at a time). It also runs at a comparable speed. *

    I haven't settled on the final API yet and it is quite limited in functionality but if you are, or anyone else is, interested have a look on GitHub.

    * I'm using XML::LibXML::Reader for parsing but I had to profile and optimise the initial code heavily to get to within 5-10% of the regex parsing in Spreadsheet::XLSX.

    --
    John.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://975226]
Approved by stevieb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (7)
As of 2024-04-19 10:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found