Spreadsheet::XLSX memory and speed

runrig has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to see if Spreadsheet::XLSX can be made to not consume so much memory, and I've implemented the CellHandler and NotSetCell attributes from Spreadsheet::ParseExcel, and this helps somewhat, but part of the problem is the extraction of each file in the zip archive to an in memory variable (an xlsx file is a zip archive of many xml files). E.g., one of the worksheet xml files is about 8MB, and the code that parses it like so:

foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
 ...
[download]

$member_sheet is an Archive::Zip object for the worksheet xml file and contents() returns the entire contents of the file. Then the entire file is parsed into an array of tags and text that the foreach loop processes. In trying to save memory, I first was just trying to see if I could process the contents with a while loop like so:

my $buffer = $member_sheet -> contents();
#while ($buffer =~ /\G(<[^<]*>)/scg or $buffer =~ /\G([^<]*(?=<))/scg)
+ {
while ($buffer =~ /(<[^>]+>|[^<]+(?=<))/sg) {
    $_ = $1;
...
[download]

While this seems to work, it's about 100 times slower. I don't know for sure why it's 100 times slower, but I've tried to make a benchmark that shows it should only be about 50% slower:

use Benchmark qw(cmpthese);

open(my $fh, "<", 'sheet3.xml') or die "Err: $!";
my $str = do { local $/; <$fh>};

cmpthese(-10, {

  FOR   => sub {
    pos($str) = 0;
    for ( $str =~ /(\<.*?\/?\>|.*?(?=\<))/g ) {
      #print "$_\n";
    }
  },
  WHILE_1 => sub {
    pos($str) = 0;
    while ( $str =~ /\G(<.*?>|.*?(?=<))/scg ) {
      $_ = $1;
      #print "$_\n";
    }
  },
  WHILE_2 => sub {
    pos($str) = 0;
    while ( $str =~ /\G(<.*?>)/scg or $str =~ /\G(.*?(?=<))/scg ) {
      $_ = $1;
      #print "$_\n";
    }
  },
  WHILE_3 => sub {
    pos($str) = 0;
    while ( $str =~ /\G(<[^>]*>)/scg or $str =~ /\G([^<]*(?=<))/scg ) 
+{
      $_ = $1;
      #print "$_\n";
    }
  },
});

# Results:
        s/iter     FOR WHILE_3 WHILE_2 WHILE_1
FOR       1.66      --    -27%    -31%    -35%
WHILE_3   1.21     37%      --     -6%    -11%
WHILE_2   1.14     46%      6%      --     -5%
WHILE_1   1.08     53%     12%      5%      --
[download]

Is there something wrong with my benchmark (or just something wrong with trying to benchmark this)? Something else going on in Spreadsheet::XLSX? Anyone with enough tuits to look at or comment on this?

TIA for any insights

Comment on Spreadsheet::XLSX memory and speed Select or Download Code

Replies are listed 'Best First'.

Re: Spreadsheet::XLSX memory and speed
by jmcnamara (Monsignor) on Jun 08, 2012 at 22:49 UTC

I'm working on a module called Excel::Reader::XLSX that aims to parse XLSX files without the memory overhead of Spreadsheet::XLSX (which is inherited from the design of Spreadsheet::ParseExcel).

It isn't CPAN quality yet but it reads large XLSX files with a small and constant memory usage (it holds only one row of data in memory at a time). It also runs at a comparable speed. *

* I'm using XML::LibXML::Reader for parsing but I had to profile and optimise the initial code heavily to get to within 5-10% of the regex parsing in Spreadsheet::XLSX.

--
John.

[reply]


Perl: the Markov chain saw
	PerlMonks