Re: Out of Memory

by vkon (Curate)
on May 05, 2011 at 07:35 UTC ( [id://903085] )


in reply to Out of Memory

Well, looking at the Spreadsheet::XLSX module reveals that it uses regular expressions to parse the XLSX, for example:

    $mstr =~ s/<t\/>/<t><\/t>/gsm;    # this handles an empty t tag in the xml: <t/>

and this is widely accepted as a bad approach.

Actually, such regular expressions applied to huge strings are a common cause of out-of-memory errors.
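As a quick check before changing anything, it helps to see how large the unpacked worksheet XML really is, since the module pulls each zip member into one scalar before running its regexes over it. Here is a minimal sketch using Archive::Zip (which Spreadsheet::XLSX already uses internally); 'workbook.xlsx' is just a placeholder path:

    use strict;
    use warnings;
    use Archive::Zip qw( :ERROR_CODES );

    # An .xlsx file is just a zip archive; list each member and its
    # uncompressed size to see how big the XML gets once unpacked.
    my $zip = Archive::Zip->new();
    $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

    for my $member ($zip->members) {
        printf "%-40s %10d bytes unpacked\n",
            $member->fileName, $member->uncompressedSize;
    }

If the sheet members come out in the hundreds of megabytes, any whole-string approach is going to hurt.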

I cannot think of any better advice than patching this module and then proposing the changes to the author.

Regards,
Vadim.

Re^2: Out of Memory
by Corion (Patriarch) on May 05, 2011 at 07:38 UTC

    While parsing general XML with regular expressions can be fragile, parsing very specific XML can work well with regular expressions. But as the parser still seems to generate the complete data structure in memory instead of nibbling at columns and calling a callback whenever data is available, an event-based rewrite is still necessary.
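    To make that concrete, here is a minimal sketch of what one event-based pass over a worksheet could look like. This is not what Spreadsheet::XLSX currently does; it assumes XML::Twig and Archive::Zip are available, that the sheet of interest is xl/worksheets/sheet1.xml, and it deliberately ignores shared strings and styles:

        use strict;
        use warnings;
        use Archive::Zip qw( :ERROR_CODES );
        use File::Temp   qw( tempfile );
        use XML::Twig;

        my $zip = Archive::Zip->new();
        $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

        # Extract the sheet to a temp file so the parser can stream it
        # instead of holding the whole XML in one Perl scalar.
        my ($tmpfh, $tmp) = tempfile( UNLINK => 1 );
        my $member = $zip->memberNamed('xl/worksheets/sheet1.xml')
            or die "no such sheet member\n";
        $member->extractToFileNamed($tmp) == AZ_OK or die "extract failed\n";

        my $twig = XML::Twig->new(
            twig_handlers => {
                row => sub {
                    my ($t, $row) = @_;
                    # Collect the raw <v> value of every cell in this row ...
                    my @values = map { $_->first_child_text('v') } $row->children('c');
                    print join(',', @values), "\n";   # ... and emit it immediately
                    $t->purge;                        # free everything parsed so far
                },
            },
        );
        $twig->parsefile($tmp);

    Because each <row> is purged as soon as its callback has run, memory stays roughly constant no matter how many rows the sheet has.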

      I agree.

      Let me share my experience with the kind of problem the OP raised.
      At work, when I need some transformation, I often write a small, quick program that slurps the entire file and uses a regexp to produce the result.
      This works, and I use such a program from time to time.
      Then, one day, the program gets fed a file of tens or hundreds of megabytes (no longer rare nowadays), and I go back to my initial program and optimize it so it can process such files.

      Actually, I think it is quite a common situation.

      There's no need to write a very optimal and highly structured program from the very beginning - a "dirty hack" suffices as a first step, and incremental improvements can be made later if needed; the point being that they may never be needed :) (a small sketch of that progression follows below).
      Fortunately, Perl is powerful enough to allow such an approach.
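      As a generic illustration of that progression (nothing XLSX-specific), the same clean-up can be written first as a slurp-everything quick hack and later as a line-by-line version once the inputs outgrow memory; the substitution here is just an arbitrary example:

          use strict;
          use warnings;

          # Step 1: the quick hack - slurp the whole file and transform it in one go.
          # Fine for small inputs, painful once files reach hundreds of megabytes.
          sub transform_slurp {
              my ($in, $out) = @_;
              open my $ifh, '<', $in  or die "open $in: $!";
              open my $ofh, '>', $out or die "open $out: $!";
              my $data = do { local $/; <$ifh> };
              $data =~ s/<t\/>/<t><\/t>/g;    # applied to everything at once
              print {$ofh} $data;
          }

          # Step 2: the incremental improvement - process one line at a time,
          # so memory use stays flat no matter how large the input grows.
          sub transform_stream {
              my ($in, $out) = @_;
              open my $ifh, '<', $in  or die "open $in: $!";
              open my $ofh, '>', $out or die "open $out: $!";
              while (my $line = <$ifh>) {
                  $line =~ s/<t\/>/<t><\/t>/g;
                  print {$ofh} $line;
              }
          }

      The streaming version only works when the pattern never spans a line boundary, which is exactly the kind of constraint one discovers during the "incremental improvement" step.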

Re^2: Out of Memory
by ETLTCHFIG (Novice) on May 05, 2011 at 18:16 UTC

    OK - I am looking at Spreadsheet::XLSX to see if I can make changes to remove the stuff I don't need, since all I am trying to do in the production code is read the cell values and write them out to a CSV file.

    Can I eliminate all that excel style processing?
    my $member_styles = $self->{zip}->memberNamed('xl/styles.xml');
    my @styles = ();
    my %style_info = ();
    if ($member_styles) {
        foreach my $t ($member_styles->contents =~ /xf\ numFmtId="([^"]*)"(?!.*\/cellStyleXfs)/gsm) { #"
            # $t = $converter->convert($t) if $converter;
            push @styles, $t;
        }
        my $default = $1 || '';
        foreach my $t1 (@styles) {
            $member_styles->contents =~ /numFmtId="$t1" formatCode="([^"]*)/;
            my $formatCode = $1 || '';
            if ($formatCode eq $default || not($formatCode)) {
                if    ($t1 == 9 || $t1 == 10) { $formatCode = "0.00000%"; }
                elsif ($t1 == 14)             { $formatCode = "m-d-yy"; }
                else                          { $formatCode = ""; }
            }
            $style_info{$t1} = $formatCode;
            $default = $1 || '';
        }
    }

    Also, I am considering reducing the cell object size by removing the Format and Type hash keys. So change:

    my $cell = Spreadsheet::ParseExcel::Cell->new(
        Val    => $v,
        Format => $thisstyle,
        Type   => $type,
    );

    to

    my $cell = Spreadsheet::ParseExcel::Cell->new( Val => $v );
    Please let me know what you think.
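    If you want to see whether dropping Format and Type actually buys you much, Devel::Size can compare the two cell objects directly. A rough sketch, assuming Devel::Size and Spreadsheet::ParseExcel are installed; $v, $type and $thisstyle are stand-ins for whatever the real code holds at that point:

        use strict;
        use warnings;
        use Devel::Size qw( total_size );
        use Spreadsheet::ParseExcel;

        # Stand-in values; in the module these come from the parsed sheet.
        my ($v, $type, $thisstyle) = ('some cell text', 'Text', 'General');

        my $full = Spreadsheet::ParseExcel::Cell->new(
            Val    => $v,
            Format => $thisstyle,
            Type   => $type,
        );
        my $slim = Spreadsheet::ParseExcel::Cell->new( Val => $v );

        printf "full cell: %d bytes, slim cell: %d bytes\n",
            total_size($full), total_size($slim);

    Multiplying the per-cell difference by rows times columns gives a feel for the total saving, though it will not help much if the parser still builds every cell up front.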
      Here is how I see it.

      Given that all you need is to get the XLSX data out - probably into some kind of DB values - then just don't use this CPAN module; extract your data yourself,
      and stay centered on your own application's needs.

      But before doing that, drop a letter to the CPAN author - he may well be responsive and provide you with a solution soon.
      Otherwise, just reuse his code for unzipping the content, and then use your own regular expressions.

      Better yet, feed the resulting XML string into a properly constructed XPath expression, and hand that XPath expression to XML::LibXML - it is very efficient at XPath, though other modules that handle XPath would also suffice.
      (I personally have good experience with the one mentioned, and TIMTOWTDI.)
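      A minimal sketch of that route, assuming the sheet of interest is xl/worksheets/sheet1.xml and registering the spreadsheetml namespace so the XPath matches; it prints raw cell values only:

          use strict;
          use warnings;
          use Archive::Zip qw( :ERROR_CODES );
          use XML::LibXML;
          use XML::LibXML::XPathContext;

          my $zip = Archive::Zip->new();
          $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";
          my $sheet_xml = $zip->memberNamed('xl/worksheets/sheet1.xml')->contents;

          my $doc = XML::LibXML->load_xml( string => $sheet_xml );
          my $xpc = XML::LibXML::XPathContext->new($doc);
          $xpc->registerNs( s => 'http://schemas.openxmlformats.org/spreadsheetml/2006/main' );

          # Walk rows and cells via XPath; <v> holds the raw stored value
          # (for shared-string cells it is an index into xl/sharedStrings.xml).
          for my $row ( $xpc->findnodes('//s:row') ) {
              my @values = map { $_->textContent } $xpc->findnodes( 's:c/s:v', $row );
              print join( ',', @values ), "\n";
          }

      Keep in mind this still builds the whole DOM in memory; for a sheet with a million rows the pull-parser interface (XML::LibXML::Reader) is the lighter option.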

      This would be the best way out of this situation, as I see it.

      Is 40 MB the size of the zipped XLSX, or the size after unpacking?

        40 MB is the zipped XLSX file size - thanks for the guidance, vkon! I am trying to take the code in the new() method of the Spreadsheet::XLSX CPAN module and stick my functionality in there, so I don't have to store anything in memory.
        OK - I THINK I GOT THE PROBLEM. In XLSX.pm, the new() method issues the following read:

            foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {

        This is trying to cache a whole worksheet at a time; one of the worksheets in the XLSX file that is throwing the "Out of Memory" has a million rows. Is there a way I can change this to perform line-by-line reading?
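        Not literally line by line - the sheet XML is one long stream of tags - but it can be read node by node with a pull parser, so only the current <row> is ever expanded in memory. A rough sketch with XML::LibXML::Reader, assuming the sheet is first extracted from the zip to a temporary file; wiring this into XLSX.pm's new() is left open:

            use strict;
            use warnings;
            use Archive::Zip qw( :ERROR_CODES );
            use File::Temp   qw( tempfile );
            use XML::LibXML::Reader;

            my $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

            my $zip = Archive::Zip->new();
            $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

            # Write the member to a temp file so the reader can stream from disk
            # instead of the whole sheet sitting in one Perl scalar.
            my ($tmpfh, $tmp) = tempfile( UNLINK => 1 );
            $zip->memberNamed('xl/worksheets/sheet1.xml')->extractToFileNamed($tmp) == AZ_OK
                or die "extract failed\n";

            my $reader = XML::LibXML::Reader->new( location => $tmp )
                or die "cannot open reader\n";

            # Jump from <row> to <row>; only the current row is turned into a DOM node.
            while ( $reader->nextElement( 'row', $ns ) ) {
                my $row = $reader->copyCurrentNode(1);      # deep copy of this row only
                my @values = map { $_->textContent }
                             map { $_->getChildrenByLocalName('v') }
                             $row->getChildrenByLocalName('c');
                print join( ',', @values ), "\n";
            }

        Shared-string cells will come out as numeric indexes with this approach, so xl/sharedStrings.xml would still need a (much smaller) lookup pass of its own.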
