Re: Out of Memory

by vkon (Curate)
on May 05, 2011 at 07:35 UTC ( [id://903085] )


in reply to Out of Memory

Well, looking at the Spreadsheet::XLSX module reveals that it uses regular expressions to parse the XLSX, for example:

    $mstr =~ s/<t\/>/<t><\/t>/gsm;    # this handles an empty t tag in the xml: <t/>

and this is widely accepted as a bad approach.

Actually, such regular expressions applied to huge strings are a common cause of out-of-memory errors.
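As a quick check before changing anything, it helps to see how large the unpacked worksheet XML really is, since the module pulls each zip member into one scalar before running its regexes over it. Here is a minimal sketch using Archive::Zip (which Spreadsheet::XLSX already uses internally); 'workbook.xlsx' is just a placeholder path:

    use strict;
    use warnings;
    use Archive::Zip qw( :ERROR_CODES );

    # An .xlsx file is just a zip archive; list each member and its
    # uncompressed size to see how big the XML gets once unpacked.
    my $zip = Archive::Zip->new();
    $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

    for my $member ($zip->members) {
        printf "%-40s %10d bytes unpacked\n",
            $member->fileName, $member->uncompressedSize;
    }

If the sheet members come out in the hundreds of megabytes, any whole-string approach is going to hurt.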

I cannot think of any better advice than patching this module and then proposing the changes to the author.

Regards,
Vadim.

Re^2: Out of Memory
by Corion (Patriarch) on May 05, 2011 at 07:38 UTC

    While parsing general XML with regular expressions can be fragile, parsing very specific XML can work well with regular expressions. But as the parser still seems to generate the complete data structure in memory instead of nibbling at columns and calling a callback whenever data is available, an event-based rewrite is still necessary.
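    To make that concrete, here is a minimal sketch of what one event-based pass over a worksheet could look like. This is not what Spreadsheet::XLSX currently does; it assumes XML::Twig and Archive::Zip are available, that the sheet of interest is xl/worksheets/sheet1.xml, and it deliberately ignores shared strings and styles:

        use strict;
        use warnings;
        use Archive::Zip qw( :ERROR_CODES );
        use File::Temp   qw( tempfile );
        use XML::Twig;

        my $zip = Archive::Zip->new();
        $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

        # Extract the sheet to a temp file so the parser can stream it
        # instead of holding the whole XML in one Perl scalar.
        my ($tmpfh, $tmp) = tempfile( UNLINK => 1 );
        my $member = $zip->memberNamed('xl/worksheets/sheet1.xml')
            or die "no such sheet member\n";
        $member->extractToFileNamed($tmp) == AZ_OK or die "extract failed\n";

        my $twig = XML::Twig->new(
            twig_handlers => {
                row => sub {
                    my ($t, $row) = @_;
                    # Collect the raw <v> value of every cell in this row ...
                    my @values = map { $_->first_child_text('v') } $row->children('c');
                    print join(',', @values), "\n";   # ... and emit it immediately
                    $t->purge;                        # free everything parsed so far
                },
            },
        );
        $twig->parsefile($tmp);

    Because each <row> is purged as soon as its callback has run, memory stays roughly constant no matter how many rows the sheet has.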

      I agree.

      Let me share my experience with the kind of problem the OP raised.
      At work, when I need some transformation, I often write a small, quick program that slurps the entire file and uses a regexp to produce the result.
      This works, and I use such a program from time to time.
      Then, one day, the program gets fed a file of tens or hundreds of megabytes (no longer rare nowadays), and I go back to my initial program and optimize it so it can process such files.

      Actually, I think it is quite a common situation.

      There's no need to write a very optimal and highly structured program from the very beginning - a "dirty hack" suffices as a first step, and incremental improvements can be made later if needed; the point being that they may never be needed :) (a small sketch of that progression follows below).
      Fortunately, Perl is powerful enough to allow such an approach.
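      As a generic illustration of that progression (nothing XLSX-specific), the same clean-up can be written first as a slurp-everything quick hack and later as a line-by-line version once the inputs outgrow memory; the substitution here is just an arbitrary example:

          use strict;
          use warnings;

          # Step 1: the quick hack - slurp the whole file and transform it in one go.
          # Fine for small inputs, painful once files reach hundreds of megabytes.
          sub transform_slurp {
              my ($in, $out) = @_;
              open my $ifh, '<', $in  or die "open $in: $!";
              open my $ofh, '>', $out or die "open $out: $!";
              my $data = do { local $/; <$ifh> };
              $data =~ s/<t\/>/<t><\/t>/g;    # applied to everything at once
              print {$ofh} $data;
          }

          # Step 2: the incremental improvement - process one line at a time,
          # so memory use stays flat no matter how large the input grows.
          sub transform_stream {
              my ($in, $out) = @_;
              open my $ifh, '<', $in  or die "open $in: $!";
              open my $ofh, '>', $out or die "open $out: $!";
              while (my $line = <$ifh>) {
                  $line =~ s/<t\/>/<t><\/t>/g;
                  print {$ofh} $line;
              }
          }

      The streaming version only works when the pattern never spans a line boundary, which is exactly the kind of constraint one discovers during the "incremental improvement" step.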

Re^2: Out of Memory
by ETLTCHFIG (Novice) on May 05, 2011 at 18:16 UTC

    OK - I am looking at Spreadsheet::XLSX to see if I can make changes to remove the stuff I don't need, since all I am trying to do in the production code is read the cell values and write them out to a CSV file.

    Can I eliminate all that excel style processing?
    my $member_styles = $self->{zip}->memberNamed('xl/styles.xml');
    my @styles = ();
    my %style_info = ();
    if ($member_styles) {
        foreach my $t ($member_styles->contents =~ /xf\ numFmtId="([^"]*)"(?!.*\/cellStyleXfs)/gsm) { #"
            # $t = $converter->convert($t) if $converter;
            push @styles, $t;
        }
        my $default = $1 || '';
        foreach my $t1 (@styles) {
            $member_styles->contents =~ /numFmtId="$t1" formatCode="([^"]*)/;
            my $formatCode = $1 || '';
            if ($formatCode eq $default || not($formatCode)) {
                if    ($t1 == 9 || $t1 == 10) { $formatCode = "0.00000%"; }
                elsif ($t1 == 14)             { $formatCode = "m-d-yy"; }
                else                          { $formatCode = ""; }
            }
            $style_info{$t1} = $formatCode;
            $default = $1 || '';
        }
    }

    Also, I am considering reducing the cell object size by removing the Format and Type hash keys. So change:

    my $cell = Spreadsheet::ParseExcel::Cell->new(
        Val    => $v,
        Format => $thisstyle,
        Type   => $type,
    );

    to

    my $cell = Spreadsheet::ParseExcel::Cell->new( Val => $v );
    Please let me know what you think.
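    If you want to see whether dropping Format and Type actually buys you much, Devel::Size can compare the two cell objects directly. A rough sketch, assuming Devel::Size and Spreadsheet::ParseExcel are installed; $v, $type and $thisstyle are stand-ins for whatever the real code holds at that point:

        use strict;
        use warnings;
        use Devel::Size qw( total_size );
        use Spreadsheet::ParseExcel;

        # Stand-in values; in the module these come from the parsed sheet.
        my ($v, $type, $thisstyle) = ('some cell text', 'Text', 'General');

        my $full = Spreadsheet::ParseExcel::Cell->new(
            Val    => $v,
            Format => $thisstyle,
            Type   => $type,
        );
        my $slim = Spreadsheet::ParseExcel::Cell->new( Val => $v );

        printf "full cell: %d bytes, slim cell: %d bytes\n",
            total_size($full), total_size($slim);

    Multiplying the per-cell difference by rows times columns gives a feel for the total saving, though it will not help much if the parser still builds every cell up front.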
      Here is how I see it.

      Given that all you need is to get the XLSX data out - probably into some kind of DB values - then just don't use this CPAN module; extract your data yourself,
      and stay centered on your own application's needs.

      But before doing that, drop a letter to the CPAN author - he may well be responsive and provide you with a solution soon.
      Otherwise, just reuse his code for unzipping the content, and then use your own regular expressions.

      Better yet, feed the resulting XML string into a properly constructed XPath expression, and hand that XPath expression to XML::LibXML - it is very efficient at XPath, though other modules that handle XPath would also suffice.
      (I personally have good experience with the one mentioned, and TIMTOWTDI.)
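      A minimal sketch of that route, assuming the sheet of interest is xl/worksheets/sheet1.xml and registering the spreadsheetml namespace so the XPath matches; it prints raw cell values only:

          use strict;
          use warnings;
          use Archive::Zip qw( :ERROR_CODES );
          use XML::LibXML;
          use XML::LibXML::XPathContext;

          my $zip = Archive::Zip->new();
          $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";
          my $sheet_xml = $zip->memberNamed('xl/worksheets/sheet1.xml')->contents;

          my $doc = XML::LibXML->load_xml( string => $sheet_xml );
          my $xpc = XML::LibXML::XPathContext->new($doc);
          $xpc->registerNs( s => 'http://schemas.openxmlformats.org/spreadsheetml/2006/main' );

          # Walk rows and cells via XPath; <v> holds the raw stored value
          # (for shared-string cells it is an index into xl/sharedStrings.xml).
          for my $row ( $xpc->findnodes('//s:row') ) {
              my @values = map { $_->textContent } $xpc->findnodes( 's:c/s:v', $row );
              print join( ',', @values ), "\n";
          }

      Keep in mind this still builds the whole DOM in memory; for a sheet with a million rows the pull-parser interface (XML::LibXML::Reader) is the lighter option.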

      This would be the best way out of this situation, as I see it.

      Is 40 MB the size of the zipped XLSX, or the size after unpacking?

        40 MB is the zipped XLSX file size - thanks for the guidance, vkon! I am trying to take the code in the new() method of the Spreadsheet::XLSX CPAN module and stick my functionality in there, so I don't have to store anything in memory.
        OK - I THINK I GOT THE PROBLEM. In XLSX.pm, the new() method issues the following read:

            foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {

        This is trying to cache a whole worksheet at a time; one of the worksheets in the XLSX file that is throwing the "Out of Memory" has a million rows. Is there a way I can change this to perform line-by-line reading?
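        Not literally line by line - the sheet XML is one long stream of tags - but it can be read node by node with a pull parser, so only the current <row> is ever expanded in memory. A rough sketch with XML::LibXML::Reader, assuming the sheet is first extracted from the zip to a temporary file; wiring this into XLSX.pm's new() is left open:

            use strict;
            use warnings;
            use Archive::Zip qw( :ERROR_CODES );
            use File::Temp   qw( tempfile );
            use XML::LibXML::Reader;

            my $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

            my $zip = Archive::Zip->new();
            $zip->read('workbook.xlsx') == AZ_OK or die "cannot read zip\n";

            # Write the member to a temp file so the reader can stream from disk
            # instead of the whole sheet sitting in one Perl scalar.
            my ($tmpfh, $tmp) = tempfile( UNLINK => 1 );
            $zip->memberNamed('xl/worksheets/sheet1.xml')->extractToFileNamed($tmp) == AZ_OK
                or die "extract failed\n";

            my $reader = XML::LibXML::Reader->new( location => $tmp )
                or die "cannot open reader\n";

            # Jump from <row> to <row>; only the current row is turned into a DOM node.
            while ( $reader->nextElement( 'row', $ns ) ) {
                my $row = $reader->copyCurrentNode(1);      # deep copy of this row only
                my @values = map { $_->textContent }
                             map { $_->getChildrenByLocalName('v') }
                             $row->getChildrenByLocalName('c');
                print join( ',', @values ), "\n";
            }

        Shared-string cells will come out as numeric indexes with this approach, so xl/sharedStrings.xml would still need a (much smaller) lookup pass of its own.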
