Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Out of memory and While replacements with excel XLSX application

by Weinstar (Initiate)
on Oct 08, 2012 at 03:50 UTC ( [id://997733]=perlquestion: print w/replies, xml ) Need Help??

Weinstar has asked for the wisdom of the Perl Monks concerning the following question:

Hello. For the following I feel I should lead with a quick disclaimer. The code is my first attempt at a Perl script and a non-html script in general, so it might read like the first attempt at a script - however the script technically works.

Furthermore I don't deal with coding in my daily life and studies and therefore do not know if the code conforms to all existing standards - I ran it through perl::tidy, used strict and warnings (no return) but that is about it - please advice if it is unreadable in some way.

Anyhow, The problem I am facing is one of memory - the script returns "out of memory". I believe it is related to http://learn.perl.org/faq/perlfaq3.html#How-can-I-make-my-Perl-program-take-less-memory- but cannot figure out how to implement without changing the results of the run.

The case is as follows: I need to go over 20 seperate excel files (.xls) and grap values based on certain patterns and then copy them into new .XLSX sheets. The total amount of rows is close to 1 million.

The script worked fine when I tried to write to .xls but it hit the limit of available rows. So i changed it to ::XLSX and now it runs out of memory. The strange thing is that it does not seem to be in the writing of excel part that the memory expires, but around the time it is done with grabing values, or maybe when I copy one of the returned array at @righ = @right (done because the original is destroyed in a splice). The system I work on has 8GB of ram, the error happens at around a total memory use of 4GB of ram, but I can see from test on a single sheet, that the final part (when all values are in the arrays) takes up a lot of memory. I have a feeling that this memory is excessive and can be done in better way. The script is on around 2GB of memory at the time of error. four of the eight gb were installed after the original run to see if this helped.

My question is - how do I solve this? I understand that there may be room for loads of improvements and I will gladly listen to every kind of response, but my main concern goes to the apparent memory shortage. I have a feeling I am copying the data into memory several times, but I do not know how to avoid this.

If I have overlooked an already existing answer a reference would be appreciated and an apology returned, but please understand that any deviation from my case will quickly complicate things for me

#!C:\strawberry\perl\bin\perl ####Load Packages use strict; use warnings; use Win32::OLE qw(in with); use Win32::OLE::Const 'Microsoft Excel'; use Win32::OLE::Variant; use Win32::OLE::NLS qw(:LOCALE :DATE); use List::MoreUtils qw(natatime); use Excel::Writer::XLSX; ############################################################ ####Initialize Excel Object and Count Files to Process $Win32::OLE::Warn = 3; #die on errors... #get already active Excel application or open new my $Excel = Win32::OLE->GetActiveObject('Excel.Application') || Win32::OLE->new( 'Excel.Application', 'Quit' ); #Get the file names of the production files and prints a status of the + results my $ProdDirect = 'C:/Users/McLovin/Documents/Thesis/Data/'; opendir DH, $ProdDirect or die "Cannot open $ProdDirect: $!"; my @files = grep { !-d } readdir DH; print "@files\n"; closedir DH; my $NumberofFiles = @files; print "Number of files is: $NumberofFiles\n"; ###################################################################### +########## ################################################################# ################Start Processing the Data######################## ################################################################# my $counti; #used to count the excelfiles my $usefile; #The file that is being used my @recID; #an array of ID's for the worksheets my $ID; #The actual ID being pushed to the array my $val; #A test value that is used to test whether a row should b +e included my @right ; #The first of the arrays that will be filled during the process an +d will be further disected later my @prod; #The second array, same as above my @legal; #The third ####declare a lot of variables used below when getting information fro +m the excel sheets my $c1; my $c4; my $c5; my $c6; my $c7; my $c8; my $c9; my $c10; my $c11; my $c12; my $c13; my $c14; my $c15; my $c17; my $c18; my $c20; my $c21; my $c22; my $c23; my $c24; my $c25; my $c26; my $c27; my $c32; my $c33; my $c34; my $c35; my $c36; my $c37; my $c38; my $c39; #####Loop over all the files for ( $counti = 0 ; $counti < $NumberofFiles ; $counti++ ) { ##### open Excel file -- This is the input data files print "Working on file $counti of $NumberofFiles\n"; $usefile = $files[$counti]; ###This points to the folder where the excel files are placed my $workfile = "C:/Users/McLovin/Documents/Thesis/Datatal/" . "$us +efile"; print "$workfile\n"; my $Book = $Excel->Workbooks->Open($workfile); #####Work on each sheet in the workbook #####This is the number of worksheets in the workbook my $sheetcnt = $Book->Worksheets->Count(); foreach my $r (1) { my $Sheet = $Book->Worksheets($r); $ID = $Sheet->{Name}; push @recID, $ID; print "Worksheet name is $Sheet->{Name}\n"; ##work on each row and column my $Tot_Rows = $Sheet->UsedRange->Rows->{'Count'}; my $Tot_Cols = $Sheet->UsedRange->Columns->{'Count'}; #Extract the necessary information from the rows and columns and place +s them in three arrays @right @legal and @prod foreach my $row ( 1 .. $Tot_Rows ) { foreach my $col (1) { # skip empty cells next unless defined $Sheet->Cells( $row, $col )->{'Val +ue'}; $val = $Sheet->Cells( $row, $col )->{'Value'}; if ( $val =~ /^\d{10}$/ ) { #Gets column one information foreach my $col1 (1) { next unless defined $Sheet->Cells( $row, $col1 )->{'Valu +e'}; $c1 = $Sheet->Cells( $row, $col1 )->{'Value'}; push @right, $c1; push @legal, $c1; push @prod, $c1; #Gets region from column 2 foreach my $col4 (4) { next unless defined $Sheet->Cells( $row, $col4 )->{' +Value'}; $c4 = $Sheet->Cells( $row, $col4 )->{'Valu +e'}; push @right, $c4; push @legal, $c4; push @prod, $c4; } #Gets column 3 information foreach my $col5 (5) { next unless defined $Sheet->Cells( $row, $col5 )->{' +Value'}; $c5 = $Sheet->Cells( $row, $col5 )->{'Valu +e'}; push @right, $c5; push @legal, $c5; push @prod, $c5; } #Get the description that needs to be processe +d later foreach my $col6 (6) { next unless defined $Sheet->Cells( $row, $col6 )->{' +Value'}; $c6 = $Sheet->Cells( $row, $col6 )->{'Valu +e'}; push @legal, $c6; } #Gets area foreach my $col7 (7) { next unless defined $Sheet->Cells( $row, $col7 )->{' +Value'}; $c7 = $Sheet->Cells( $row, $col7 )->{'Valu +e'}; push @right, $c7; } #Gets column 8 information foreach my $col8 (8) { next unless defined $Sheet->Cells( $row, $col8 )->{' +Value'}; $c8 = $Sheet->Cells( $row, $col8 )->{'Valu +e'}; push @right, $c8; } #Gets column 9 foreach my $col9 (9) { next unless defined $Sheet->Cells( $row, $col9 )->{' +Value'}; $c9 = $Sheet->Cells( $row, $col9 )->{'Valu +e'}; push @right, $c9; } #and so on.... foreach my $col10 (10) { next unless defined $Sheet->Cells( $row, $col10 )->{ +'Value'}; $c10 = $Sheet->Cells( $row, $col10 )->{'Va +lue'}; push @right, $c10; push @legal, $c10; } #same foreach my $col11 (11) { next unless defined $Sheet->Cells( $row, $col11 )->{ +'Value'}; $c11 = $Sheet->Cells( $row, $col11 )->{'Va +lue'}; push @right, $c11; push @legal, $c11; } #same foreach my $col12 (12) { $c12 = $Sheet->Cells( $row, $col12 )->{'Va +lue'}; push @right, $c12; push @prod, $c12; } #same foreach my $col13 (13) { $c13 = $Sheet->Cells( $row, $col13 )->{'Va +lue'}; push @prod, $c13; } #same foreach my $col14 (14) { $c14 = $Sheet->Cells( $row, $col14 )->{'Va +lue'}; push @prod, $c14; } #same foreach my $col15 (15) { $c15 = $Sheet->Cells( $row, $col15 )->{'Va +lue'}; push @prod, $c15; } #same foreach my $col17 (17) { $c17 = $Sheet->Cells( $row, $col17 )->{'Va +lue'}; push @prod, $c17; } #same foreach my $col18 (18) { $c18 = $Sheet->Cells( $row, $col18 )->{'Va +lue'}; push @prod, $c18; } #same foreach my $col20 (20) { $c20 = $Sheet->Cells( $row, $col20 )->{'Va +lue'}; push @prod, $c20; } #same foreach my $col21 (21) { $c21 = $Sheet->Cells( $row, $col21 )->{'Va +lue'}; push @prod, $c21; } #same foreach my $col22 (22) { $c22 = $Sheet->Cells( $row, $col22 )->{'Va +lue'}; push @prod, $c22; } #same foreach my $col23 (23) { $c23 = $Sheet->Cells( $row, $col23 )->{'Va +lue'}; push @prod, $c23; } #same foreach my $col24 (24) { $c24 = $Sheet->Cells( $row, $col24 )->{'Va +lue'}; push @prod, $c24; } #same foreach my $col25 (25) { $c25 = $Sheet->Cells( $row, $col25 )->{'Va +lue'}; push @prod, $c25; } #same foreach my $col26 (26) { $c26 = $Sheet->Cells( $row, $col26 )->{'Va +lue'}; push @prod, $c26; } #same foreach my $col27 (27) { $c27 = $Sheet->Cells( $row, $col27 )->{'Va +lue'}; push @prod, $c27; } #32-39 production data foreach my $col32 (32) { $c32 = $Sheet->Cells( $row, $col32 )->{'Va +lue'}; push @prod, $c32; } foreach my $col33 (33) { $c33 = $Sheet->Cells( $row, $col33 )->{'Va +lue'}; push @prod, $c33; } foreach my $col34 (34) { $c34 = $Sheet->Cells( $row, $col34 )->{'Va +lue'}; push @prod, $c34; } foreach my $col35 (35) { $c35 = $Sheet->Cells( $row, $col35 )->{'Va +lue'}; push @prod, $c35; } foreach my $col36 (36) { $c36 = $Sheet->Cells( $row, $col36 )->{'Va +lue'}; push @prod, $c36; } foreach my $col37 (37) { $c37 = $Sheet->Cells( $row, $col37 )->{'Va +lue'}; push @prod, $c37; } foreach my $col38 (38) { $c38 = $Sheet->Cells( $row, $col38 )->{'Va +lue'}; push @prod, $c38; } foreach my $col39 (39) { $c39 = $Sheet->Cells( $row, $col39 )->{'Va +lue'}; push @prod, $c39; } } } } } } $Book->Close; } ###################################################################### +########## #Prints of the created arrays #print "@right\n"; #print "@legal\n"; #print "@prod\n"; ###################################################################### +########## ###################################################################### +############## #the process that eliminates duplicates in the @right array based on c +riteria #makes a copy of the array as it is destroyed in the next while statem +ent, This is maybe one of the problems my @rig = @right; #copy of array my @righ; # a new array that is the result of the while state +ment below my %seen; # a hash that stores agreementnumbers for unique en +tries #The array is in exact sets of 9 strings and i want it spliced a inter +vals of exactly those intervals while ( my ( $m, $n, $o, $p, $q, $r, $s, $t, $y ) = splice( @right, 0, + 9 ) ) { last if $m !~ /^\d{10}$/; next if $seen{$m}++; if ( $n =~ /Specific_region/ ) { if ( $o =~ /NG/ ) { push @righ, $m, $o, $p, $q, $r, $s, $t; } } } # print "@righ\n"; my @leg; #an array that holds the result for the next while statem +ent my @spli; #used as a container for certain entries in the while sta +tement my @joi; #another middle of equation array for picking up results %seen = () ; #emties the previous hash as the uniqueness of entries is also imp +ortant here my @tes; #yet another array for picking up results ###################################################################### +########## #the process that insures that each returned value is printed in the c +orrect #form and coupled with the ten digit number #again the array is organized in 6 values in a row that need to be sep +erated out into rows. while ( my ( $h, $aa, $rr, $j, $k, $l ) = splice( @legal, 0, 6 ) ) { last if $h !~ /^\d{10}$/; if ( $aa =~ /Specific_region/ ) { if ( $rr =~ /NG/ ) { if ( $j =~ /\n/ ) { next if $seen{$h}++; my @spli = split( /\n/, $j ); foreach my $n (@spli) { if ( $n =~ /LSD/ ) { my @tes = split( /LSD/, $n ); foreach my $lon (@tes) { if ( $lon =~ /SEC/ ) { my @joi = split( /-|W|:|\s|,|\(/, $lon + ); my $chans = @joi; my $eleg = join( "", @joi[ 0, 2, 1, $chans - 1 +] ); push @leg, $eleg, $h, $k, $l; } } } elsif ( $n =~ /\(/ ) { my @joi = split( /-|W|:|\s|\(/, $n ); my $chans = @joi; my $eleg = join( "", @joi[ 0, 2, 1, $chans - +2 ] ); push @leg, $eleg, $h, $k, $l; } else { my @joi = split( /-|W|:|\s/, $n ); my $chans = @joi; my $eleg = join( "", @joi[ 0, 2, 1, $chans - +1 ] ); push @leg, $eleg, $h, $k, $l; } } } else { next if $seen{$h}++; my @joi = split( /-|W|:|\s/, $j ); my $chans = @joi; my $eleg = join( "", @joi[ 0, 2, 1, $chans - 1 ] ); push @leg, $eleg, $h, $k, $l; } } } } #print "@leg\n"; my @peg; foreach my $loma (@leg) { if ( $loma =~ /^\d{7}$/ ) { substr( $loma, 6, 0, 0 ); push @peg, $loma; } else { push @peg, $loma; } } ###################################################################### +############## #the process that creates the production array for the entries my @produ; while ( my ( $cp, $aaa, $rrr, $dp, $ep, $fp, $gp, $hp, $ip, $jp, $kp, $lp, $mp, $np, $op, $pp, $qp, $rp, $sp, $tp, $up, $vp, $wp, $yp, $xp ) = splice( @prod, 0, 25 ) ) { last if $cp !~ /^\d{10}$/; if ( $aaa =~ /Specific_region/ ) { if ( $rrr =~ /NG/ ) { unless ( $dp =~ /a specific repeated text for all relevant entr +ies/ ) { #eliminate this if statement for option two, where entries with no act +ual production is included if ( defined($qp) && $qp =~ /\d\d-\d\d-\d{4}/ ) { push @produ, $dp, $ep, $cp, $fp, $gp, $hp, $ip, $j +p, $kp, $lp, $mp, $np, $op, $pp, $qp, $rp, $sp, $tp, $up +, $vp, $wp, $yp, $xp; } } } } } # print "@produ\n"; my @nwells; #an array that collects the results #creates the 4.2 entries "agreements with no wells while ( my ( $mn, $nn, $on, $pn, $qn, $rn, $sn, $tn, $yn ) = splice( @rig, 0, 9 ) ) { last if $mn !~ /^\d{10}$/; if ( $nn =~ /Specific_region/ ) { if ( $on =~ /NG/ ) { if ( $yn =~ /a specific repeated text for all relevant ent +ries/ ) { push @nwells, $mn, $on, $pn, $qn, $rn, $sn, $tn, $yn; } } } } # print "@nwells\n"; #Places results into arrays of arrays for easy computation in excel. u +ses natatime again the arrays # are of a specific size. per row. my @AAR; { my $iter = natatime 7, @righ; while ( my @tmp = $iter->() ) { push @AAR, \@tmp; } } my @BAR; { my $iter = natatime 4, @peg; while ( my @tmp = $iter->() ) { push @BAR, \@tmp; } } my @CAR; { my $iter = natatime 23, @produ; while ( my @tmp = $iter->() ) { push @CAR, \@tmp; } } my @DAR; { my $iter = natatime 8, @nwells; while ( my @tmp = $iter->() ) { push @DAR, \@tmp; } } #####The new excel sheets that should contain the results my $workbooknew = Excel::Writer::XLSX->new('re1.xlsx'); my $worksheetnew = $workbooknew->add_worksheet(); $worksheetnew->keep_leading_zeros(); $worksheetnew->set_column( 'A:G', 30 ); $worksheetnew->write_col( 'A2', \@AAR ); ####This is the data on the legalnumber - introduce keep_leading zeroe +s for correct legal form my $workbooknew1 = Excel::Writer::XLSX->new('re2.xlsx'); my $worksheetnew1 = $workbooknew1->add_worksheet(); $worksheetnew1->keep_leading_zeros(); $worksheetnew1->set_column( 'A:D', 15 ); $worksheetnew1->write_col( 'A2', \@BAR ); # # ####This is the data on the production of the wells - introduce ke +ep_leading zeroes for correct legal form my $workbooknew2 = Excel::Writer::XLSX->new('re3.xlsx'); my $worksheetnew2 = $workbooknew2->add_worksheet(); my $worksheetnew3 = $workbooknew2->add_worksheet(); $worksheetnew2->set_column( 'A:W', 50 ); $worksheetnew2->write_col( 'A2', \@CAR ); $worksheetnew3->set_column( 'A:H', 30 ); $worksheetnew3->write_col( 'A2', \@DAR );

Replies are listed 'Best First'.
Re: Out of memory and While replacements with excel XLSX application
by jmcnamara (Monsignor) on Oct 08, 2012 at 10:02 UTC

    The "out of memory" issue is probably caused by Excel::Writer::XLSX. Especially if you have 1 million rows.

    You can reduce its memory usage (almost completely*) by setting the workbook set_optimization() method.

    ... my $workbook = Excel::Writer::XLSX->new( file.xlsx' ); $workbook->set_optimization(); my $worksheet = $workbook->add_worksheet(); ...

    * See the Speed and Memory Usage section of the Excel::Writer::XLSX docs for a full explanation.

    --
    John.

      This answer was the first one I succefully implemeted. It was a solution to the original question of how to avoid a "out of memory" return, while still writing to .XLSX files. Thanks to the poster.

      The total system memory use never exceeded 4GB as before and the total runtime seemed unchanged.

      I will test some the other solutions to see if the runtime can be optimized through better coding on my part.

Re: Out of memory and While replacements with excel XLSX application
by aufflick (Deacon) on Oct 08, 2012 at 04:37 UTC
    Welcome to coding!

    I haven't used XLSX myself and this might be a naive suggestion but could your situation be satisfied with CSV files as your output (which you could then convert to XLSX files in Excel if you want to send XLSX to others instead of CSV)? Especially if you have huge amounts of data you can simply print to a raw text file, but you can also use very streamlined modules like Text::CSV.

Re: Out of memory and While replacements with excel XLSX application
by marquezc329 (Scribe) on Oct 08, 2012 at 09:00 UTC
    As anonymous before me stated, use subroutines to break up your program into smaller more manageable chunks. Then try to optimize each segment. Not only will this get you closer to solving your problems and more than likely eliminate alot of the memory usage along the way, but it will allow you to post a more concentrated chunk of code that is more likely to attract the attention of the monks.

    From perlfaq3:

         "When it comes to time-space tradeoffs, Perl nearly always prefers to throw memory at a problem. Scalars in Perl use more memory than strings in C, arrays take more than that, and hashes use even more."

    You have created an enormous amount of temporary scalars that are all named very similarly (i.e.  $c1..$c39) and used in very repetitive code. To start off, maybe try creating a subroutine to handle those more efficiently.
    sub extract { my @arrays = (\@right, \@legal, \@prod); foreach 1..39 { foreach my $col ($_) { next unless defined $Sheet->Cells($row, $col)->{'Value'}; my $c = $Sheet->Cells($row, $col)->{'Value'}; map {push @$_, $c} @arrays; } } }

    This is just an example, but I'm sure if you go through your program you'll be able to find multiple situations where things can be cut down and made faster, simpler, and easier to maintain/read.
    It might do you well to clean up your comments, and include a sample of the data you plan on working with before making a post like this. It'll increase the likelihood that you'll get help from the guys that know most. You might try reading How (Not) To Ask A Question. Good Luck!
Re: Out of memory and While replacements with excel XLSX application
by Anonymous Monk on Oct 08, 2012 at 08:00 UTC
        ValueTargetCols should loop over rows first then cols, but it probably makes no difference in the results in this case
Re: Out of memory and While replacements with excel XLSX application
by afoken (Chancellor) on Oct 08, 2012 at 17:30 UTC
    I need to go over 20 seperate excel files (.xls) and grap values based on certain patterns and then copy them into new .XLSX sheets. The total amount of rows is close to 1 million.

    I think an Excel sheet with one million lines is a clear case for using a relational database (plus a report generator) instead of Excel. If you have no better idea, use MS Access as a front-end to some stripped-down MS SQL Server (last time I used it the stripped-down version was called MSDE) or PostgreSQL.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Out of memory and While replacements with excel XLSX application
by Weinstar (Initiate) on Oct 08, 2012 at 22:57 UTC

    Hello. Thank you to all of you for your reponses. There seems to be a lot to go on and a general sentiment, that this is doable, and with a lot of optimization, so that is great. It will proberbly take me some time to work through it all, but I will post the final update, as soon as i am done, for future reference. Again I thank you for taking your time to look over my problem.

      Hi, I'm using Spreadsheet::XLSX to parse the xlsx file. for around 20000 rows , the memory it takes around 800 MB, though size of file is 5 MB. Can anybody helps me to optimize it? following is the code:

      my ($self, $in_file, $out_file) = @_; $count = @_; $count -= 1; if(2 != $count) { util->logit("[$PARENT_PROC_NAME]: xlsx_to_csv:invalid args [@_ +] [$count]"); return $FAILURE; } my $excel = Spreadsheet::XLSX -> new ($in_file); my $line; my $out_csv_name = $out_file; my $orig_out_file = $out_csv_name; #binmode CSV; binmode CSV, ":utf8"; my $Csv = Text::CSV_XS->new({ 'quote_char' => '"', 'escape_char' => '"', 'sep_char' => ',', 'binary' => 1, }); my $i = 1; foreach my $sheet (@{$excel -> {Worksheet}}) { $i++; $out_csv_name = "$orig_out_file$underscore$sheet->{Name}.csv"; #$out_csv_name =~ s/\s+//g; @ret_array[$i] = $out_csv_name; open CSV, "> $out_csv_name" || die "Cannot create csv file: $! +" ; $sheet -> {MaxRow} ||= $sheet -> {MinRow}; foreach my $row ($sheet -> {MinRow} .. $sheet -> {MaxRow}) { my @Row; $sheet -> {MaxCol} ||= $sheet -> {MinCol}; foreach my $col ($sheet -> {MinCol} .. $sheet -> {MaxCol} +) { my $cell = $sheet -> {Cells} [$row] [$col]; my $Value = ""; if ($cell) { #$Value = $cell->Value; #rounds off #$Value = $cell->{_Value}; #rounds off #$Value = $cell->Value(); #rounds off #$Value = $cell->{Val}; $Value = $cell->unformatted(); #my $avl = $sheet->{Cells}[$row][$col]{Val}; #print "avl:[$avl]\n"; #$Value = $cell->value(); if ($Value eq 'GENERAL') { $Value = $cell->{Val}; } } last if $col == $sheet->{MinCol} and !$Value; push(@Row, $Value); } next unless @Row; my $Status = $Csv->combine(@Row); if (!defined $Status) { my $Error = $Csv->error_input(); warn "ERROR FOUND!: $Error"; } if (defined $Status) { my $Line = $Csv->string(); print CSV "$Line\n"; }

      I don't know how to do it...I'm new to perl. Kindly guide me.

        Can't help you right now (sorry), but as a quick suggestion -- it's best to post your question as a new top-level node in Seekers of Perl Wisdom.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://997733]
Approved by davido
Front-paged by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-04-19 11:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found