Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

OpenOffice::OODoc. Slow access to a cell content with getCellValue

by pvaldes (Chaplain)
on Jun 01, 2016 at 14:27 UTC ( [id://1164682] : perlquestion . print w/replies, xml ) Need Help??

pvaldes has asked for the wisdom of the Perl Monks concerning the following question:

I'm playing with the module OpenOffice and trying to extract some info from a spreadsheet to a text file. This is my program so far:

#!/usr/bin/perl use strict; use warnings; use OpenOffice::OODoc; ## module version 2.125.3, from Debian Perl Group, installed with: ## apt install libopenoffice-oodoc-perl open (my $out, '>', "outfile.txt") or die $!; my $file = ooFile("myspreadsheet.ods"); my $content = odfText(file => $file, part => 'content') or die $!; # Note: "odfText can be used in place of ooDocument() # if the calling application is only text-focused # We aren't interested in formats or styles. Perldoc" ## Write to a file the content of the non empty cells ## in the interval B1:G2001 from "sheet2" to "sheet9" ## in the .ods libreoffice spreadsheet. foreach my $sheet (1...8){ foreach my $row (0...2000){ foreach my $col (1...6){ my $value = $content->getCellValue($sheet,$row,$col); next if $value eq ""; # next if cell is empty next if $value eq "Don't keep this"; # or if match some pattern print $out $value,"\n"} # print cell content to outfile print $out "\n";} # and a newline at the end of each row print "okay, next sheet\n-------\n"; }; close $out;

When I run the program it take hours to finish. Much more slow than doing it directly with the mouse. I have also a warning message about a undefined "" value (maybe the problem is that $value is not of type string and need to be converted each time first?).

I would appreciate if you can take a look to this draft and suggest how to make it faster, or maybe point to any bug or possible memory leak. Thank you very much.

Replies are listed 'Best First'.
Re: OpenOffice::OODoc. Slow access to a cell content with getCellValue
by poj (Abbot) on Jun 01, 2016 at 16:43 UTC

    You could try using the getTableText method to fill an array.

    #!/usr/bin/perl use strict; use OpenOffice::OODoc; use Data::Dump 'pp'; my $doc = odfDocument(file => "myspreadsheet.ods"); open (my $out, '>', "outfile.txt") or die $!; foreach my $sheetno (0...$doc->getTableList-1){ my $table = $doc->getTable($sheetno,'normalize'); my ($rows,$cols) = $doc->getTableSize($table); my @text = $doc->getTableText($table); #pp @text; print "Sheet=$sheetno rows=$rows cols=$cols\n"; foreach my $rowno (0...$rows-1){ foreach my $colno (0...$cols-1){ my $value = $text[$rowno][$colno]; next if $value eq ''; print $out $value,"\n" } print $out "\n"; } };
    poj
      Much faster Poj, only 11 minutes running (against "infinite minutes"), but the outfile.txt was totally empty after this time. Something is wrong :-(

        I used this to create a simple test file

        #!/usr/bin/perl use strict; use OpenOffice::OODoc; my $doc = odfDocument( file => 'testspreadsheet.ods', create => 'spreadsheet'); $doc->expandTable(0,5,10); for my $r (0..4){ for my $c (0..9){ $doc->cellValue(0,$r,$c,"text at ".chr(65+$c).($r+1)); } } $doc->save;
        poj

        Update: It seems that there was problem with the memory when I run your script, Poj. This problem could or could not be related with the perl job.

        I'm running again the script with the -d option. When It hits the line

         my $table = $doc->getTable($sheetno,'normalize');

        The perl script uses the 99-100% of CPU and the 10% of memory. After some minutes the memory consume raises to 81-83% until finally the computer frozens and the process is automatically killed without finishing the job. This takes 6-11 min and explains why outfile.txt is empty. It seems that my spreadsheet file is a hard nut to open. I wonder why.

Re: OpenOffice::OODoc. Slow access to a cell content with getCellValue
by anonymized user 468275 (Curate) on Jun 01, 2016 at 19:17 UTC
    Yes, the bug is "I'm playing with the module OpenOffice and trying to extract some info from a spreadsheet to a text file". The correction is s/module//; and the bug is gone ;) Seriously though, Excel and Open office provide text format input and output precisely so that people do not have to do this! A better idea is to consider the text file you can obtain easily enough without programming in itself a temporary inconvenience for which support will be irrelevant later down the path to developing a proper system that does what your business actually needs.

    One world, one people

      Thanks for the opinion, but sometimes you simply need a more sophisticated approach. This is only a first part of a much more complex, non standard, cell selection problem.

      My past experience with openoffice macros ended in a blast of fire and an orgy of mutilated xmls beyond repair, therefore, this time I want Perl to save the day.

        There is likely another way to achieve the overall goal is what I am saying.

        One world, one people

Re: OpenOffice::OODoc. Slow access to a cell content with getCellValue
by pvaldes (Chaplain) on Jun 06, 2016 at 09:57 UTC

    I keep doing my little research, and it seems that problems with memory in this module aren't new. This is a similar scenery reported as a bug in 2015

    "Reading several odf files in a loop reusing the same variable exhausts all memory. Example:"

    # Fails #!/usr/bin/perl use OpenOffice::OODoc; $myfile = "test.odt"; while(1){my $var = odfDocument(file => $myfile)}

    Why this happens? Do you think that adding an undef $value after the "foreach my $col" loop in my script would make a difference?. I didn't notice any improvement

    This is interesting also, any experience with this other module?:

    "For applications that make intensive spreadsheet processing, a switch to ODF::lpOD (more efficient and user-friendly than OpenOffice::OODoc) should be considered"