
Reading and Writing from / to XLSX file (optimization)

by thanos1983 (Parson)
on Nov 01, 2017 at 08:54 UTC ( [id://1202500]=perlquestion )

thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

I built a script that reads and writes XLSX files, customized to my requirements. The script seems to work fine, but I would appreciate it if someone could take a look and suggest any minor/major modifications to make it more efficient.

Most people will be thinking: does it make sense to optimize a script that takes milliseconds to execute? I am trying to optimize the script not only in terms of speed but also in terms of resource management. I am reading a primary file that contains almost 4,500 lines and a secondary file that contains approximately 2,500 lines; I compare them and print the output to a third file.

I am trying to optimize the script because both files are growing day by day and are not going to get any smaller. I have never worked with files this big, so maybe the size is actually quite small in practice and the optimization is not necessary.

Enough said, let's get to the code. Sample of code below:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;

use Data::Dumper;
use File::MMagic;
use Number::Latin;
use Excel::Writer::XLSX;
use List::Util qw( first );
use Spreadsheet::ParseXLSX;
use Spreadsheet::ParseExcel;

die "Usage: $0 Primary (excel sheet) Secondary (excel sheet) Name of new (excel sheet)\n"
    if (scalar @ARGV != 3);

sub _checkFilesExtensions {
    my ( $fileInput ) = @_;
    foreach my $file (@$fileInput) {
        # Match a dot, followed by any number of non-dots until the
        # end of the line.
        my ($input, $ext) = split('\.', $file);
        die "Please use strictly excel files format .xls or .xlsx '$file'\n"
            unless defined $ext;
        die "Please use strictly excel files format .xls or .xlsx '$file'\n"
            unless ( $ext eq 'xls' or $ext eq 'xlsx' );
    }
    return;
}

sub _getObjectSpreadSheet {
    my ( $FileToOpen ) = @_;

    my $ft   = File::MMagic->new();
    my $type = $ft->checktype_filename( $FileToOpen );

    my $parser;
    if ($type =~ /zip/) {
        $parser = Spreadsheet::ParseXLSX->new();
    }
    else {
        $parser = Spreadsheet::ParseExcel->new();
    }

    my $workbook = $parser->parse( $FileToOpen );
    if ( !defined $workbook ) {
        die $parser->error(), ".\n";
    }
    return $workbook->worksheet( 0 );
}

sub _getDataFromSpreadSheet {
    my ( $worksheet, $RouterIDColumn, $WanInterfaceColumn ) = @_;

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my $name = $worksheet->get_name();

    my %HoH;
    my %hash;
    foreach my $row ( $row_min .. $row_max ) {
        my $cellWanInterface = $worksheet->get_cell( $row, $WanInterfaceColumn );
        next unless $cellWanInterface;
        if ($cellWanInterface->value()) {
            my $cellRouteID = $worksheet->get_cell( $row, $RouterIDColumn );
            if ($WanInterfaceColumn) {
                $hash{$cellRouteID->value()} = $cellWanInterface->value();
                $row = $row + 1; # Plus one because row starts from 0 and counting from 1
                $HoH{uc int2latin($RouterIDColumn + 1).$row} = {
                    'RouterID'      => $cellRouteID->value(),
                    'Wan Interface' => $cellWanInterface->value(),
                };
            }
            else {
                $hash{$cellWanInterface->value()} = undef;
                $row = $row + 1; # Plus one because row starts from 0 and counting from 1
                $HoH{uc int2latin($WanInterfaceColumn + 1).$row} = {
                    'RouterID'      => $cellWanInterface->value(),
                    'Wan Interface' => undef,
                };
            }
        }
        next;
    }
    return \%HoH, \%hash;
}

sub _writeIntoCell {
    my ( $key,
         $hashOfHashesWanInterfaceDataValue,
         $hashTechnicalReportDataValue,
         $worksheet,
         $workbook ) = @_;

    my %formatHash = (
        -font   => 'Arial',
        -size   => 8,
        -border => 7,
        -align  => 'center',
    );

    # Add and define a format
    my $format = $workbook->add_format( %formatHash );

    # Write some text $row, $column, $text, $format
    $worksheet->write( $key, $hashOfHashesWanInterfaceDataValue, $format );

    my @characters  = split(//, $key);
    my @columnArray = grep /^[[:alpha:]]+$/, @characters;
    my @rowArray    = grep /^[[:digit:]]+$/, @characters;
    my $column = join('', @columnArray);
    my $row    = join('', @rowArray);
    $column++;
    $key = $column . $row;
    $worksheet->write( $key, $hashTechnicalReportDataValue, $format );
    return;
}

sub getDataTechicalReport {
    my ( $FileToOpen ) = @_;

    my $worksheet = _getObjectSpreadSheet( $FileToOpen );

    my $RouterIDColumn     = 1;  # RouterID
    my $WanInterfaceColumn = 25; # Wan Interface

    return _getDataFromSpreadSheet( $worksheet,
                                    $RouterIDColumn,
                                    $WanInterfaceColumn );
}

sub getDataWanInterface {
    my ( $FileToOpen ) = @_;

    my $worksheet = _getObjectSpreadSheet( $FileToOpen );

    my $RouterIDColumn     = undef; # We do not care
    my $WanInterfaceColumn = 0;     # A column

    return _getDataFromSpreadSheet( $worksheet,
                                    $RouterIDColumn,
                                    $WanInterfaceColumn );
}

sub spreadsheetToWrite {
    my ( $hashOfHashesWanInterfaceData, $hashTechnicalReportData ) = @_;

    # Create a new Excel workbook
    my $workbook = Excel::Writer::XLSX->new( $ARGV[2] );
    die "Problems creating new Excel file: $!" unless defined $workbook;

    # Add a worksheet
    my $worksheet = $workbook->add_worksheet();

    foreach my $key ( keys %$hashOfHashesWanInterfaceData ) {
        my $RouterID = first { /$hashOfHashesWanInterfaceData->{$key}{'RouterID'}/ }
            keys %$hashTechnicalReportData;
        if ( $RouterID ) {
            _writeIntoCell( $key,
                            $hashOfHashesWanInterfaceData->{$key}{'RouterID'},
                            $hashTechnicalReportData->{$RouterID},
                            $worksheet,
                            $workbook );
        }
        else {
            _writeIntoCell( $key,
                            $hashOfHashesWanInterfaceData->{$key}{'RouterID'},
                            "undef",
                            $worksheet,
                            $workbook );
        }
    }

    my %headerFormatHash = (
        -font     => 'Calibri',
        -size     => 11,
        -bg_color => '#09c8eb',
        -bold     => 1,
        -border   => 7,
        -align    => 'center',
    );

    my $headerFormat = $workbook->add_format( %headerFormatHash );

    $worksheet->write( 'A1', 'DNS Name',      $headerFormat );
    $worksheet->write( 'B1', 'WAN Interface', $headerFormat );
    return;
}

#### main() ####

_checkFilesExtensions(\@ARGV);

my ( $hashOfHashesTechikalReport, $hashTechnicalReport )
    = getDataTechicalReport( $ARGV[0] );
# print Dumper $hashOfHashesTechikalReport;
# print Dumper %$hashOfHashesTechikalReport{'B1376'};

my @keysTechicalReport = keys %$hashTechnicalReport;
# print Dumper \@keysTechicalReport;

my ( $hashOfHashesWanInterface, $hashWanInterface )
    = getDataWanInterface( $ARGV[1] );
# print Dumper $hashOfHashesWanInterface;
# print Dumper $hashWanInterface;
# print Dumper %$hashOfHashesWanInterface{'A13'};

my @keysWanInterface = keys %$hashWanInterface;
# print Dumper \@keysWanInterface;

spreadsheetToWrite( $hashOfHashesWanInterface, $hashTechnicalReport );

__END__
$ perl test.pl prime.xlsx secondary.xlsx sample.xlsx

In order to test/run the script, you need to create two files, e.g. prime.xlsx and secondary.xlsx. In the prime file you need to add a few lines of data in column B (with the WAN interface in column Z), e.g.:

                                 B (column)    Z (column)
Line 1 (or whatever number)      node1         GigabitEthernet0/1
Line 2 (or whatever number)      node2         GigabitEthernet0/2
Line 3 (or whatever number)      node3         GigabitEthernet0/3

In the secondary file you need data only in column A, e.g.:

A (column)
node1
node2
node3
node4 (extra on purpose)

The output XLSX file, e.g. sample.xlsx, should look like this:

Column A    Column B
DNS Name    WAN Interface
node1       GigabitEthernet0/1
node2       GigabitEthernet0/2
node3       GigabitEthernet0/3
node4       undef

One minor note here, just in case someone knows how to get this done: I am trying to find a way to modify the cell size in the output file. For example, I would like to make the output cells 4 cm wide and 3 cm high. Any ideas whether this is possible, or am I looking for something that is not yet implemented?

Thanks in advance for your time and effort in assisting me.

BR / Thanos

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re: Reading and Writing from / to XLSX file (optimization)
by 1nickt (Canon) on Nov 01, 2017 at 12:18 UTC

    Hi, just a recommendation: tighten up your variable naming. It's going to be tiresome to always have to remember which variant of spelling you used for a given var/sub name. Here are three different ones in one statement:

    my ( $hashOfHashesTechikalReport, $hashTechnicalReport ) = getDataTechicalReport( $ARGV[0] );


    The way forward always starts with a minimal test.

      Hello 1nickt,

      You are absolutely right about that. Today I know what the script does; in a year I will most likely have to investigate it again.

      Thanks a lot for your time and effort.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by roboticus (Chancellor) on Nov 01, 2017 at 16:23 UTC

    thanos1983:

    I don't really see anything major, but I have some minor comments.

    First, in _writeIntoCell you're creating a new format for each cell entry, rather than reusing them. When I generate spreadsheets, I generally build a few handy styles once when I build the excel object, then reuse them as needed, something like:

    my $workbook = Excel::Writer::XLSX->new( $ARGV[2] );

    # Normal cell
    my $fmtNRM = $workbook->add_format({
        -font   => 'Arial',
        -size   => 8,
        -border => 7,
        -align  => 'center',
    });

    # Row/Column header
    my $fmtHDR = $workbook->add_format({
        -font     => 'Calibri',
        -size     => 11,
        -bg_color => '#09c8eb',
        -bold     => 1,
        -border   => 7,
        -align    => 'center',
    });

    It may turn out that the module recognizes duplicate styles and doesn't bloat the spreadsheet, but it still feels better to me to create a format once and only once.

    Some of your variable names could use improvement. For example, in _getDataFromSpreadSheet you're building %HoH and %hash. I can tell from the usage that these are in fact a hash of hashes and a hash, but the names don't really indicate anything useful. It looks like %WANinterfaces would be a good name for %HoH, but without digging into your data and studying your program a little more intently, I can't really tell what I would use for %hash.

    Finally, I wouldn't worry about optimizing a program unless it's either proven to be a problem, or the general trend indicates that it will become a problem in the immediate future. So if your data files are continuously increasing in size, then write a quickie perl script to generate some bogus input spreadsheets of varying sizes to see how it performs as the data volume increases. Then run it while watching the time and memory consumption. That way, you'll be able to make an educated guess as to what the performance should be for the future. You'll likely find that it can handle quite a lot of volume before the time becomes worth worrying about.
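
    For illustration, here is a minimal sketch of such a throw-away generator (the file name mock.xlsx, the default row count, and the fake cell values are assumptions; the columns match the ones the main script reads, i.e. B for the router ID and Z for the WAN interface):

    #!/usr/bin/perl
    # Sketch only: write an .xlsx with $rows rows of fake data so the main
    # script's run time and memory use can be measured as the volume grows.
    use strict;
    use warnings;
    use Excel::Writer::XLSX;

    my $rows      = shift // 100_000;
    my $workbook  = Excel::Writer::XLSX->new( 'mock.xlsx' );
    my $worksheet = $workbook->add_worksheet();

    for my $row ( 0 .. $rows - 1 ) {
        $worksheet->write( $row, 1,  "node$row" );                 # column B: router ID
        $worksheet->write( $row, 25, "GigabitEthernet0/$row" );    # column Z: WAN interface
    }
    $workbook->close();

    Timing the main script against generated files of increasing size should show how its run time and memory consumption scale.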

    It's surprising to me how fast computers are nowadays, so I rarely find it worthwhile to optimize code except when doing some extremely CPU-intensive work.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hello roboticus,

      OMG you are so right, I missed this part completely. Thanks a lot for the review.

      I need to update the variable names, you are right.

      Thanks again for your time and effort reviewing my code.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by vr (Curate) on Nov 01, 2017 at 11:56 UTC

    Why so complex? Maybe I don't get it, but looking at the sample data there is just a continuous list @A (one column) of keys (strings) in file #2, to be extracted with a one-liner. Almost as easily, file #1 can be read into %H, with its keys transformed by a well-defined rule (?) to correspond to the strings from @A. And finally, output @A and @H{@A} -- for the sake of efficiency (I hope) "in one go".

    To start with (though OT), _checkFilesExtensions doesn't do what its comment says; a simple die "..." if $file !~ /\.xlsx?$/i; would do.

    Update. W.r.t. your question about how "to modify the cell size", I'm sure you've found set_row and set_column. The "height" is easy and looks like a value in PostScript points. The unit of measure for "width" is something hideous, in number of digits of 11-point Calibri, but it also involves some padding and rounding, i.e. the width can be set only approximately. If my reverse-engineering-foo serves me right this time of day, to set a cell to be 40 mm wide and 30 mm high, I'd write

    $worksheet->set_row( 0, 30 / 25.4 * 72 );
    $worksheet->set_column( 0, 0, ( 40 / 25.4 * 72 - 5 ) / 7 );

    but I only have LibreOffice here to confirm

    P.S. And of course, reading my original answer above, I meant "make %A = map { $_ => 1 } @A, then only read those lines from file #1 which have keys existing in %A, then etc." :-)

    And a wrong link for "one go" sneaked in; here's perhaps a more efficient one. I hope that was the final edit.
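
    A minimal sketch of that lookup approach, with the question's sample data inlined as a stand-in for the two sheets (%file1, %A and %H are just illustration names):

    my @A = qw( node1 node2 node3 node4 );        # column A of file #2
    my %A = map { $_ => 1 } @A;                   # for quick existence checks

    # Stand-in for the rows of file #1 (router ID => WAN interface).
    my %file1 = (
        node1 => 'GigabitEthernet0/1',
        node2 => 'GigabitEthernet0/2',
        node3 => 'GigabitEthernet0/3',
        node0 => 'would be skipped',
    );

    my %H;
    while ( my ( $key, $wan ) = each %file1 ) {
        next unless exists $A{$key};              # keep only keys that exist in %A
        $H{$key} = $wan;
    }

    print join( "\t", $_, $H{$_} // 'undef' ), "\n" for @A;   # output @A and @H{@A}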

      Hello vr,

      You got a point there regarding _checkFilesExtensions. I have updated the version.

      Regarding the rest I will take a look and try to update my code.

      Thanks again for the time and effort, BR.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by holli (Abbot) on Nov 01, 2017 at 09:24 UTC
    I am trying to optimize the script not only in terms of speed but also in terms of resource management. I am reading a primary file that contains almost 4,500 lines (emphasis mine)
    So? I'd start worrying at four and a half million lines. If Excel itself doesn't choke on the file, I doubt ParseExcel will. Why don't you just mock up a really large dataset, say 100,000 lines and see how your program does?


    holli

    You can lead your users to water, but alas, you cannot drown them.

      Hello holli,

      Thanks for the time and effort reading and replying to my question.

      As I said, "I have never worked with files this big, so maybe the size is actually quite small in practice and the optimization is not necessary." I will try to run some stress tests as you suggested: mock up a really large dataset, say 100,000 lines, and see how the program responds.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by Laurent_R (Canon) on Nov 01, 2017 at 11:15 UTC
    Same opinion as holli: I would not worry about 4,500 lines, or even about ten or twenty times as many. These are really not large data sets.

    I deal almost daily with files with tens of millions of lines (and sometimes even hundreds of millions); there I might have to worry about performance, but not with tens of thousands.

      Hello Laurent_R,

      Thank you for your time and effort. I lack experience with data files this big.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by ablanke (Monsignor) on Nov 04, 2017 at 12:02 UTC
    Hi thanos1983,

    Since no one else mentioned it:

    Devel::NYTProf helps you find the performance hotspots in your code.

    Maybe your "small" files are the right place to start with profiling.
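
    A typical invocation, assuming the script is run as shown in the original post (the profiler writes nytprof.out; nytprofhtml turns it into an HTML report):

    $ perl -d:NYTProf test.pl prime.xlsx secondary.xlsx sample.xlsx
    $ nytprofhtml    # writes nytprof/index.html with per-sub and per-line timings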
