
Reading and Writing from / to XLSX file (optimization)

by thanos1983 (Parson)
on Nov 01, 2017 at 08:54 UTC ( [id://1202500]=perlquestion )

thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

I built a script that reads and writes XLSX files, customized to my requirements. The script seems to work fine, but I would appreciate it if someone could take a look and suggest any minor/major modifications to make it more efficient.

Most people will be thinking: does it make sense to optimize a script that takes milliseconds to execute? I am trying to optimize the script not only in terms of speed but also in terms of resource management. I am reading a primary file that contains almost 4,500 lines and a secondary file that contains approximately 2,500 lines; I compare them and print the output to a third file.

I am trying to optimize the script because both files are growing day by day and are not going to get any smaller. I have never worked with files this big, so maybe the size is actually quite small in practice and the optimization is not necessary.

Enough said, let's get to the code. Sample of code below:

#!/usr/bin/perl
use utf8;
use strict;
use warnings;

use Data::Dumper;
use File::MMagic;
use Number::Latin;
use Excel::Writer::XLSX;
use List::Util qw( first );
use Spreadsheet::ParseXLSX;
use Spreadsheet::ParseExcel;

die "Usage: $0 Primary (excel sheet) Secondary (excel sheet) Name of new (excel sheet)\n"
    if (scalar @ARGV != 3);

sub _checkFilesExtensions {
    my ( $fileInput ) = @_;
    foreach my $file (@$fileInput) {
        # Match a dot, followed by any number of non-dots until the
        # end of the line.
        my ($input, $ext) = split('\.', $file);
        die "Please use strictly excel files format .xls or .xlsx '$file'\n"
            unless defined $ext;
        die "Please use strictly excel files format .xls or .xlsx '$file'\n"
            unless ( $ext eq 'xls' or $ext eq 'xlsx' );
    }
    return;
}

sub _getObjectSpreadSheet {
    my ( $FileToOpen ) = @_;

    my $ft   = File::MMagic->new();
    my $type = $ft->checktype_filename( $FileToOpen );

    my $parser;
    if ($type =~ /zip/) {
        $parser = Spreadsheet::ParseXLSX->new();
    }
    else {
        $parser = Spreadsheet::ParseExcel->new();
    }

    my $workbook = $parser->parse( $FileToOpen );
    if ( !defined $workbook ) {
        die $parser->error(), ".\n";
    }
    return $workbook->worksheet( 0 );
}

sub _getDataFromSpreadSheet {
    my ( $worksheet, $RouterIDColumn, $WanInterfaceColumn ) = @_;

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my $name = $worksheet->get_name();

    my %HoH;
    my %hash;
    foreach my $row ( $row_min .. $row_max ) {
        my $cellWanInterface = $worksheet->get_cell( $row, $WanInterfaceColumn );
        next unless $cellWanInterface;
        if ($cellWanInterface->value()) {
            my $cellRouteID = $worksheet->get_cell( $row, $RouterIDColumn );
            if ($WanInterfaceColumn) {
                $hash{$cellRouteID->value()} = $cellWanInterface->value();
                $row = $row + 1; # Plus one because row starts from 0 and counting from 1
                $HoH{uc int2latin($RouterIDColumn + 1).$row} = {
                    'RouterID'      => $cellRouteID->value(),
                    'Wan Interface' => $cellWanInterface->value(),
                };
            }
            else {
                $hash{$cellWanInterface->value()} = undef;
                $row = $row + 1; # Plus one because row starts from 0 and counting from 1
                $HoH{uc int2latin($WanInterfaceColumn + 1).$row} = {
                    'RouterID'      => $cellWanInterface->value(),
                    'Wan Interface' => undef,
                };
            }
        }
        next;
    }
    return \%HoH, \%hash;
}

sub _writeIntoCell {
    my ( $key,
         $hashOfHashesWanInterfaceDataValue,
         $hashTechnicalReportDataValue,
         $worksheet,
         $workbook ) = @_;

    my %formatHash = (
        -font   => 'Arial',
        -size   => 8,
        -border => 7,
        -align  => 'center',
    );

    # Add and define a format
    my $format = $workbook->add_format( %formatHash );

    # Write some text $row, $column, $text, $format
    $worksheet->write( $key, $hashOfHashesWanInterfaceDataValue, $format );

    my @characters  = split(//, $key);
    my @columnArray = grep /^[[:alpha:]]+$/, @characters;
    my @rowArray    = grep /^[[:digit:]]+$/, @characters;
    my $column = join('', @columnArray);
    my $row    = join('', @rowArray);
    $column++;
    $key = $column . $row;
    $worksheet->write( $key, $hashTechnicalReportDataValue, $format );
    return;
}

sub getDataTechicalReport {
    my ( $FileToOpen ) = @_;

    my $worksheet = _getObjectSpreadSheet( $FileToOpen );

    my $RouterIDColumn     = 1;  # RouterID
    my $WanInterfaceColumn = 25; # Wan Interface

    return _getDataFromSpreadSheet( $worksheet,
                                    $RouterIDColumn,
                                    $WanInterfaceColumn );
}

sub getDataWanInterface {
    my ( $FileToOpen ) = @_;

    my $worksheet = _getObjectSpreadSheet( $FileToOpen );

    my $RouterIDColumn     = undef; # We do not care
    my $WanInterfaceColumn = 0;     # A column

    return _getDataFromSpreadSheet( $worksheet,
                                    $RouterIDColumn,
                                    $WanInterfaceColumn );
}

sub spreadsheetToWrite {
    my ( $hashOfHashesWanInterfaceData, $hashTechnicalReportData ) = @_;

    # Create a new Excel workbook
    my $workbook = Excel::Writer::XLSX->new( $ARGV[2] );
    die "Problems creating new Excel file: $!" unless defined $workbook;

    # Add a worksheet
    my $worksheet = $workbook->add_worksheet();

    foreach my $key ( keys %$hashOfHashesWanInterfaceData ) {
        my $RouterID = first { /$hashOfHashesWanInterfaceData->{$key}{'RouterID'}/ }
            keys %$hashTechnicalReportData;
        if ( $RouterID ) {
            _writeIntoCell( $key,
                            $hashOfHashesWanInterfaceData->{$key}{'RouterID'},
                            $hashTechnicalReportData->{$RouterID},
                            $worksheet,
                            $workbook );
        }
        else {
            _writeIntoCell( $key,
                            $hashOfHashesWanInterfaceData->{$key}{'RouterID'},
                            "undef",
                            $worksheet,
                            $workbook );
        }
    }

    my %headerFormatHash = (
        -font     => 'Calibri',
        -size     => 11,
        -bg_color => '#09c8eb',
        -bold     => 1,
        -border   => 7,
        -align    => 'center',
    );

    my $headerFormat = $workbook->add_format( %headerFormatHash );

    $worksheet->write( 'A1', 'DNS Name',      $headerFormat );
    $worksheet->write( 'B1', 'WAN Interface', $headerFormat );
    return;
}

#### main() ####

_checkFilesExtensions(\@ARGV);

my ( $hashOfHashesTechikalReport, $hashTechnicalReport )
    = getDataTechicalReport( $ARGV[0] );
# print Dumper $hashOfHashesTechikalReport;
# print Dumper %$hashOfHashesTechikalReport{'B1376'};

my @keysTechicalReport = keys %$hashTechnicalReport;
# print Dumper \@keysTechicalReport;

my ( $hashOfHashesWanInterface, $hashWanInterface )
    = getDataWanInterface( $ARGV[1] );
# print Dumper $hashOfHashesWanInterface;
# print Dumper $hashWanInterface;
# print Dumper %$hashOfHashesWanInterface{'A13'};

my @keysWanInterface = keys %$hashWanInterface;
# print Dumper \@keysWanInterface;

spreadsheetToWrite( $hashOfHashesWanInterface, $hashTechnicalReport );

__END__
$ perl test.pl prime.xlsx secondary.xlsx sample.xlsx

In order to test/run the script, you need to create two files, e.g. prime.xlsx and secondary.xlsx. In the prime file you need to add a few lines of data in column B (with the WAN interface in column Z), e.g.:

                                 B (column)    Z (column)
Line 1 (or whatever number)      node1         GigabitEthernet0/1
Line 2 (or whatever number)      node2         GigabitEthernet0/2
Line 3 (or whatever number)      node3         GigabitEthernet0/3

In the secondary file you need data only in column A, e.g.:

A (column)
node1
node2
node3
node4 (extra on purpose)

The output XLSX file, e.g. sample.xlsx, should look like this:

Column A    Column B
DNS Name    WAN Interface
node1       GigabitEthernet0/1
node2       GigabitEthernet0/2
node3       GigabitEthernet0/3
node4       undef

One minor note here, just in case someone knows how to get this done: I am trying to find a way to modify the cell size in the output file. For example, I would like to make the output cells 4 cm wide and 3 cm high. Any ideas whether this is possible, or am I looking for something that is not yet implemented?

Thanks in advance for your time and effort in assisting me.

BR / Thanos

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re: Reading and Writing from / to XLSX file (optimization)
by 1nickt (Canon) on Nov 01, 2017 at 12:18 UTC

    Hi, just a recommendation: tighten up your variable naming. It's going to be tiresome to always have to remember which variant of spelling you used for a given var/sub name. Here are three different ones in one statement:

    my ( $hashOfHashesTechikalReport, $hashTechnicalReport ) = getDataTechicalReport( $ARGV[0] );


    The way forward always starts with a minimal test.

      Hello 1nickt,

      You are absolutely right about that. Today I know what the script does; in a year I will most likely have to investigate it again.

      Thanks a lot for your time and effort.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by roboticus (Chancellor) on Nov 01, 2017 at 16:23 UTC

    thanos1983:

    I don't really see anything major, but I have some minor comments.

    First, in _writeIntoCell you're creating a new format for each cell entry, rather than reusing them. When I generate spreadsheets, I generally build a few handy styles once when I build the excel object, then reuse them as needed, something like:

    my $workbook = Excel::Writer::XLSX->new( $ARGV[2] );

    # Normal cell
    my $fmtNRM = $workbook->add_format({
        -font   => 'Arial',
        -size   => 8,
        -border => 7,
        -align  => 'center',
    });

    # Row/Column header
    my $fmtHDR = $workbook->add_format({
        -font     => 'Calibri',
        -size     => 11,
        -bg_color => '#09c8eb',
        -bold     => 1,
        -border   => 7,
        -align    => 'center',
    });

    It may turn out that the module recognizes duplicate styles and doesn't bloat the spreadsheet, but it still feels better to me to create a format once and only once.

    Some of your variable names could use improvement. For example, in _getDataFromSpreadSheet you're building %HoH and %hash. I can tell from the usage that these are in fact a hash of hashes and a hash, but the names don't really indicate anything useful. It looks like %WANinterfaces would be a good name for %HoH, but without digging into your data and studying your program a little more intently, I can't really tell what I would use for %hash.

    Finally, I wouldn't worry about optimizing a program unless it's either proven to be a problem, or the general trend indicates that it will become a problem in the immediate future. So if your data files are continuously increasing in size, then write a quickie perl script to generate some bogus input spreadsheets of varying sizes to see how it performs as the data volume increases. Then run it while watching the time and memory consumption. That way, you'll be able to make an educated guess as to what the performance should be for the future. You'll likely find that it can handle quite a lot of volume before the time becomes worth worrying about.
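
    For illustration, here is a minimal sketch of such a throw-away generator (the file name mock.xlsx, the default row count, and the fake cell values are assumptions; the columns match the ones the main script reads, i.e. B for the router ID and Z for the WAN interface):

    #!/usr/bin/perl
    # Sketch only: write an .xlsx with $rows rows of fake data so the main
    # script's run time and memory use can be measured as the volume grows.
    use strict;
    use warnings;
    use Excel::Writer::XLSX;

    my $rows      = shift // 100_000;
    my $workbook  = Excel::Writer::XLSX->new( 'mock.xlsx' );
    my $worksheet = $workbook->add_worksheet();

    for my $row ( 0 .. $rows - 1 ) {
        $worksheet->write( $row, 1,  "node$row" );                 # column B: router ID
        $worksheet->write( $row, 25, "GigabitEthernet0/$row" );    # column Z: WAN interface
    }
    $workbook->close();

    Timing the main script against generated files of increasing size should show how its run time and memory consumption scale.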

    It's surprising to me how fast computers are nowadays, so I rarely find it worthwhile to optimize code except when doing some extremely CPU-intensive work.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hello roboticus,

      OMG you are so right, I missed this part completely. Thanks a lot for the review.

      I need to update the variable names, you are right.

      Thanks again for your time and effort reviewing my code.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by vr (Curate) on Nov 01, 2017 at 11:56 UTC

    Why so complex? Maybe I don't get it, but looking at the sample data there is just a continuous list @A (one column) of keys (strings) in file #2, to be extracted with a one-liner. Almost as easily, file #1 can be read into %H, with its keys transformed by a well-defined rule (?) to correspond to the strings from @A. And finally, output @A and @H{@A} -- for the sake of efficiency (I hope) "in one go".

    To start with (though OT), _checkFilesExtensions doesn't do what its comment says; a simple die "..." if $file !~ /\.xlsx?$/i; would do.

    Update. W.r.t. your question about how "to modify the cell size", I'm sure you've found set_row and set_column. The "height" is easy and looks like a value in PostScript points. The unit of measure for "width" is something hideous, in number of digits of 11-point Calibri, but it also involves some padding and rounding, i.e. the width can be set only approximately. If my reverse-engineering-foo serves me right this time of day, to set a cell to be 40 mm wide and 30 mm high, I'd write

    $worksheet->set_row( 0, 30 / 25.4 * 72 );
    $worksheet->set_column( 0, 0, ( 40 / 25.4 * 72 - 5 ) / 7 );

    but I only have LibreOffice here to confirm

    P.S. And of course, reading my original answer above, I meant "make %A = map { $_ => 1 } @A, then only read those lines from file #1 which have keys existing in %A, then etc." :-)

    And a wrong link for "one go" sneaked in; here's perhaps a more efficient one. I hope that was the final edit.
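
    A minimal sketch of that lookup approach, with the question's sample data inlined as a stand-in for the two sheets (%file1, %A and %H are just illustration names):

    my @A = qw( node1 node2 node3 node4 );        # column A of file #2
    my %A = map { $_ => 1 } @A;                   # for quick existence checks

    # Stand-in for the rows of file #1 (router ID => WAN interface).
    my %file1 = (
        node1 => 'GigabitEthernet0/1',
        node2 => 'GigabitEthernet0/2',
        node3 => 'GigabitEthernet0/3',
        node0 => 'would be skipped',
    );

    my %H;
    while ( my ( $key, $wan ) = each %file1 ) {
        next unless exists $A{$key};              # keep only keys that exist in %A
        $H{$key} = $wan;
    }

    print join( "\t", $_, $H{$_} // 'undef' ), "\n" for @A;   # output @A and @H{@A}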

      Hello vr,

      You got a point there regarding _checkFilesExtensions. I have updated the version.

      Regarding the rest I will take a look and try to update my code.

      Thanks again for the time and effort, BR.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by holli (Abbot) on Nov 01, 2017 at 09:24 UTC
    I am trying to optimize the script not only in terms of speed but also in terms of resource management. I am reading a primary file that contains almost 4,500 lines (emphasis mine)
    So? I'd start worrying at four and a half million lines. If Excel itself doesn't choke on the file, I doubt ParseExcel will. Why don't you just mock up a really large dataset, say 100,000 lines and see how your program does?


    holli

    You can lead your users to water, but alas, you cannot drown them.

      Hello holli,

      Thanks for the time and effort reading and replying to my question.

      As I said, "I have never worked with files this big, so maybe the size is actually quite small in practice and the optimization is not necessary." I will try to run some stress tests as you suggested: mock up a really large dataset, say 100,000 lines, and see how the program responds.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by Laurent_R (Canon) on Nov 01, 2017 at 11:15 UTC
    Same opinion as holli: I would not worry about 4,500 lines, or even about ten or twenty times as many. These are really not large data sets.

    I deal almost daily with files with tens of millions of lines (and sometimes even hundreds of millions); there I might have to worry about performance, but not with tens of thousands.

      Hello Laurent_R,

      Thank you for your time and effort. I lack experience with data files this big.

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Reading and Writing from / to XLSX file (optimization)
by ablanke (Monsignor) on Nov 04, 2017 at 12:02 UTC
    Hi thanos1983,

    Since no one else mentioned it:

    Devel::NYTProf helps you find the performance hotspots in your code.

    Maybe your "small" files are the right place to start with profiling.
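
    A typical invocation, assuming the script is run as shown in the original post (the profiler writes nytprof.out; nytprofhtml turns it into an HTML report):

    $ perl -d:NYTProf test.pl prime.xlsx secondary.xlsx sample.xlsx
    $ nytprofhtml    # writes nytprof/index.html with per-sub and per-line timings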
