http://qs321.pair.com?node_id=1138607

usertest has asked for the wisdom of the Perl Monks concerning the following question:

We have a requirement to read a source file and, for every row, apply some transformations and write the result to an output file. However, the code takes a long time: approximately 25 seconds to process a 500MB file. Please suggest any performance improvements we could apply.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];

open(DATAIN,  "<$infile");
open(DATAOUT, ">$outfile");

while (<DATAIN>) {
    my ($line) = $_;
    chomp($line);
    my @Fields = split(',', $line, 9);
    my $X = $Fields[8];
    my $Y = substr $X, 0, 10;
    my $A = strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", gmtime $Y;
    my $B = substr($A, 0, index($A, ','));
    my $C = int($B/5);
    my $D = int($B/15);
    print DATAOUT $line, ",", $Y, ",", $A, ",", $C, ",", $D, "\n";
}

close(DATAIN);
close(DATAOUT);

Replies are listed 'Best First'.
Re: Optimizing perl code performance
by MidLifeXis (Monsignor) on Aug 14, 2015 at 17:57 UTC

    Remove the processing and replace it with just a print DATAOUT $line;. Time this. See how much time is actually spent in the processing part of this. I would guess that the largest part of your time is spent doing I/O, especially if $infile and $outfile are on the same data path (controller, disk, ...).

    Once you show that a significant portion of time is spent in the processing, and that splitting the work will help, _then_ perhaps you _might_ gain value from splitting the processing apart into multiple workers. Personally, I would guess from experience that you will get better performance gains by putting your input and output files on different data paths.
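
    A minimal sketch of that baseline (same filehandle names as the original; file names come from @ARGV):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # I/O-only baseline: copy input to output with no transformation.
        # If this alone accounts for most of the 25 seconds, the
        # bottleneck is I/O, not the per-line processing.
        open(DATAIN,  "<", $ARGV[0]) or die "open $ARGV[0]: $!";
        open(DATAOUT, ">", $ARGV[1]) or die "open $ARGV[1]: $!";
        print DATAOUT $_ while <DATAIN>;
        close(DATAIN);
        close(DATAOUT);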

    --MidLifeXis

      The standard advice for optimizing any code is to profile it, and then optimize only those parts which are shown to consume a large part of the processing time. In this case, MidLifeXis's suggestion accomplishes much the same thing with much less work.
      Bill
Re: Optimizing perl code performance
by enemyofthestate (Monk) on Aug 14, 2015 at 20:56 UTC

    I find Devel::NYTProf to work really well at finding where my Perl code is spending its time.
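
    For reference, a typical invocation looks like this (script and file names are placeholders):

        perl -d:NYTProf script.pl input.csv output.csv
        nytprofhtml   # writes an HTML report under ./nytprof/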

Re: Optimizing perl code performance
by Anonymous Monk on Aug 14, 2015 at 18:03 UTC
    Your sample data (provided elsewhere in this thread) reflects lines of approximately 390 bytes. Your filesize is 500MB. So you have over 1.3 million lines to process. On my system calling strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", gmtime $Y 1.3 million times takes over 30 seconds.

    If the first ten digits that comprise "$Y" are repeated many times, you could cache on that value and not have to make 1.3 million calls to gmtime, and 1.3 million calls to strftime. But you're still calling split 1.3 million times, substr 1.3 million times, and so on.
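
    A minimal sketch of that caching idea (a hash keyed on the 10-digit epoch value; the sub name is made up for illustration):

        use POSIX qw(strftime);

        my %stamp_cache;

        # Format the date fields once per distinct epoch value; repeats
        # become a single hash lookup instead of gmtime + strftime.
        sub cached_stamp {
            my ($Y) = @_;
            $stamp_cache{$Y} //= strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", gmtime $Y;
        }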

    You may find it better to chunk the input file and process it with four workers, each writing its own output file, then cat the output files together. It's possible (though not entirely certain) that with a sane number of workers each doing its share of the work, this could go faster.
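
    A rough sketch of that idea, assuming the input has already been split into part files (for example with the standard split(1) utility: split -n l/4 infile part.) and that process_file is a hypothetical sub applying the per-line transformation:

        my @parts = glob "part.*";
        my @pids;

        for my $part (@parts) {
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {                      # child: process one part
                process_file($part, "$part.out"); # hypothetical worker sub
                exit 0;
            }
            push @pids, $pid;                     # parent: remember child
        }
        waitpid $_, 0 for @pids;                  # wait for all workers
        # afterwards: cat part.*.out > outfile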

Re: Optimizing perl code performance
by Intermediate Dave (Novice) on Aug 15, 2015 at 03:10 UTC
    Perl's core modules include 'Memoize', which attempts to make functions faster by "trading space for time."

    http://perldoc.perl.org/Memoize.html
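
    A minimal sketch of how Memoize might apply here (the expensive call wrapped in a named sub; the sub name is made up):

        use POSIX qw(strftime);
        use Memoize;

        # Cache the formatted string per epoch value; repeated inputs
        # skip gmtime and strftime entirely.
        sub date_fields {
            my ($epoch) = @_;
            return strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", gmtime $epoch;
        }
        memoize('date_fields');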

    But I'd have to do trial-and-error to see what worked. (I'm wondering if a regExp would be faster than split.)
    my ($X) = /^(?:[^,]*,){8}(.*)/;
    Then you wouldn't even need to call chomp, since (.*) stops before the trailing newline. (And in general, I'm wondering if it would speed things up to combine the other instructions.)
    chomp( my $line = $_ );
    UPDATE: It occurred to me that the code re-declares every variable on every pass through the loop. It seems like it might help to declare them once, outside the loop:
    my ($X, $Y, $A, $B, $C, $D);
    while (<DATAIN>)
      I'm wondering if a regExp would be faster than split.
      split takes a regex, so an optimized regex similar to the split could only be marginally better, and potentially much worse, especially for simple cases like this.
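
      One way to check that directly is the core Benchmark module (a sketch; the sample line is made up):

          use Benchmark qw(cmpthese);

          my $line = join ',', ('field') x 8, 'payload' x 40;

          # Negative count: run each sub for about 3 CPU seconds.
          cmpthese(-3, {
              split => sub { my $x = (split ',', $line, 9)[8] },
              regex => sub { my ($x) = $line =~ /^(?:[^,]*,){8}(.*)/ },
          });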

      Likewise chomp is very simple, and is probably zillions of times faster than IO.

      Profile, then fiddle. Lather, rinse, repeat.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Optimizing perl code performance
by flexvault (Monsignor) on Aug 14, 2015 at 17:26 UTC

    Welcome usertest,

    A few lines of sample input would help.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Thanks Ed. Data is given below. We need to extract date-related information from the 9th field (whose first 10 digits signify the time since the epoch, in seconds) and append this additional data to the end of each line in the output file.

Re: Optimizing perl code performance
by marioroy (Prior) on Aug 15, 2015 at 14:48 UTC

    Update: My next attempt compares running with threads and with non-threads (child processes) on the Mac and on Linux. There is something strange about strftime that causes the script to slow down under one or the other, depending on the OS.

    Update: The serial code runs faster on a Linux VM. For some reason, the strftime function degrades in performance when running with many workers (even threads on Linux). I'm not sure why.

    In my testing, strftime performs poorly when many workers call it simultaneously. It is fine with threads, but one must limit the number of workers.

    On my laptop (running Mac OS X), the serial code completes in 19.131 seconds for a 500 MB file, while the MCE version completes in 6.569 seconds. Most of that time comes from strftime. I verified this by replacing $A = strftime with $A = $Y, which completes in 1.842 seconds.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use POSIX qw(strftime);
    use MCE::Loop;
    use MCE::Candy;

    my $infile  = $ARGV[0];
    my $outfile = $ARGV[1];

    open(DATAOUT, ">", $outfile);

    ## Workers process chunks in parallel until completed.
    ## Output order is preserved via MCE::Candy::out_iter_fh
    MCE::Loop::init {
        chunk_size => "2m", max_workers => 4, use_slurpio => 1,
        gather => MCE::Candy::out_iter_fh(\*DATAOUT),
        use_threads => 1
    };

    mce_loop_f {
        my ($mce, $chunkRef, $chunkID) = @_;
        my ($output, @Fields, $X, $Y, $A, $B, $C, $D) = ("");

        open my $CHUNKIN, "<", $chunkRef;

        while ( my $line = <$CHUNKIN> ) {
            chomp $line;
            @Fields = split(',', $line, 9);
            $X = $Fields[8];
            $Y = substr $X, 0, 10;
            $A = strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", gmtime $Y;
            $B = substr($A, 0, index($A, ','));
            $C = int($B/5);
            $D = int($B/15);
            $output .= $line.",$Y,$A,$C,$D\n";
        }

        close $CHUNKIN;
        MCE->gather($chunkID, $output);

    } $infile;

    close(DATAOUT);

    Kind regards, Mario.

      Update: The disparity is coming from strftime.

      Update: One must use threads on the Mac and non-threads on Linux for best performance. This is mind-boggling to me. Replacing the strftime line with $A = $Y completes in a couple of seconds, with threads or non-threads, on both the Mac and Linux.

      The same 500 MB input file is used by both OS.

      Mac OS X    Serial:    18.185s
      Mac OS X    Parallel:   6.687s  (threads)
      Mac OS X    Parallel:  42.526s  (non-threads)
      CentOS 7 VM Serial:    10.832s
      CentOS 7 VM Parallel:  23.849s  (threads)
      CentOS 7 VM Parallel:   2.993s  (non-threads)
      #!/usr/bin/perl
      use strict;
      use warnings;
      use threads;          # Comment out threads for child processes
      use threads::shared;  # needed for the :shared attribute below
      use POSIX qw(strftime);
      use MCE::Loop;
      use MCE::Candy;

      my $mutex :shared = 0;

      my $infile  = $ARGV[0];
      my $outfile = $ARGV[1];

      open(DATAOUT, ">", $outfile);

      ## Workers process chunks in parallel until completed.
      ## Output order is preserved via MCE::Candy::out_iter_fh
      MCE::Loop::init {
          chunk_size => "2m", max_workers => 4, use_slurpio => 1,
          gather => MCE::Candy::out_iter_fh(\*DATAOUT)
      };

      mce_loop_f {
          my ($mce, $chunkRef, $chunkID) = @_;
          my ($output, @Fields, $X, $Y, $A, $B, $C, $D, @G) = ("");

          open my $CHUNKIN, "<", $chunkRef;

          while ( my $line = <$CHUNKIN> ) {
              chomp $line;
              @Fields = split(',', $line, 9);
              $X = $Fields[8];
              $Y = substr $X, 0, 10;
              @G = gmtime $Y;
              $A = strftime "%M,%Y,%m,%d,%H,%j,%W,%u,%A", @G;
              $B = substr($A, 0, index($A, ','));
              $C = int($B/5);
              $D = int($B/15);
              $output .= $line.",$Y,$A,$C,$D\n";
          }

          close $CHUNKIN;
          MCE->gather($chunkID, $output);

      } $infile;

      close(DATAOUT);