 
PerlMonks  

Improving dismal performance - Part 1

by PoorLuzer (Beadle)
on May 12, 2009 at 18:48 UTC ( [id://763564]=perlquestion )

PoorLuzer has asked for the wisdom of the Perl Monks concerning the following question:

I have this script I wrote to convert from one ASCII file format to another.

I have coded this so badly that I get only ~10 KB/min of output :-(

I tried to DProf the code, but it seems that unless the Perl program terminates normally, the DProf output is not of much use.

I use Tie::File both to read the source file and to create the output file.

Here is the code:

#!/usr/bin/perl
# top stats
#
# CPU TTY    PID   USERNAME PRI NI SIZE   RES    STATE  TIME %WCPU %CPU  COMMAND
# 3   pts/12 13833 me       241 20 30444K 18760K run    0:13 52.90 39.85 perl
# 3   pts/12 13833 me       241 20 45036K 33356K run    0:27 62.02 56.93 perl
# 0   pts/12 13833 me       241 20 48748K 37116K run    1:34 78.13 77.99 perl
# 3   pts/12 13833 me       241 20 53996K 42364K run    5:40 71.00 70.88 perl
# 3   pts/12 13833 me       241 20 72172K 60460K run   44:38 72.95 72.83 perl
#
# Some file stats
#
# -rw-r--r-- 1 me mine 2100352 May 12 11:47 mineOut
# -rw-r--r-- 1 me mine  221005 May 12 12:56 mineoutput.converted.to.other
#
# -rw-r--r-- 1 me mine 2100352 May 12 11:47 mineOut
# -rw-r--r-- 1 me mine  239670 May 12 12:57 mineoutput.converted.to.other
#
# -rw-r--r-- 1 me mine 2100352 May 12 11:47 mineOut
# -rw-r--r-- 1 me mine  261315 May 12 12:58 mineoutput.converted.to.other
#
# -rw-r--r-- 1 me mine 2100352 May 12 11:47 mineOut
# -rw-r--r-- 1 me mine  989435 May 12 13:59 mineoutput.converted.to.other
#
# Throughput is around 18665 bytes to 21645 bytes per min -> ~20kb/min
# At this rate 2100352 bytes output will take 113 mins !
# Reality check : 728120 bytes in 1 hour (from 12:58 to 13:59) : 728120 / 60 = 12135 bytes / min -> ~10kb/min

use strict;
use warnings;
use Tie::File;
use Data::Dumper;

# open an existing file in read-only mode
use Fcntl 'O_RDONLY';

# Unfortunately it seems mine and other field names are different.
# Hence, we create a map between the two and replace the mine field name
# with the other one wherever available.
# This is how you do the mapping:
#   mine <-> other
# If your mine and other field names are the same, keep this mapping empty
our %fieldNameMapping = ();
# qw(
#     MSC_CDR_TYPE                RECORD_TYPE
#     MSC_CDR_SEQ_NUM             callIdentificationNumber
#     MSC_CDR_REFER_NUM           networkCallRef
#     MSC_CALL_START_TIME         start_date_time_format
#     MSC_CALL_DURATION           charge_duration_secs
#     MSC_PARTIAL_TYPE            msc_partial_type
#     AX_FIRST_CALLED_LOC_INFO    firstCalledLocInformation
# );

# Put the remaining fields
our @array;
tie @array, 'Tie::File', 'inp', memory => 50_000_000, mode => O_RDONLY, recsep => "\n"
    or die $!;

our @arrayOfother = ();
tie @arrayOfother, 'Tie::File', 'mineoutput.converted.to.other' or die $!;

our $dx = 0;
our $recordID = 0;
our $recordHeader = undef;
our %recordBodyToWriteOut = ();
our $recordTrailer = undef;

for ($dx = 0; $dx < @array; ++$dx) {
    #if($array[$dx++] =~ /Level \(([0-9]+)\) "([^"]+)"/)
    if ($array[1 + $dx] =~ /Level \(1\) "([^"]+)"$/) {
        if ($array[2 + $dx] =~ /Level \(2\) "([^"]+)"$/) {
            if ($array[3 + $dx] =~ /Record \(([0-9]+)\) "([^"]+)"$/) {
                $recordID = $1;
                print STDERR "[*]Got record type $2, number $recordID\n";

                # Write out the record in other format until we get end of record
                $dx += 3;

                #print "RECORD\n";
                $recordHeader = "RECORD\n";    # First value in the header
                $recordHeader .= "#addkey\n#filename FF\n#input_id 001\n";
                %recordBodyToWriteOut = ();    # Reset the record body

                do {
                    if ($array[$dx++] =~ /"([^"]+)" = "([^"]+)"$/) {
                        if ($1 eq 'MSC_CDR_TYPE') {
                            $recordHeader .= "#input_type $2\n#output_id\n#output_type $2\n#source_id SRC\n";
                        }
                        if (exists($fieldNameMapping{ $1 })) {
                            #print "F " . $fieldNameMapping{ $1 } . " $2\n";
                            $recordBodyToWriteOut{ $fieldNameMapping{ $1 } } = $2;
                        }
                        else {
                            #print "F $1 $2\n";
                            $recordBodyToWriteOut{ $1 } = $2;
                        }
                    }
                } until (    ($array[1 + $dx] =~ /End of Record \(${recordID}\)$/)
                          && ($array[2 + $dx] =~ /End of Level \(2\)$/)
                          && ($array[3 + $dx] =~ /End of Level \(1\)$/) );

                $recordTrailer = ".\n";    # First value in the Trailer
                $dx += 2;

                # Now write out the header, fields and trailer
                #print $recordHeader;
                push @arrayOfother, $recordHeader;

                # We want the fields to come out in sorted order
                foreach my $key (sort keys %recordBodyToWriteOut) {
                    #print "F $key " . $recordBodyToWriteOut{ $key } . "\n";
                    push @arrayOfother, "F $key " . $recordBodyToWriteOut{ $key } . "\n";
                }

                #print $recordTrailer;
                push @arrayOfother, $recordTrailer;
            }
        }
    }
}

Here is some input data:

Start of Data
**********************************************************************
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (1) "MSCGSMRec"
"MSC_CDR_TYPE" = "MOC"
"MSC_CALL_START_TIME" = "20090122105929"
"MSC_CALL_END_TIME" = "20090122105944"
"MSC_CALL_DURATION" = "15"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_TRANSLATED_TON" = "1"
End of Sub Record (1)
End of Record (1)
End of Level (2)
End of Level (1)
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (2) "MSCGSMRec"
"MSC_CDR_TYPE" = "MTC"
"MSC_PARTIAL_TYPE" = "0"
"MSC_CALL_START_TIME" = "20090122105927"
"MSC_CALL_END_TIME" = "20090122105945"
"MSC_CALL_DURATION" = "18"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_SWITCH_IDENTITY" = "0001"
"AX_RELATED_NUMBER" = "7F4595"
"AX_FIRST_CALLED_LOC_INFO" = "25F010233203BE"
End of Sub Record (1)
End of Record (2)
End of Level (2)
End of Level (1)

Here is some output:

RECORD
#addkey
#filename FF
#input_id 001
#input_type MOC
#output_id
#output_type MOC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_TRANSLATED_TON 1
F MSC_CALL_DURATION 15
F MSC_CALL_END_TIME 20090122105944
F MSC_CALL_START_TIME 20090122105929
F MSC_CDR_TYPE MOC
F MSC_PARTIAL_INDICATOR S
.
RECORD
#addkey
#filename FF
#input_id 001
#input_type MTC
#output_id
#output_type MTC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_FIRST_CALLED_LOC_INFO 25F010233203BE
F AX_RELATED_NUMBER 7F4595
F AX_SWITCH_IDENTITY 0001
F MSC_CALL_DURATION 18
F MSC_CALL_END_TIME 20090122105945
F MSC_CALL_START_TIME 20090122105927
F MSC_CDR_TYPE MTC
F MSC_PARTIAL_INDICATOR S
F MSC_PARTIAL_TYPE 0
.

Is this completely hopeless? Should I not use a hash to store the field/value pairs? Should I avoid Tie::File and not keep the file contents in an array?

Any other optimizations you can suggest? Fixing the warning "Use of uninitialized value within @array in pattern match (m//) .. at line 75" can always be done last.

Replies are listed 'Best First'.
Re: Improving dismal performance - Part 1
by ELISHEVA (Prior) on May 12, 2009 at 21:46 UTC

    There are several reasons why this script is slow, and Tie::File is only one of them.

    • You are processing each line at least twice. When you run into a line for the first level you look ahead and parse to the end of the record. At this point you have processed N lines. But instead of skipping ahead N lines your for loop advances just one line. This means you visit each line a second time and compare it against a regular expression.
    • Your regular expressions are capturing values you never use.
    • You are reading and parsing the line with the MSC_CDR_TYPE twice. You could avoid this by placing properties in a hash as you read each line.

    By keeping a bit of state and storing field values in a hash as you find them, you can completely eliminate the need for the array, and most of the internal if/else statements and loops as well. Here is a much simplified version of your parser:

    use strict;
    use warnings;

    sub printRecord;

    #--------------------------------------------------
    # Parsing loop
    #--------------------------------------------------
    my $fhOut  = \*STDOUT;
    my $iLevel = 0;
    my %hFields;

    while (my $sLine = <DATA>) {
        #if line defines the level, set level
        if ($sLine =~ /^\s*(?:Level|Record|Sub Record)\s+\(\d+\)/) {
            $iLevel++;
        }
        elsif ($sLine =~ /^\s*End of/) {
            $iLevel--;
        }
        else {
            my ($k, $v) = $sLine =~ /\s+\"(\w+)\"\s+=\s+\"([^"]*)\"/;
            $hFields{$k} = $v;
        }

        #if level back to 0, dump record
        if ($iLevel == 0) {
            printRecord($fhOut, \%hFields);
            %hFields = ();
        }
    }

    #--------------------------------------------------
    # SUBROUTINE DEFINITIONS
    #--------------------------------------------------
    sub printRecord {
        my ($fhOut, $hFields) = @_;
        my $sIOType = $hFields->{MSC_CDR_TYPE};
        print $fhOut "RECORD\n";
        print $fhOut "#addkey\n";
        print $fhOut "#filename FF\n";
        print $fhOut "#input_id 001\n";
        print $fhOut "#input_type $sIOType\n";
        print $fhOut "#output_id\n";
        print $fhOut "#output_type $sIOType\n";
        print $fhOut "#source_id SRC\n";
        foreach my $k (sort keys %$hFields) {
            my $v = $hFields->{$k};
            print $fhOut "F $k $v\n";
        }
        print $fhOut ".\n";
    }

    #cut and paste sample data from above
    __DATA__
      Hi!

      Excellent analysis, thanks!

      Well, I will try out all your suggestions and the complete code later today, but one MAJOR change I made was to switch the output file from Tie::File to normal file I/O (using open)

      .. and performance improved by roughly 7 to 10 times.

      This of course becomes apparent once you look at the profiler output I pasted above.

      I ran the profiler on a file containing some 14k "records" (839754 lines); the tmon results:

      Total Elapsed Time = 456.5656 Seconds
        User+System Time = 206.2156 Seconds
      Exclusive Times
      %Time ExclSec CumulS #Calls sec/call Csec/c Name
       20.2   41.73 41.732 167950   0.0000 0.0000 Tie::File::_read_record
       14.0   28.96 154.93 157920   0.0000 0.0001 Tie::File::_fetch
       12.4   25.71 180.65 157920   0.0000 0.0001 Tie::File::FETCH
       10.2   21.16 35.610 157920   0.0000 0.0000 Tie::File::Cache::lookup
       7.16   14.77 53.769 839753   0.0000 0.0001 Tie::File::Cache::insert
       5.64   11.62 12.893 839755   0.0000 0.0000 Tie::File::_seek
       5.60   11.54 38.998 839753   0.0000 0.0000 Tie::File::Heap::insert
       5.33   10.99 10.991 839753   0.0000 0.0000 Tie::File::Cache::_heap_move
       5.13   10.57 23.874 839753   0.0000 0.0000 Tie::File::Heap::_insert_new
       3.30   6.796  9.688 739447   0.0000 0.0000 Tie::File::Heap::promote
       3.25   6.701 24.732      1   6.7008 24.731 Tie::File::_fill_offsets
       3.14   6.474  6.474 157920   0.0000 0.0000 Tie::File::Heap::_nseq
       2.31   4.756 14.444 739447   0.0000 0.0000 Tie::File::Heap::lookup
       1.12   2.311  2.311 839753   0.0000 0.0000 Tie::File::Heap::_nelts_inc
       0.62   1.271  1.271 839760   0.0000 0.0000 Fcntl::__ANON__
Re: Improving dismal performance - Part 1
by jethro (Monsignor) on May 12, 2009 at 20:15 UTC

    I'm pretty sure tie is not very fast. Remember that tie means a subroutine call every time you access a tied value. Even the size test ("$dx < @array") inside your for loop might be more expensive than you think.
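    That per-access cost can be made visible with a minimal tied array (my illustration, not from the thread; the CountingArray package name is made up). Its FETCH method counts how often it is called; every $tied[$i] read dispatches to it, which is the same kind of overhead Tie::File pays on every element access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package CountingArray;
use Tie::Array;                       # provides Tie::StdArray
our @ISA = ('Tie::StdArray');
our $fetches = 0;

sub FETCH {
    my ($self, $i) = @_;
    $fetches++;                       # every element read lands here
    return $self->SUPER::FETCH($i);
}

package main;

tie my @tied, 'CountingArray';
@tied = ('a' .. 'e');

my $reads = 0;
for (my $i = 0; $i < @tied; ++$i) {   # "$i < @tied" also calls FETCHSIZE each pass
    $reads++ if defined $tied[$i];    # each $tied[$i] calls FETCH
}
print "reads: $reads, FETCH calls: $CountingArray::fetches\n";
```

    Five reads, five method dispatches; a plain array would do the same work with none of them.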

    Tie has its uses when you have to jump around in a document that is too large to fit into memory, but you seem to parse the file sequentially. I assume that the file is at least too big to always fit into memory; otherwise you could simply read it in as one chunk.

    So better to read the file line by line. If you need lookaheads (i.e. checking $dx+3), use an array as a FIFO, or better, drop the lookaheads. You should be able to do that in most cases, for example by parsing the file with a state machine.

    A state machine is simply a single variable, the state variable. You change its value depending on the lines you see. So a value of 5 could mean 'I just parsed a "Record" statement and expect MSC thingies now'. Depending on what you parse, the state variable then changes again or stays the same, and the parser itself is one big switch or if-then-else construct. Check Wikipedia for more info.
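    To make that concrete, here is a minimal sketch of such a state machine for the OP's format (my illustration, not jethro's; the trimmed-down sample data is an assumption based on the input shown above). One state variable, one pass, no lookahead:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trimmed-down sample in the OP's input format (assumed for illustration)
my $sample = <<'END';
Record (1) "MSCGSMRec"
"MSC_CDR_TYPE" = "MOC"
"MSC_CALL_DURATION" = "15"
End of Record (1)
END

open my $in, '<', \$sample or die $!;

my %fields;
my $out   = '';
my $state = 'OUTSIDE';    # OUTSIDE | IN_RECORD

while (my $line = <$in>) {
    if ($state eq 'OUTSIDE') {
        # a "Record (n)" line moves us into a record
        $state = 'IN_RECORD' if $line =~ /^\s*Record \(\d+\)/;
    }
    elsif ($state eq 'IN_RECORD') {
        if ($line =~ /"([^"]+)" = "([^"]+)"/) {
            $fields{$1} = $2;               # collect field/value pairs
        }
        elsif ($line =~ /^\s*End of Record/) {
            # end of record: emit it with the fields sorted, then reset
            $out .= "RECORD\n";
            $out .= "F $_ $fields{$_}\n" for sort keys %fields;
            $out .= ".\n";
            %fields = ();
            $state  = 'OUTSIDE';
        }
    }
}
print $out;
```

    A real version would also track the Level and Sub Record states and emit the full header, but the shape stays the same: read a line, look at the state, update the state.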

    UPDATE: Corrected example

      I would be willing to bet money on this theory.
Re: Improving dismal performance - Part 1
by tim.bunce (Scribe) on May 12, 2009 at 20:35 UTC
    DProf is broken. Use Devel::NYTProf instead.
Re: Improving dismal performance - Part 1
by John M. Dlugosz (Monsignor) on May 12, 2009 at 19:31 UTC
    You might try getting it to terminate normally, at least on a small set of test data. Even if it's not doing its full and normal job, at least run through the code you want to profile.

    —John

      I ran it on an input of 20 "records" (1173 lines).

      Here is the output:

      Total Elapsed Time = 5.231984 Seconds
        User+System Time = 2.941984 Seconds
      Inclusive Times
      %Time ExclSec CumulS #Calls sec/call Csec/c Name
       85.7   0.014  2.524   1037   0.0000 0.0024 Tie::File::PUSH
       84.8   0.014  2.496   1037   0.0000 0.0024 Tie::File::SPLICE
       83.1   0.133  2.447   1037   0.0001 0.0024 Tie::File::_splice
       67.1   1.930  1.974   1037   0.0019 0.0019 Tie::File::_oadjust
       10.4   0.038  0.308   2204   0.0000 0.0001 Tie::File::FETCH
       9.18   0.020  0.270   2204   0.0000 0.0001 Tie::File::_fetch
       8.19   0.019  0.241   1037   0.0000 0.0002 Tie::File::_mtwrite
       6.42   0.110  0.189      5   0.0220 0.0377 main::BEGIN
       5.20   0.069  0.153   1037   0.0001 0.0001 Tie::File::_downcopy
       4.08   0.120  0.120   2074   0.0001 0.0001 Tie::File::_write_record
       3.33   0.034  0.098   2204   0.0000 0.0000 Tie::File::Cache::lookup
       2.92   0.038  0.086   1172   0.0000 0.0001 Tie::File::Cache::insert
       2.69   0.049  0.079   2074   0.0000 0.0000 Tie::File::_cache_flush
       2.65   0.078  0.078   2348   0.0000 0.0000 Tie::File::_read_record
       2.41   0.039  0.071   3134   0.0000 0.0000 Tie::File::FETCHSIZE

      Obviously, Tie::File::PUSH is killing me, and that is what I thought from the beginning as well, but it seems there's no way to do away with it, at least with my simple approach?

      Any pointers?

        Well, yes, Acolyte. The only thing you are doing with @arrayOfother is pushing to it. So forget that array, and the tie, completely. Instead, open a file for writing, and replace the push with a print to that file.

        I'm supposing that there is a good reason for tying the input array rather than just reading the whole thing into a real array: the file is too large. So I'd not change that one unless it is a performance problem too.

        But @arrayOfother is only referenced three times in the code, and every reference is a push. That's not a proper use of an array; that's an output stream.
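        A minimal sketch of that change (mine, not the OP's final code; the filename matches the OP's script, everything else is illustrative data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Instead of: tie @arrayOfother, 'Tie::File', 'mineoutput.converted.to.other'
open my $out, '>', 'mineoutput.converted.to.other' or die $!;

my $recordHeader = "RECORD\n#addkey\n#filename FF\n#input_id 001\n";
my %recordBodyToWriteOut = (MSC_CDR_TYPE => 'MOC', MSC_CALL_DURATION => '15');

# was: push @arrayOfother, $recordHeader;
print {$out} $recordHeader;

# was: push @arrayOfother, "F $key " . $recordBodyToWriteOut{$key} . "\n";
print {$out} "F $_ $recordBodyToWriteOut{$_}\n" for sort keys %recordBodyToWriteOut;

# was: push @arrayOfother, $recordTrailer;
print {$out} ".\n";

close $out or die $!;
```

        One thing to watch: Tie::File appends the record separator to each pushed record for you, while print writes exactly what it is given, so make sure every printed chunk ends in "\n" (as the OP's strings already do).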

        —John

Re: Improving dismal performance - Part 1
by rir (Vicar) on May 12, 2009 at 20:59 UTC
    This isn't complete, but shows a common approach.

    Be well
    rir

    # XXX lots of possibly magic string literals
    $| = 1;
    local $_;
    while (<DATA>) {
        # find next record
        next unless /^\s*Record /;

        # read a record
        $_ = '';
        my @record;
        do {
            s/^ *//;
            push @record, $_
                if $_ && $_ !~ /^\s*End of Sub/ && $_ !~ /^\s*Sub/;
            $_ = <DATA>;
        } while ( $_ !~ /^\s*End of Record/ );

        # massage the raw data in @record
        # XXX incomplete
        for my $item (@record) {
            $item =~ s/" = "/ /;
            $item =~ s/"//g;
            $item = "F " . $item;
        }
        @record = sort @record;

        # output preamble
        print STDOUT "RECORD\n";
        print STDOUT "#addkey\n";
        print STDOUT "#filename FF\n";
        # XXX more?

        # output @record
        print STDOUT @record;

        # output trailer
        print STDOUT ".\n";
    }
      Yes, the script before this one (before I used Tie) looked exactly like this, and that is still what I use now, after the new version failed to do anything useful in a reasonable time. But I was hoping that maybe I could optimize this code somehow; using Tie makes things so easy, just like looping through an array!

      Well, I am not totally out of luck, as I am reading line by line anyway, and not much change would be required to translate it into direct file I/O .. but in case someone comes across a light bulb in their vicinity, keep this node updated!

Re: Improving dismal performance - Part 1
by GrandFather (Saint) on May 13, 2009 at 02:26 UTC

    I don't know about improving the performance, but the code can be cleaned up. Consider:

    use strict;
    use warnings;

    my %fieldNameMapping;
    my @recentLines;

    fetchLines(\@recentLines, 2);

    while (fetchLines(\@recentLines, 1)) {
        next if $recentLines[-3] !~ /Level \(1\) "([^"]+)"$/;
        next if $recentLines[-2] !~ /Level \(2\) "([^"]+)"$/;
        next if $recentLines[-1] !~ /Record \(([0-9]+)\) "([^"]+)"$/;

        my $recordID = $1;
        my %recordBodyToWriteOut;
        my $recordHeader = "RECORD\n";    # First value in the header
        $recordHeader .= "#addkey\n#filename FF\n#input_id 001\n";

        do {
            if (nextLine(\@recentLines) =~ /"([^"]+)" = "([^"]+)"$/) {
                if ($1 eq 'MSC_CDR_TYPE') {
                    $recordHeader .= "#input_type $2\n#output_id\n#output_type $2\n#source_id SRC\n";
                }
                if (exists $fieldNameMapping{$1}) {
                    $recordBodyToWriteOut{$fieldNameMapping{$1}} = $2;
                } else {
                    $recordBodyToWriteOut{$1} = $2;
                }
            }
        } until ($recentLines[-3] =~ /End of Record \(${recordID}\)$/)
             && ($recentLines[-2] =~ /End of Level \(2\)$/)
             && ($recentLines[-1] =~ /End of Level \(1\)$/);

        my $recordTrailer = ".\n";    # First value in the Trailer
        fetchLines(\@recentLines, 2);

        print $recordHeader;
        foreach my $key (sort keys %recordBodyToWriteOut) {
            print "F $key " . $recordBodyToWriteOut{$key} . "\n";
        }
        print $recordTrailer;
    }

    sub fetchLines {
        my ($lines, $toFetch) = @_;
        my $fetched = 0;
        while ($toFetch-- > 0) {
            my $newLine = nextLine($lines);
            return $fetched > 0 if !defined $newLine;
            ++$fetched;
        }
        return 1;
    }

    sub nextLine {
        my ($lines) = @_;
        my $line = <DATA>;
        return undef if !defined $line;
        push @$lines, $line;
        shift @$lines while @$lines > 3;
        return $line;
    }

    __DATA__
    Start of Data
    **********************************************************************
    Level (1) "COMMONRec"
    Level (2) "MSCCDR"
    Record (1) "MSCGSMRec"
    "MSC_CDR_TYPE" = "MOC"
    "MSC_CALL_START_TIME" = "20090122105929"
    "MSC_CALL_END_TIME" = "20090122105944"
    "MSC_CALL_DURATION" = "15"
    "MSC_PARTIAL_INDICATOR" = "S"
    Sub Record (1) "AXECallDataRecord"
    "AX_DISCONNECT_PARTY" = "1"
    "AX_CHARGED_PARTY" = "0"
    "AX_TRANSLATED_TON" = "1"
    End of Sub Record (1)
    End of Record (1)
    End of Level (2)
    End of Level (1)
    Level (1) "COMMONRec"
    Level (2) "MSCCDR"
    Record (2) "MSCGSMRec"
    "MSC_CDR_TYPE" = "MTC"
    "MSC_PARTIAL_TYPE" = "0"
    "MSC_CALL_START_TIME" = "20090122105927"
    "MSC_CALL_END_TIME" = "20090122105945"
    "MSC_CALL_DURATION" = "18"
    "MSC_PARTIAL_INDICATOR" = "S"
    Sub Record (1) "AXECallDataRecord"
    "AX_DISCONNECT_PARTY" = "1"
    "AX_CHARGED_PARTY" = "0"
    "AX_SWITCH_IDENTITY" = "0001"
    "AX_RELATED_NUMBER" = "7F4595"
    "AX_FIRST_CALLED_LOC_INFO" = "25F010233203BE"
    End of Sub Record (1)
    End of Record (2)
    End of Level (2)
    End of Level (1)

    Prints:

    RECORD
    #addkey
    #filename FF
    #input_id 001
    #input_type MOC
    #output_id
    #output_type MOC
    #source_id SRC
    F AX_CHARGED_PARTY 0
    F AX_DISCONNECT_PARTY 1
    F AX_TRANSLATED_TON 1
    F MSC_CALL_DURATION 15
    F MSC_CALL_END_TIME 20090122105944
    F MSC_CALL_START_TIME 20090122105929
    F MSC_CDR_TYPE MOC
    F MSC_PARTIAL_INDICATOR S
    .
    RECORD
    #addkey
    #filename FF
    #input_id 001
    #input_type MTC
    #output_id
    #output_type MTC
    #source_id SRC
    F AX_CHARGED_PARTY 0
    F AX_DISCONNECT_PARTY 1
    F AX_FIRST_CALLED_LOC_INFO 25F010233203BE
    F AX_RELATED_NUMBER 7F4595
    F AX_SWITCH_IDENTITY 0001
    F MSC_CALL_DURATION 18
    F MSC_CALL_END_TIME 20090122105945
    F MSC_CALL_START_TIME 20090122105927
    F MSC_CDR_TYPE MTC
    F MSC_PARTIAL_INDICATOR S
    F MSC_PARTIAL_TYPE 0
    .

    Notes:

    our is not needed at all. Use my instead. I doubt our does what you think it does.

    Bail as soon as possible instead of nesting conditional code. The logic is easier to follow and easier to get right.

    Declare variables where they first get a real value. That is, make their scope as small as possible.

    Don't provide default values for variables, unless it really is a default. In particular, Perl generates arrays and hashes empty and scalars with the value undef - you don't have to do that explicitly.
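    A tiny illustration of the my-vs-our and scoping notes above (my example, not GrandFather's):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# instead of:  our $recordID = 0;   ...much later...   $recordID = $1;
# declare with my, at the point where the variable first gets a real value:
if ('Record (7) "MSCGSMRec"' =~ /Record \((\d+)\)/) {
    my $recordID = $1;    # smallest possible scope, no default needed
    print "record $recordID\n";
}
# $recordID does not exist out here, so stale values can't leak between records
```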


    True laziness is hard work
Re: Improving dismal performance - Part 1
by John M. Dlugosz (Monsignor) on May 12, 2009 at 19:28 UTC
    From the PerlMonks Markup page:
    If you find it necessary (and you should make every effort to ensure it is NOT) to create a very long post, please use <readmore> … </readmore> tags. Should your post make it to the front page, the readmore segment collapses to a link by which the interested reader can see the entire post.
    You may want to edit your message and insert such a tag.

      Whoops! Fixed now.
Re: Improving dismal performance - Part 1
by apl (Monsignor) on May 12, 2009 at 19:48 UTC
    Replace multiple concatenations with a single statement. That is, replace
    $recordHeader = "RECORD\n";    # First value in the header
    $recordHeader .= "#addkey\n";
    $recordHeader .= "#filename FF\n";
    $recordHeader .= "#input_id 001\n";
    with
    $recordHeader = "RECORD\n#addkey\n#filename FF\n#input_id 001\n";
      No difference in speed .. but made the change in the source :-)

Node Type: perlquestion [id://763564]
Approved by mr_mischief