PoorLuzer has asked for the wisdom of the Perl Monks concerning the following question:
I have this script I wrote to convert from one ASCII file format to another.
I have coded it so badly that I get only 10kb/min of output :-(
I tried to profile the code with DProf, but it seems that unless the Perl code terminates normally, the DProf output is not of much use.
I use Tie::File to open the source file as well as create the output file.
Here is the code:
#!/usr/bin/perl
# top stats
#
#  CPU TTY     PID USERNAME PRI NI  SIZE    RES STATE   TIME %WCPU  %CPU COMMAND
#    3 pts/12 13833 me       241 20 30444K 18760K run   0:13 52.90 39.85 perl
#    3 pts/12 13833 me       241 20 45036K 33356K run   0:27 62.02 56.93 perl
#    0 pts/12 13833 me       241 20 48748K 37116K run   1:34 78.13 77.99 perl
#    3 pts/12 13833 me       241 20 53996K 42364K run   5:40 71.00 70.88 perl
#    3 pts/12 13833 me       241 20 72172K 60460K run  44:38 72.95 72.83 perl
#
# Some file stats
#
# -rw-r--r--   1 me   mine   2100352 May 12 11:47 mineOut
# -rw-r--r--   1 me   mine    221005 May 12 12:56 mineoutput.converted.to.other
#
# -rw-r--r--   1 me   mine   2100352 May 12 11:47 mineOut
# -rw-r--r--   1 me   mine    239670 May 12 12:57 mineoutput.converted.to.other
#
# -rw-r--r--   1 me   mine   2100352 May 12 11:47 mineOut
# -rw-r--r--   1 me   mine    261315 May 12 12:58 mineoutput.converted.to.other
#
# -rw-r--r--   1 me   mine   2100352 May 12 11:47 mineOut
# -rw-r--r--   1 me   mine    989435 May 12 13:59 mineoutput.converted.to.other
#
# Throughput is around 18665 bytes to 21645 bytes per min -> ~20kb/min
# At this rate 2100352 bytes of output will take 113 mins !
# Reality check : 728120 bytes in 1 hour (from 12:58 to 13:59) : 728120 / 60 = 12135 bytes / min -> ~10kb/min
use strict;
use warnings;
use Tie::File;
use Data::Dumper;
# open an existing file in read-only mode
use Fcntl 'O_RDONLY';
# Unfortunately it seems mine and other field names are different. Hence we
# create a map between the two and replace the mine field name with the
# other one wherever available.
# This is how you do the mapping:
#     mine <-> other
# If your mine and other field names are the same, keep this mapping empty.
our %fieldNameMapping = ();
# qw(
#     MSC_CDR_TYPE               RECORD_TYPE
#     MSC_CDR_SEQ_NUM            callIdentificationNumber
#     MSC_CDR_REFER_NUM          networkCallRef
#     MSC_CALL_START_TIME        start_date_time_format
#     MSC_CALL_DURATION          charge_duration_secs
#     MSC_PARTIAL_TYPE           msc_partial_type
#     AX_FIRST_CALLED_LOC_INFO   firstCalledLocInformation
# );
# Put the remaining fields
our @array;
tie @array, 'Tie::File', 'inp', memory => 50_000_000, mode => O_RDONLY, recsep => "\n" or die $!;
our @arrayOfother = ();
tie @arrayOfother, 'Tie::File', 'mineoutput.converted.to.other' or die $!;
our $dx = 0;
our $recordID = 0;
our $recordHeader = undef;
our %recordBodyToWriteOut = ();
our $recordTrailer = undef;
for($dx = 0; $dx < @array; ++$dx)
{
    #if($array[$dx++] =~ /Level \(([0-9]+)\) "([^"]+)"/)
    if($array[1 + $dx] =~ /Level \(1\) "([^"]+)"$/)
    {
        if($array[2 + $dx] =~ /Level \(2\) "([^"]+)"$/)
        {
            if($array[3 + $dx] =~ /Record \(([0-9]+)\) "([^"]+)"$/)
            {
                $recordID = $1;
                print STDERR "[*]Got record type $2, number $recordID\n";
                # Write out the record in other format until we get end of record
                $dx += 3;
                #print "RECORD\n";
                $recordHeader = "RECORD\n"; # First value in the header
                $recordHeader .= "#addkey\n#filename FF\n#input_id 001\n";
                %recordBodyToWriteOut = (); # Reset the record body
                do
                {
                    if($array[$dx++] =~ /"([^"]+)" = "([^"]+)"$/)
                    {
                        if($1 eq 'MSC_CDR_TYPE')
                        {
                            $recordHeader .= "#input_type $2\n#output_id\n#output_type $2\n#source_id SRC\n";
                        }
                        if(exists($fieldNameMapping{ $1 }))
                        {
                            #print "F " . $fieldNameMapping{ $1 } . " $2\n";
                            $recordBodyToWriteOut{ $fieldNameMapping{ $1 } } = $2;
                        }
                        else
                        {
                            #print "F $1 $2\n";
                            $recordBodyToWriteOut{ $1 } = $2;
                        }
                    }
                }
                until(
                    ($array[1 + $dx] =~ /End of Record \(${recordID}\)$/)
                    &&
                    ($array[2 + $dx] =~ /End of Level \(2\)$/)
                    &&
                    ($array[3 + $dx] =~ /End of Level \(1\)$/)
                );
                $recordTrailer = ".\n"; # First value in the Trailer
                $dx += 2;
                # Now write out the header, fields and trailer
                #print $recordHeader;
                push @arrayOfother, $recordHeader;
                # We want the fields to come out in sorted order
                foreach my $key (sort keys %recordBodyToWriteOut)
                {
                    #print "F $key " . $recordBodyToWriteOut{ $key } . "\n";
                    push @arrayOfother, "F $key " . $recordBodyToWriteOut{ $key } . "\n";
                }
                #print $recordTrailer;
                push @arrayOfother, $recordTrailer;
            }
        }
    }
}
Here is some input data:
Start of Data
**********************************************************************
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (1) "MSCGSMRec"
"MSC_CDR_TYPE" = "MOC"
"MSC_CALL_START_TIME" = "20090122105929"
"MSC_CALL_END_TIME" = "20090122105944"
"MSC_CALL_DURATION" = "15"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_TRANSLATED_TON" = "1"
End of Sub Record (1)
End of Record (1)
End of Level (2)
End of Level (1)
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (2) "MSCGSMRec"
"MSC_CDR_TYPE" = "MTC"
"MSC_PARTIAL_TYPE" = "0"
"MSC_CALL_START_TIME" = "20090122105927"
"MSC_CALL_END_TIME" = "20090122105945"
"MSC_CALL_DURATION" = "18"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_SWITCH_IDENTITY" = "0001"
"AX_RELATED_NUMBER" = "7F4595"
"AX_FIRST_CALLED_LOC_INFO" = "25F010233203BE"
End of Sub Record (1)
End of Record (2)
End of Level (2)
End of Level (1)
Here is some output:
RECORD
#addkey
#filename FF
#input_id 001
#input_type MOC
#output_id
#output_type MOC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_TRANSLATED_TON 1
F MSC_CALL_DURATION 15
F MSC_CALL_END_TIME 20090122105944
F MSC_CALL_START_TIME 20090122105929
F MSC_CDR_TYPE MOC
F MSC_PARTIAL_INDICATOR S
.
RECORD
#addkey
#filename FF
#input_id 001
#input_type MTC
#output_id
#output_type MTC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_FIRST_CALLED_LOC_INFO 25F010233203BE
F AX_RELATED_NUMBER 7F4595
F AX_SWITCH_IDENTITY 0001
F MSC_CALL_DURATION 18
F MSC_CALL_END_TIME 20090122105945
F MSC_CALL_START_TIME 20090122105927
F MSC_CDR_TYPE MTC
F MSC_PARTIAL_INDICATOR S
F MSC_PARTIAL_TYPE 0
.
Is this completely hopeless? Should I not use a hash to store the field-value pairs? Should I not use Tie::File and store the file contents in the array?
Any other optimizations you can suggest? The error message "Use of uninitialized value within @array in pattern match (m//) .. at line 75" can be dealt with last.
Re: Improving dismal performance - Part 1
by ELISHEVA (Prior) on May 12, 2009 at 21:46 UTC
There are several reasons why this script is slow, and Tie::File is only one of them.
- You are processing each line at least twice. When you run into a line for the first level you look ahead and parse to the end of the record. At this point you have processed N lines. But instead of skipping ahead N lines your for loop advances just one line. This means you visit each line a second time and compare it against a regular expression.
- Your regular expressions capture values you never use.
- You are reading and parsing the MSC_CDR_TYPE line twice. You could avoid this by placing properties in a hash as you read each line.
By keeping a bit of state and storing field values in a hash as you find them, you can completely eliminate the need for an array, and most of the internal if/else statements and loops as well. Here is a much simplified version of your parser:
use strict;
use warnings;
sub printRecord;
#--------------------------------------------------
# Parsing loop
#--------------------------------------------------
my $fhOut = \*STDOUT;
my $iLevel=0;
my %hFields;
while (my $sLine = <DATA>) {
    # if line opens a level/record, go one level deeper
    if ($sLine =~ /^\s*(?:Level|Record|Sub Record)\s+\(\d+\)/) {
        $iLevel++;
    } elsif ($sLine =~ /^\s*End of/) {
        $iLevel--;
    } elsif (my ($k, $v) = $sLine =~ /\"(\w+)\"\s+=\s+\"([^"]*)\"/) {
        # a "key" = "value" line: remember the field
        $hFields{$k} = $v;
    }
    # if level is back to 0 and we collected fields, dump the record
    if ($iLevel == 0 && %hFields) {
        printRecord($fhOut, \%hFields);
        %hFields = ();
    }
}
#--------------------------------------------------
# SUBROUTINE DEFINITIONS
#--------------------------------------------------
sub printRecord {
my ($fhOut, $hFields) = @_;
my $sIOType = $hFields->{MSC_CDR_TYPE};
print $fhOut "RECORD\n";
print $fhOut "#addkey\n";
print $fhOut "#filename FF\n";
print $fhOut "#input_id 001\n";
print $fhOut "#input_type $sIOType\n";
print $fhOut "#output_id\n";
print $fhOut "#output_type $sIOType\n";
print $fhOut "#source_id SRC\n";
foreach my $k (sort keys %$hFields) {
my $v = $hFields->{$k};
print $fhOut "F $k $v\n";
}
print $fhOut ".\n";
}
#cut and paste sample data from above
__DATA__
Hi!
Excellent analysis, thanks!
Well, I will try out all your suggestions and the complete code later today, but one MAJOR change I made was to switch the output file from Tie::File to normal file I/O (using open) .. and performance improved by roughly 7-10 times.
This of course becomes apparent once you look at the profiler output I pasted above.
I ran the profiler on a file having some 14k "records" (839754 lines) and the tmon results:
Total Elapsed Time = 456.5656 Seconds
User+System Time = 206.2156 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
20.2 41.73 41.732 167950 0.0000 0.0000 Tie::File::_read_record
14.0 28.96 154.93 157920 0.0000 0.0001 Tie::File::_fetch
12.4 25.71 180.65 157920 0.0000 0.0001 Tie::File::FETCH
10.2 21.16 35.610 157920 0.0000 0.0000 Tie::File::Cache::lookup
7.16 14.77 53.769 839753 0.0000 0.0001 Tie::File::Cache::insert
5.64 11.62 12.893 839755 0.0000 0.0000 Tie::File::_seek
5.60 11.54 38.998 839753 0.0000 0.0000 Tie::File::Heap::insert
 5.33 10.99 10.991 839753 0.0000 0.0000 Tie::File::Cache::_heap_move
 5.13 10.57 23.874 839753 0.0000 0.0000 Tie::File::Heap::_insert_new
3.30 6.796 9.688 739447 0.0000 0.0000 Tie::File::Heap::promote
3.25 6.701 24.732 1 6.7008 24.731 Tie::File::_fill_offsets
3.14 6.474 6.474 157920 0.0000 0.0000 Tie::File::Heap::_nseq
2.31 4.756 14.444 739447 0.0000 0.0000 Tie::File::Heap::lookup
 1.12 2.311 2.311 839753 0.0000 0.0000 Tie::File::Heap::_nelts_inc
0.62 1.271 1.271 839760 0.0000 0.0000 Fcntl::__ANON__
Re: Improving dismal performance - Part 1
by jethro (Monsignor) on May 12, 2009 at 20:15 UTC
I'm pretty sure tie is not very fast. Remember that tie means you have to call a subroutine every time you access a tied value. Even the size test ("$dx<@array") inside your for loop might be more expensive than you think.
Tie has its use when you have to jump around in a document that is too large to fit into memory, but you seem to parse the file sequentially. I assume that at least the input file is too big to always fit into memory; otherwise you could simply read it in in one chunk.
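The per-access overhead is easy to see with the core Benchmark module. Here is an illustrative sketch using Tie::StdArray (a pass-through tied array shipped in the core Tie::Array module) instead of Tie::File, so no file is needed; the sizes and workload are made up for demonstration:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Tie::Array;   # provides Tie::StdArray, a pass-through tied array

my @plain = (1 .. 10_000);
tie my @tied, 'Tie::StdArray';
@tied = (1 .. 10_000);

# Every $tied[$_] is a FETCH method call; every use of @tied in scalar
# context is a FETCHSIZE method call. The plain array pays neither cost.
cmpthese(-1, {
    plain => sub { my $s = 0; $s += $plain[$_] for 0 .. $#plain; $s },
    tied  => sub { my $s = 0; $s += $tied[$_]  for 0 .. $#tied;  $s },
});
```

On any perl I have tried, the tied loop is several times slower, and Tie::File adds file I/O and caching on top of that method-call overhead.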
So better to read the file line by line. If you need to do lookaheads (i.e. check $dx+3), use an array as a FIFO, or better, drop the lookaheads. You should be able to do that in most cases if you, for example, parse the file with a state machine.
A state machine is simply a single variable, the state variable. You change this value depending on the lines you see. So a value of 5 could mean 'I just parsed a "record" statement and expect MSC thingies now'. Depending on what you parse, the state variable would then change again or stay in the same state. And the parser would be a big switch or if-then-else construct. Check Wikipedia for more info.
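A minimal sketch of such a state machine for the format above (this is an illustration, not the poster's code; the two states, the regexes, and the trimmed sample input are assumptions based on the data shown earlier in the thread):

```perl
use strict;
use warnings;

# State 0: outside a record. State 1: inside a record, collecting fields.
my $state = 0;
my %fields;
my @out;    # collected output lines

my $sample = <<'END';
Record (1) "MSCGSMRec"
	"MSC_CDR_TYPE" = "MOC"
	Sub Record (1) "AXECallDataRecord"
		"AX_CHARGED_PARTY" = "0"
	End of Sub Record (1)
End of Record (1)
END

open my $fh, '<', \$sample or die $!;
while (my $line = <$fh>) {
    if ($state == 0) {
        # Entering a record switches the state; every other line is ignored.
        $state = 1 if $line =~ /^\s*Record \(\d+\)/;
    }
    else {
        if ($line =~ /^\s*End of Record/) {
            # Leaving the record: emit it sorted and reset.
            push @out, "RECORD\n";
            push @out, "F $_ $fields{$_}\n" for sort keys %fields;
            push @out, ".\n";
            %fields = ();
            $state  = 0;
        }
        elsif ($line =~ /"([^"]+)" = "([^"]*)"/) {
            $fields{$1} = $2;   # a field line: remember it
        }
        # "Sub Record" / "End of Sub Record" lines match neither branch
        # and are simply skipped while in state 1.
    }
}
print @out;
```

No lookahead is needed: each line is read exactly once, and the state variable remembers everything the parser needs to know about where it is.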
UPDATE: Corrected example
I would be willing to bet money on this theory.
Re: Improving dismal performance - Part 1
by tim.bunce (Scribe) on May 12, 2009 at 20:35 UTC
DProf is broken. Use Devel::NYTProf instead.
Re: Improving dismal performance - Part 1
by John M. Dlugosz (Monsignor) on May 12, 2009 at 19:31 UTC
You might try getting it to terminate normally, at least on a small set of test data. Even if it's not doing its full and normal job, at least run through the code you want to profile.
—John
I ran it on an input of 20 "records" (1173 lines).
Here is the output:
Total Elapsed Time = 5.231984 Seconds
User+System Time = 2.941984 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
85.7 0.014 2.524 1037 0.0000 0.0024 Tie::File::PUSH
84.8 0.014 2.496 1037 0.0000 0.0024 Tie::File::SPLICE
83.1 0.133 2.447 1037 0.0001 0.0024 Tie::File::_splice
67.1 1.930 1.974 1037 0.0019 0.0019 Tie::File::_oadjust
10.4 0.038 0.308 2204 0.0000 0.0001 Tie::File::FETCH
9.18 0.020 0.270 2204 0.0000 0.0001 Tie::File::_fetch
8.19 0.019 0.241 1037 0.0000 0.0002 Tie::File::_mtwrite
6.42 0.110 0.189 5 0.0220 0.0377 main::BEGIN
5.20 0.069 0.153 1037 0.0001 0.0001 Tie::File::_downcopy
4.08 0.120 0.120 2074 0.0001 0.0001 Tie::File::_write_record
3.33 0.034 0.098 2204 0.0000 0.0000 Tie::File::Cache::lookup
2.92 0.038 0.086 1172 0.0000 0.0001 Tie::File::Cache::insert
2.69 0.049 0.079 2074 0.0000 0.0000 Tie::File::_cache_flush
2.65 0.078 0.078 2348 0.0000 0.0000 Tie::File::_read_record
2.41 0.039 0.071 3134 0.0000 0.0000 Tie::File::FETCHSIZE
Obviously, Tie::File::PUSH is killing me, and that is what I thought from the beginning as well, but it seems there's no way to do away with this, at least with my simple approach?
Any pointers?
Well, yeah, Acolyte. The only thing you are doing with @arrayOfother is pushing to it. So forget that array and the tie completely. Instead, open a file for writing, and replace each push with a print to that file.
I'm supposing there is a useful reason for tying the input array rather than just reading the whole thing into a real array: the file is too large. So I'd not change that one unless it is a performance problem too.
But @arrayOfother is referenced only 3 times in the code, and it's a push in all cases. That's not a proper use of an array. That's an output stream.
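Sketched minimally (the output filename comes from the original script; the header string and the single field are stubs standing in for what the parsing loop builds, just so the snippet runs on its own):

```perl
use strict;
use warnings;

# Open a plain output stream instead of tying an array to the file.
open my $out, '>', 'mineoutput.converted.to.other'
    or die "Cannot open output: $!";

# Stub data standing in for what the parsing loop builds.
my $recordHeader = "RECORD\n#addkey\n#filename FF\n#input_id 001\n";
my %recordBodyToWriteOut = (MSC_CDR_TYPE => 'MOC');

# Each former "push @arrayOfother, ..." becomes a buffered print:
print $out $recordHeader;
print $out "F $_ $recordBodyToWriteOut{$_}\n"
    for sort keys %recordBodyToWriteOut;
print $out ".\n";

close $out or die "Cannot close output: $!";
```

Unlike a push through Tie::File, which triggers a SPLICE and offset bookkeeping on every call, each print here just appends to perl's output buffer.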
—John
Re: Improving dismal performance - Part 1
by rir (Vicar) on May 12, 2009 at 20:59 UTC
This isn't complete, but shows a common approach.
Be well rir
# XXX lots of possibly magic string literals
$| = 1;
local $_;
while (<DATA>) {
# find next record
next unless /^\s*Record /;
# read a record
$_ = '';
my @record;
do {
s/^ *//;
        push @record, $_ if $_ && $_ !~ /^\s*End of Sub/ && $_ !~ /^\s*Sub/;
        $_ = <DATA>;
    } while ( $_ !~ /^\s*End of Record/ );
# massage the raw data in @record
# XXX incomplete
for my $item ( @record ) {
$item =~ s/" = "/ /;
$item =~ s/"//g;
$item = "F " . $item;
}
@record = sort @record;
# output preamble
print STDOUT "RECORD\n";
print STDOUT "#addkey\n";
print STDOUT "#filename FF\n";
# XXX more?
# output @record
print STDOUT @record;
# output trailer
print STDOUT ".\n";
}
Yes, the script before this one (using Tie) looked exactly like this, and that is still what I use now, after the new version failed to do anything useful in a reasonable time. But I was hoping that maybe I could optimize this code somehow; using Tie makes things so easy, just like looping through an array!
Well, I am not totally out of luck, as I am reading line by line anyway, and not much change would be required to translate it into direct file I/O .. but in case someone comes across a light bulb in their vicinity, keep this node updated!
Re: Improving dismal performance - Part 1
by GrandFather (Saint) on May 13, 2009 at 02:26 UTC
I don't know about improving the performance, but the code can be cleaned up. Consider:
use strict;
use warnings;
my %fieldNameMapping;
my @recentLines;
fetchLines (\@recentLines, 2);
while (fetchLines (\@recentLines, 1)) {
next if($recentLines[-3] !~ /Level \(1\) "([^"]+)"$/);
next if($recentLines[-2] !~ /Level \(2\) "([^"]+)"$/);
next if($recentLines[-1] !~ /Record \(([0-9]+)\) "([^"]+)"$/);
my $recordID = $1;
my %recordBodyToWriteOut ;
my $recordHeader = "RECORD\n"; # First value in the header
$recordHeader .= "#addkey\n#filename FF\n#input_id 001\n";
do {
if(nextLine (\@recentLines) =~ /"([^"]+)" = "([^"]+)"$/) {
if($1 eq 'MSC_CDR_TYPE') {
                $recordHeader .= "#input_type $2\n#output_id\n#output_type $2\n#source_id SRC\n";
}
if(exists $fieldNameMapping{$1}) {
$recordBodyToWriteOut{$fieldNameMapping{$1}} = $2;
} else {
$recordBodyToWriteOut{$1} = $2;
}
}
} until
($recentLines[-3] =~ /End of Record \(${recordID}\)$/)
&& ($recentLines[-2] =~ /End of Level \(2\)$/)
&& ($recentLines[-1] =~ /End of Level \(1\)$/);
my $recordTrailer = ".\n"; # First value in the Trailer
fetchLines (\@recentLines, 2);
print $recordHeader;
foreach my $key (sort keys %recordBodyToWriteOut) {
print "F $key " . $recordBodyToWriteOut { $key } . "\n";
}
print $recordTrailer;
}
sub fetchLines {
my ($lines, $toFetch) = @_;
my $fetched = 0;
while ($toFetch-- > 0) {
my $newLine = nextLine ($lines);
return $fetched > 0 if ! defined $newLine;
++$fetched;
}
return 1;
}
sub nextLine {
my ($lines) = @_;
my $line = <DATA>;
return undef if ! defined $line;
push @$lines, $line;
shift @$lines while @$lines > 3;
return $line;
}
__DATA__
Start of Data
**********************************************************************
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (1) "MSCGSMRec"
"MSC_CDR_TYPE" = "MOC"
"MSC_CALL_START_TIME" = "20090122105929"
"MSC_CALL_END_TIME" = "20090122105944"
"MSC_CALL_DURATION" = "15"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_TRANSLATED_TON" = "1"
End of Sub Record (1)
End of Record (1)
End of Level (2)
End of Level (1)
Level (1) "COMMONRec"
Level (2) "MSCCDR"
Record (2) "MSCGSMRec"
"MSC_CDR_TYPE" = "MTC"
"MSC_PARTIAL_TYPE" = "0"
"MSC_CALL_START_TIME" = "20090122105927"
"MSC_CALL_END_TIME" = "20090122105945"
"MSC_CALL_DURATION" = "18"
"MSC_PARTIAL_INDICATOR" = "S"
Sub Record (1) "AXECallDataRecord"
"AX_DISCONNECT_PARTY" = "1"
"AX_CHARGED_PARTY" = "0"
"AX_SWITCH_IDENTITY" = "0001"
"AX_RELATED_NUMBER" = "7F4595"
"AX_FIRST_CALLED_LOC_INFO" = "25F010233203BE"
End of Sub Record (1)
End of Record (2)
End of Level (2)
End of Level (1)
Prints:
RECORD
#addkey
#filename FF
#input_id 001
#input_type MOC
#output_id
#output_type MOC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_TRANSLATED_TON 1
F MSC_CALL_DURATION 15
F MSC_CALL_END_TIME 20090122105944
F MSC_CALL_START_TIME 20090122105929
F MSC_CDR_TYPE MOC
F MSC_PARTIAL_INDICATOR S
.
RECORD
#addkey
#filename FF
#input_id 001
#input_type MTC
#output_id
#output_type MTC
#source_id SRC
F AX_CHARGED_PARTY 0
F AX_DISCONNECT_PARTY 1
F AX_FIRST_CALLED_LOC_INFO 25F010233203BE
F AX_RELATED_NUMBER 7F4595
F AX_SWITCH_IDENTITY 0001
F MSC_CALL_DURATION 18
F MSC_CALL_END_TIME 20090122105945
F MSC_CALL_START_TIME 20090122105927
F MSC_CDR_TYPE MTC
F MSC_PARTIAL_INDICATOR S
F MSC_PARTIAL_TYPE 0
.
Notes:
our is not needed at all. Use my instead. I doubt our does what you think it does.
Bail as soon as possible instead of nesting conditional code. The logic is easier to follow and easier to get right.
Declare variables where they first get a real value. That is, make their scope as small as possible.
Don't provide default values for variables unless it really is a default. In particular, Perl creates arrays and hashes empty and scalars with the value undef; you don't have to do that explicitly.
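For reference, the difference the first note is getting at (a toy illustration, not code from the thread):

```perl
use strict;
use warnings;

our $config = 'visible everywhere';   # package global: an alias for $main::config
my  $buffer = 'this scope only';      # lexical: exists only in its enclosing scope

sub peek {
    # A package global can be reached (and clobbered) from any code in the
    # package, even inside subs that never mention it by its short name.
    return $main::config;
}

print peek(), "\n";
```

A lexical declared with my cannot be reached that way, which is exactly the containment you usually want in a script like this.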
True laziness is hard work
Re: Improving dismal performance - Part 1
by John M. Dlugosz (Monsignor) on May 12, 2009 at 19:28 UTC
From the PerlMonks Markup page:
If you find it necessary (and you should make every effort to ensure it is NOT) to create a very long post, please use <readmore> … </readmore> tags. Should your post make it to the front page, the readmore segment collapses to a link by which the interested reader can see the entire post.
You may want to edit your message and insert such a tag.
Re: Improving dismal performance - Part 1
by apl (Monsignor) on May 12, 2009 at 19:48 UTC
Replace multiple concatenations with a single statement. That is, replace
$recordHeader = "RECORD\n"; # First value in the header
$recordHeader .= "#addkey\n";
$recordHeader .= "#filename FF\n";
$recordHeader .= "#input_id 001\n";
with
$recordHeader = "RECORD\n#addkey\n#filename FF\n#input_id 001\n";
No difference in speed .. but made the change in the source :-)