need to optimize my sub routine

by convenientstore (Pilgrim)
on Feb 19, 2008 at 21:50 UTC ( [id://668892] )

convenientstore has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

Now that my program runs, I need to optimize it. Currently this program runs through 120 MB of files (spread
across 11 files), parses out the information, and spits out the info in the format the user needs to see. It takes about 3 minutes, and when I run dprofpp, I see that main() is taking most of the time.

I copy my main routine below, and as you can see, it's not doing anything fancy.
My questions are:
1) Will rewriting this in C (just the parsing part) prove to be much faster? How much faster?
2) Is there anything I can do to improve my main function?
[root@myserver]# dprofpp
$Monfile is tmon.out
Exporter::Heavy::heavy_export
AutoLoader::__ANON__[/usr/lib/perl5/5.8.8/AutoLoader.pm:96]
Total Elapsed Time = 192.5636 Seconds
  User+System Time = 193.6643 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 42.5   82.48 198.71      1   82.486 198.71  main::main
 29.8   57.81 56.779 194654   0.0003 0.0003  Text::CSV_XS::Parse
 12.9   24.98 23.929 194654   0.0001 0.0001  Text::CSV_XS::fields
 9.65   18.68 17.619 194654   0.0001 0.0001  Text::CSV_XS::new
 8.11   15.71 70.368 194654   0.0001 0.0004  Text::CSV_XS::parse
 2.42   4.687  3.660 187290   0.0000 0.0000  main::extract
 0.89   1.719  0.649 194654   0.0000 0.0000  AutoLoader::__ANON__[/usr/lib/perl5/5.8.8/AutoLoader.pm:96]
 0.03   0.060  0.079      4   0.0150 0.0198  main::BEGIN
 0.01   0.010  0.010      1   0.0100 0.0099  Exporter::as_heavy
 0.01   0.010  0.010     20   0.0005 0.0005  Getopt::Long::BEGIN
 0.00   0.000 -0.000      5   0.0000      -  strict::import
 0.00   0.000 -0.000      3   0.0000      -  Text::CSV_XS::BEGIN
 0.00   0.000 -0.000      7   0.0000      -  vars::import
 0.00   0.000 -0.000      1   0.0000      -  DynaLoader::bootstrap
 0.00   0.000 -0.000      1   0.0000      -  DynaLoader::dl_load_flags

sub main {
    for (@files) {
        open( NOW, "$directory/$_" ) || die "you suck\n";
        while (<NOW>) {
            my ( %rec, %HoH );
            my $p;
            chomp;
            $t_counter++;
            my $csv = Text::CSV_XS->new;
            $csv->parse($_);
            my @fields = $csv->fields;
            if (/^STOP/) {
                @rec{@attrs_sto} = @fields[ 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            elsif (/^START/) {
                @rec{@attrs_sta} = @fields[ 0, 1, 11, 15, 28, 29, 31, 53 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            elsif (/^ATTEMPT/) {
                @rec{@attrs_att} = @fields[ 0, 1, 11, 13, 17, 30, 31, 33, 57 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            else {
                $ncounter++;
            }
            push @data, {%HoH};
        }
        close NOW;
    }
}

sub extract {
    return join( '-', ( split( /:/, $_[0] ) )[1], $_[1] );
}

1653222 -rw-r--r-- 1 1036 101 10773472 Feb 14 12:00 100A8B9.F
1653223 -rw-r--r-- 1 1036 101 11110758 Feb 14 12:05 100A8BA.F
1653224 -rw-r--r-- 1 1036 101 11106128 Feb 14 12:10 100A8BB.F
1653225 -rw-r--r-- 1 1036 101 10851079 Feb 14 12:15 100A8BC.F
1653226 -rw-r--r-- 1 1036 101 10758864 Feb 14 12:20 100A8BD.F
1653227 -rw-r--r-- 1 1036 101 10665272 Feb 14 12:25 100A8BE.F
1653228 -rw-r--r-- 1 1036 101 10722126 Feb 14 12:30 100A8BF.F
1653229 -rw-r--r-- 1 1036 101 10204733 Feb 14 12:35 100A8C0.F
1653230 -rw-r--r-- 1 1036 101 10292893 Feb 14 12:40 100A8C1.F
1653231 -rw-r--r-- 1 1036 101  9990122 Feb 14 12:45 100A8C2.F
1653232 -rw-r--r-- 1 1036 101 10073364 Feb 14 12:50 100A8C3.F
1653233 -rw-r--r-- 1 1036 101 10188466 Feb 14 12:55 100A8C4.F

Replies are listed 'Best First'.
Re: need to optimize my sub routine
by pc88mxer (Vicar) on Feb 19, 2008 at 22:32 UTC
    This should speed it up by about 10%:

    my $csv = Text::CSV_XS->new;

    You can move this outside of your loop. You only need to create it once.

    Also, I don't think rewriting this in C will gain you much. You'll just be re-implementing a lot of what perl already does, and you'll have to spend the time to debug and verify it. The code as it stands is very readable and maintainable. If you need a factor of 2 (or even 3 or 4) speedup, just run the process in parallel on another box. It'll be less work for you and a lot more cost-effective for your company, even if they have to buy more hardware.

    Another option is to process the files incrementally as they are generated (if you're not already doing this).
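
    If the parallel route appeals, here is a minimal sketch using core fork to spread the input files over a few worker processes. The glob pattern, the $workers count, and the process_file() helper are hypothetical stand-ins, and note that forked children do not share memory, so each worker would have to write its results somewhere (a per-worker output file, say) rather than into a shared @data:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @files   = glob "*.F";    # stand-in for the real file list
        my $workers = 4;

        # Deal the files round-robin into one bucket per worker.
        my @buckets;
        push @{ $buckets[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

        my @pids;
        for my $bucket ( grep { defined } @buckets ) {
            defined( my $pid = fork ) or die "fork failed: $!";
            if ( $pid == 0 ) {    # child: parse its share of the files
                process_file($_) for @$bucket;
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid( $_, 0 ) for @pids;    # parent: wait for all workers

        sub process_file {
            my ($file) = @_;
            # ... the existing parsing loop, writing results per worker ...
        }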

Re: need to optimize my sub routine
by BrowserUk (Patriarch) on Feb 19, 2008 at 22:55 UTC

    That subroutine has large amounts of cut-and-paste code that could be reduced, but mostly that wouldn't affect the performance.

    Two changes that might improve it are:

    1. Three occurrences of $HoH{$p} = {%rec}; could become $HoH{$p} = \%rec; without affecting the program, while avoiding replication of existing hashes.
    2. One occurrence of push @data, {%HoH}; can become push @data, \%HoH; for the same reason.

    This is total speculation given the number of variables used within this sub but not declared there, so take it with a pinch of salt, but the subroutine may be reducible to something along the lines of:

    sub main {
        my $csv = Text::CSV_XS->new;
        for (@files) {
            open( NOW, "$directory/$_" ) || die "you suck\n";
            while (<NOW>) {
                chomp;
                my ( %rec, %HoH );
                $t_counter++;
                $csv->parse($_);
                my @fields = $csv->fields;
                my ( $attrsRef, @selectedFields ) =
                      /^STOP/    ? ( \@attrs_sto, 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 )
                    : /^START/   ? ( \@attrs_sta, 0, 1, 11, 15, 28, 29, 31, 53 )
                    : /^ATTEMPT/ ? ( \@attrs_att, 0, 1, 11, 13, 17, 30, 31, 33, 57 )
                    :              ();
                unless ($attrsRef) { $ncounter++; next; }
                @rec{ @$attrsRef } = @fields[ @selectedFields ];
                my $p = $rec{_i_pstn_trunk}
                      ? extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} )
                      : $rec{_e_pstn_trunk}
                      ? extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} )
                      : do { $ncounter++; next };
                $HoH{$p} = \%rec;
                push @data, \%HoH;
            }
            close NOW;
        }
    }

    The main unknown is the significance of the variable $ncounter. (BTW: naming a variable $somethingcounter is ... that it is a counter is obvious from its usage, but one letter to signify what it is counting, when it is declared at a higher level, does not convey much information.)

    On similar lines, having arrays called @attrs_sto, @attrs_sta, and @attrs_att, presumably STOP, START and ATTEMPT, minimises the important information (to just one character sometimes) and maximises the less important common element.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      hmm, did not see this post, and it is very interesting
      I will definitely change them as well and try it
      btw, isn't $HoH{$p} = {%rec} the same as $HoH{$p} = \%rec?
        btw, isn't $HoH{$p} = {%rec} the same as $HoH{$p} = \%rec?

        No.

        The former creates a new anonymous hash by copying the contents of %rec into it, and then stores a reference to that anonymous hash into $HoH{$p}.

        The latter, just stores a reference to the existing %rec into $HoH{$p}.

        As %rec is declared within the while loop, you will get an empty hash each time the loop iterates, so there is no need to copy its contents; just take a reference.

        No copying means less work, so it is more efficient. How much difference it makes will depend on how large the hashes are. On the evidence of the script, probably not much individually, but over many iterations it may be significant enough to be worthwhile.
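
        A tiny self-contained demonstration of the difference:

            use strict;
            use warnings;

            my %rec = ( a => 1 );

            my $copy = {%rec};    # new anonymous hash, contents copied
            my $ref  = \%rec;     # reference to the existing hash

            $rec{a} = 2;
            print $copy->{a}, "\n";    # prints 1 - the copy is unaffected
            print $ref->{a},  "\n";    # prints 2 - the reference sees the change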


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: need to optimize my sub routine
by grinder (Bishop) on Feb 19, 2008 at 22:43 UTC

    You have used Devel::DProf, that's a good start. It has told you that most of the time is spent in main(). Unfortunately, and perhaps not surprisingly, main also appears to contain most of the code.

    To obtain a more fine-grained picture of what is going on, you need to employ line-based profiling. For this, Devel::SmallProf will come in very handy. This will give you an idea of which lines are being hit hard, and which lines are expensive to run. This may give you sufficient information to reconsider your algorithm.
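
    For reference, a run looks something like this (a sketch, assuming Devel::SmallProf is installed from CPAN; it writes per-line hit counts and times to smallprof.out):

        $ perl -d:SmallProf yourscript.pl
        $ sort -k 2nr,2 smallprof.out | head    # hottest lines first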

    • another intruder with the mooring in the heart of the Perl

      thank you pc88mxer and hipowls
      I have moved it out of the loop and shaved off close to 30 sec!!
      grinder, I will also check your suggestion
      All in all, I will just go back to the developer and tell him that it might just be faster to run it on faster hardware
      real    2m45.342s
      user    2m43.562s
      sys     0m1.684s

      sub main {
          my $csv = Text::CSV_XS->new;
          for (@files) {
              open( NOW, "$directory/$_" ) || die "you suck\n";
              while (<NOW>) {
                  my ( %rec, %HoH );
                  my $p;
                  chomp;
                  $t_counter++;
                  $csv->parse($_);
                  my @fields = $csv->fields;
                  # ... rest of the loop as before ...
Re: need to optimize my sub routine
by hipowls (Curate) on Feb 19, 2008 at 22:35 UTC

    I'd move my $csv = Text::CSV_XS->new; to outside the loops. You don't need to create one for every line you read.

Re: need to optimize my sub routine
by graff (Chancellor) on Feb 20, 2008 at 02:48 UTC
    One thing that "dprofpp" is not telling you about is total memory usage, and page faults -- which are what you get when parts of the app's storage need to be "paged out" to the system's "virtual-memory / swap" disk file. In other words, things will slow down a lot when the process needs more storage than is available in RAM, because it takes a lot of disk i/o to keep all the process data intact.

    I don't know about Windows, but on unix/linux/macosx you can run "top" in a separate terminal while your process is active and watch what happens in terms of memory consumption and page faults, in addition to overall CPU load.

    Perl data structures (like your AoHoH "@data") take up a lot more space than you might expect. 120 MB of data in disk files might take up two or three (or more?) times that amount inside the perl process, because of all the overhead associated with managing scalar values and nested structures.
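
    To get a feel for that overhead (a sketch, assuming the CPAN module Devel::Size is available):

        use strict;
        use warnings;
        use Devel::Size qw( total_size );

        # A record that is a few dozen bytes as CSV text...
        my %rec = map { ( "field$_" => "value$_" ) } 1 .. 10;

        # ...occupies considerably more once it is a Perl hash.
        print total_size( \%rec ), " bytes\n";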

    So figure out whether you really need all your file data to be (virtual) memory resident at the same time in that one huge AoHoH structure. Maybe only some of the data from each csv row needs to be kept, or maybe the processing can be done serially (i.e. while reading each file)?
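
    A sketch of that serial approach (handle_record() is a hypothetical stand-in for whatever per-record output the real program produces):

        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new;
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while ( my $row = $csv->getline($io) ) {
            # Keep only the fields that are actually needed.
            my %rec = ( type => $row->[0], id => $row->[1] );
            handle_record( \%rec );    # %rec goes out of scope each pass,
        }                              # so memory use stays bounded
        close $io;

        sub handle_record {
            my ($rec) = @_;
            print "$rec->{type}: $rec->{id}\n";
        }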

Re: need to optimize my sub routine
by Tux (Canon) on Feb 20, 2008 at 08:55 UTC

    Simplest speedup should be to move from reading line-by-line with parse() to using getline(). This will pay off even more when you allow binary and/or embedded newlines. I did a small benchmark on my machine:

    /home/merijn> cat test.pl
    #!/pro/bin/perl

    use strict;
    use warnings;

    use Benchmark qw( cmpthese );
    use Text::CSV_XS;
    use IO::Handle;

    my $csv = Text::CSV_XS->new;
    my @f;

    sub diamond {
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while (<$io>) {
            $csv->parse($_);
            @f = $csv->fields;
        }
    } # diamond

    sub intern {
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while (my $row = $csv->getline($io)) {
            @f = @$row;
        }
    } # intern

    cmpthese( -5, {
        "diamond" => \&diamond,
        "getline" => \&intern,
    });

    /home/merijn> wc -l test.csv
    12000 test.csv
    /home/merijn> perl test.pl
              Rate diamond getline
    diamond 6.89/s      --    -39%
    getline 11.3/s     64%      --
    /home/merijn>

    You can then do your /^STOP/-style matches against the first field of the row that getline() returns.


    Enjoy, Have FUN! H.Merijn
      hey, thanks, this is another challenge for me
      I will look at your code, figure out how to incorporate it into mine, and let you know the results
      This is fantastic!!!!
      This has definitely improved the time, but I lost some accuracy.
      Perhaps that's due to my inaccurate calculation in another sub, but this cut my time down to 115 sec. Amazing
      sub main {
          my $csv = Text::CSV_XS->new;
          for (@files) {
              open my $io, "<", "$directory/$_" or die "you suck\n";
              DOIT: while (my $it = $csv->getline($io)) {
                  my ( %rec, %HoH );
                  my $p;
                  chomp;
                  $t_counter++;
                  my @fields = @$it;
                  if ( $fields[0] =~ /^STOP/ ) {
                      @rec{@attrs_sto} = @fields[ 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  elsif ( $fields[0] =~ /^START/ ) {
                      @rec{@attrs_sta} = @fields[ 0, 1, 11, 15, 28, 29, 31, 53 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  elsif ( $fields[0] =~ /^ATTEMPT/ ) {
                      @rec{@attrs_att} = @fields[ 0, 1, 11, 13, 17, 30, 31, 33, 57 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  else {
                      next DOIT;
                  }
                  push @data, \%HoH;
              }
          }
      }

        I don't know how well the optimizer works, but if you really need the last millisecond, don't copy the fields out; keep working with the reference. Also drop the chomp; you're not working with $_ anymore.

        DOIT: while (my $it = $csv->getline($io)) {
            my ( %rec, %HoH, $p );
          # chomp;                 # No need to chomp anymore!
            $t_counter++;
          # my @fields = @$it;     # Don't make a copy
            if ( $it->[0] =~ /^STOP/ ) {    # use the reference
                :
            }
            elsif ( $it->[0] ...

        Enjoy, Have FUN! H.Merijn
