need to optimize my sub routine

by convenientstore (Pilgrim)
on Feb 19, 2008 at 21:50 UTC ( [id://668892] )

convenientstore has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

Now that my program runs, I need to optimize it. Currently this program runs through 120 MB of files (spread
across 11 files), parses out the information, and spits out the info in the format the user needs to see. It takes about 3 minutes, and when I run dprofpp, I see that main() is taking most of the time.

I copy my main routine below, and as you can see, it's not doing anything fancy.
My questions are:
1) Will rewriting this in C (just the parsing part) prove to be much faster? How much faster?
2) Is there anything I can do to improve my main function?
[root@myserver]# dprofpp
$Monfile is tmon.out
Exporter::Heavy::heavy_export
AutoLoader::__ANON__[/usr/lib/perl5/5.8.8/AutoLoader.pm:96]
Total Elapsed Time = 192.5636 Seconds
  User+System Time = 193.6643 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 42.5   82.48 198.71      1   82.486 198.71  main::main
 29.8   57.81 56.779 194654   0.0003 0.0003  Text::CSV_XS::Parse
 12.9   24.98 23.929 194654   0.0001 0.0001  Text::CSV_XS::fields
 9.65   18.68 17.619 194654   0.0001 0.0001  Text::CSV_XS::new
 8.11   15.71 70.368 194654   0.0001 0.0004  Text::CSV_XS::parse
 2.42   4.687  3.660 187290   0.0000 0.0000  main::extract
 0.89   1.719  0.649 194654   0.0000 0.0000  AutoLoader::__ANON__[/usr/lib/perl5/5.8.8/AutoLoader.pm:96]
 0.03   0.060  0.079      4   0.0150 0.0198  main::BEGIN
 0.01   0.010  0.010      1   0.0100 0.0099  Exporter::as_heavy
 0.01   0.010  0.010     20   0.0005 0.0005  Getopt::Long::BEGIN
 0.00   0.000 -0.000      5   0.0000      -  strict::import
 0.00   0.000 -0.000      3   0.0000      -  Text::CSV_XS::BEGIN
 0.00   0.000 -0.000      7   0.0000      -  vars::import
 0.00   0.000 -0.000      1   0.0000      -  DynaLoader::bootstrap
 0.00   0.000 -0.000      1   0.0000      -  DynaLoader::dl_load_flags

sub main {
    for (@files) {
        open( NOW, "$directory/$_" ) || die "you suck\n";
        while (<NOW>) {
            my ( %rec, %HoH );
            my $p;
            chomp;
            $t_counter++;
            my $csv = Text::CSV_XS->new;
            $csv->parse($_);
            my @fields = $csv->fields;
            if (/^STOP/) {
                @rec{@attrs_sto} = @fields[ 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            elsif (/^START/) {
                @rec{@attrs_sta} = @fields[ 0, 1, 11, 15, 28, 29, 31, 53 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            elsif (/^ATTEMPT/) {
                @rec{@attrs_att} = @fields[ 0, 1, 11, 13, 17, 30, 31, 33, 57 ];
                if ( $rec{_i_pstn_trunk} ) {
                    $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                elsif ( $rec{_e_pstn_trunk} ) {
                    $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                    $HoH{$p} = {%rec};
                }
                else {
                    $ncounter++;
                }
            }
            else {
                $ncounter++;
            }
            push @data, {%HoH};
        }
        close NOW;
    }
}

sub extract {
    return join( '-', ( split( /:/, $_[0] ) )[1], $_[1] );
}

1653222 -rw-r--r-- 1 1036 101 10773472 Feb 14 12:00 100A8B9.F
1653223 -rw-r--r-- 1 1036 101 11110758 Feb 14 12:05 100A8BA.F
1653224 -rw-r--r-- 1 1036 101 11106128 Feb 14 12:10 100A8BB.F
1653225 -rw-r--r-- 1 1036 101 10851079 Feb 14 12:15 100A8BC.F
1653226 -rw-r--r-- 1 1036 101 10758864 Feb 14 12:20 100A8BD.F
1653227 -rw-r--r-- 1 1036 101 10665272 Feb 14 12:25 100A8BE.F
1653228 -rw-r--r-- 1 1036 101 10722126 Feb 14 12:30 100A8BF.F
1653229 -rw-r--r-- 1 1036 101 10204733 Feb 14 12:35 100A8C0.F
1653230 -rw-r--r-- 1 1036 101 10292893 Feb 14 12:40 100A8C1.F
1653231 -rw-r--r-- 1 1036 101  9990122 Feb 14 12:45 100A8C2.F
1653232 -rw-r--r-- 1 1036 101 10073364 Feb 14 12:50 100A8C3.F
1653233 -rw-r--r-- 1 1036 101 10188466 Feb 14 12:55 100A8C4.F

Replies are listed 'Best First'.
Re: need to optimize my sub routine
by pc88mxer (Vicar) on Feb 19, 2008 at 22:32 UTC
    This should speed it up by about 10%:

    my $csv = Text::CSV_XS->new;

    You can move this outside of your loop. You only need to create it once.

    Also, I don't think rewriting this in C will gain you much. You'll just be re-implementing a lot of what perl already does, and you'll have to spend the time to debug and verify it. The code as it stands is very readable and maintainable. If you need a factor of 2 (or even 3 or 4) speedup, just run the process in parallel on another box. It'll be less work for you and a lot more cost-effective for your company, even if they have to buy more hardware.

    Another option is to process the files incrementally as they are generated (if you're not already doing this).
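
    If the parallel route appeals, here is a minimal sketch using core fork to spread the input files over a few worker processes. The glob pattern, the $workers count, and the process_file() helper are hypothetical stand-ins, and note that forked children do not share memory, so each worker would have to write its results somewhere (a per-worker output file, say) rather than into a shared @data:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @files   = glob "*.F";    # stand-in for the real file list
        my $workers = 4;

        # Deal the files round-robin into one bucket per worker.
        my @buckets;
        push @{ $buckets[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

        my @pids;
        for my $bucket ( grep { defined } @buckets ) {
            defined( my $pid = fork ) or die "fork failed: $!";
            if ( $pid == 0 ) {    # child: parse its share of the files
                process_file($_) for @$bucket;
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid( $_, 0 ) for @pids;    # parent: wait for all workers

        sub process_file {
            my ($file) = @_;
            # ... the existing parsing loop, writing results per worker ...
        }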

Re: need to optimize my sub routine
by BrowserUk (Patriarch) on Feb 19, 2008 at 22:55 UTC

    That subroutine has large amounts of cut-and-paste code that could be reduced, but mostly that wouldn't affect the performance.

    Two changes that might improve it are:

    1. Three occurrences of $HoH{$p} = {%rec}; could become $HoH{$p} = \%rec; without affecting the program, while avoiding replication of existing hashes.
    2. One occurrence of push @data, {%HoH}; can become push @data, \%HoH; for the same reason.

    This is total speculation given the number of variables used within this sub but not declared there, so take it with a pinch of salt, but the subroutine may be reducible to something along the lines of:

    sub main {
        my $csv = Text::CSV_XS->new;
        for (@files) {
            open( NOW, "$directory/$_" ) || die "you suck\n";
            while (<NOW>) {
                chomp;
                my ( %rec, %HoH );
                $t_counter++;
                $csv->parse($_);
                my @fields = $csv->fields;
                my ( $attrsRef, @selectedFields ) =
                      /^STOP/    ? ( \@attrs_sto, 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 )
                    : /^START/   ? ( \@attrs_sta, 0, 1, 11, 15, 28, 29, 31, 53 )
                    : /^ATTEMPT/ ? ( \@attrs_att, 0, 1, 11, 13, 17, 30, 31, 33, 57 )
                    :              ();
                unless ($attrsRef) { $ncounter++; next; }
                @rec{ @$attrsRef } = @fields[ @selectedFields ];
                my $p = $rec{_i_pstn_trunk}
                      ? extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} )
                      : $rec{_e_pstn_trunk}
                      ? extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} )
                      : do { $ncounter++; next };
                $HoH{$p} = \%rec;
                push @data, \%HoH;
            }
            close NOW;
        }
    }

    The main unknown is the significance of the variable $ncounter. (BTW: naming a variable $somethingcounter is ... that it is a counter is obvious from its usage, but one letter to signify what it is counting, when it is declared at a higher level, does not convey much information.)

    On similar lines, having arrays called @attrs_sto, @attrs_sta, and @attrs_att, presumably STOP, START and ATTEMPT, minimises the important information (to just one character sometimes) and maximises the less important common element.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      hmm, did not see this post, and it is very interesting
      I will definitely change them as well and try it
      btw, isn't $HoH{$p} = {%rec} the same as $HoH{$p} = \%rec?
        btw, isn't $HoH{$p} = {%rec} the same as $HoH{$p} = \%rec?

        No.

        The former creates a new anonymous hash by copying the contents of %rec into it, and then stores a reference to that anonymous hash into $HoH{$p}.

        The latter, just stores a reference to the existing %rec into $HoH{$p}.

        As %rec is declared within the while loop, you will get an empty hash each time the loop iterates, so there is no need to copy its contents; just take a reference.

        No copying means less work, so it is more efficient. How much difference it makes will depend on how large the hashes are. On the evidence of the script, probably not much individually, but over many iterations it may be significant enough to be worthwhile.
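
        A tiny self-contained demonstration of the difference:

            use strict;
            use warnings;

            my %rec = ( a => 1 );

            my $copy = {%rec};    # new anonymous hash, contents copied
            my $ref  = \%rec;     # reference to the existing hash

            $rec{a} = 2;
            print $copy->{a}, "\n";    # prints 1 - the copy is unaffected
            print $ref->{a},  "\n";    # prints 2 - the reference sees the change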


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: need to optimize my sub routine
by grinder (Bishop) on Feb 19, 2008 at 22:43 UTC

    You have used Devel::DProf, that's a good start. It has told you that most of the time is spent in main(). Unfortunately, and perhaps not surprisingly, main also appears to contain most of the code.

    To obtain a more fine-grained picture of what is going on, you need to employ line-based profiling. For this, Devel::SmallProf will come in very handy. This will give you an idea of which lines are being hit hard, and which lines are expensive to run. This may give you sufficient information to reconsider your algorithm.
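
    For reference, a run looks something like this (a sketch, assuming Devel::SmallProf is installed from CPAN; it writes per-line hit counts and times to smallprof.out):

        $ perl -d:SmallProf yourscript.pl
        $ sort -k 2nr,2 smallprof.out | head    # hottest lines first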

    • another intruder with the mooring in the heart of the Perl

      thank you pc88mxer and hipowls
      I have moved it out of the loop and shaved off close to 30 sec!!
      grinder, I will also check your suggestion
      All in all, I will just go back to the developer and tell him that it might just be faster to run it on faster hardware
      real    2m45.342s
      user    2m43.562s
      sys     0m1.684s

      sub main {
          my $csv = Text::CSV_XS->new;
          for (@files) {
              open( NOW, "$directory/$_" ) || die "you suck\n";
              while (<NOW>) {
                  my ( %rec, %HoH );
                  my $p;
                  chomp;
                  $t_counter++;
                  $csv->parse($_);
                  my @fields = $csv->fields;
                  # ... rest of the loop as before ...
Re: need to optimize my sub routine
by hipowls (Curate) on Feb 19, 2008 at 22:35 UTC

    I'd move my $csv = Text::CSV_XS->new; to outside the loops. You don't need to create one for every line you read.

Re: need to optimize my sub routine
by graff (Chancellor) on Feb 20, 2008 at 02:48 UTC
    One thing that "dprofpp" is not telling you about is total memory usage, and page faults -- which are what you get when parts of the app's storage need to be "paged out" to the system's "virtual-memory / swap" disk file. In other words, things will slow down a lot when the process needs more storage than is available in RAM, because it takes a lot of disk i/o to keep all the process data intact.

    I don't know about Windows, but on unix/linux/macosx you can run "top" in a separate terminal while your process is active and watch what happens in terms of memory consumption and page faults, in addition to overall CPU load.

    Perl data structures (like your AoHoH "@data") take up a lot more space than you might expect. 120 MB of data in disk files might take up two or three (or more?) times that amount inside the perl process, because of all the overhead associated with managing scalar values and nested structures.
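
    To get a feel for that overhead (a sketch, assuming the CPAN module Devel::Size is available):

        use strict;
        use warnings;
        use Devel::Size qw( total_size );

        # A record that is a few dozen bytes as CSV text...
        my %rec = map { ( "field$_" => "value$_" ) } 1 .. 10;

        # ...occupies considerably more once it is a Perl hash.
        print total_size( \%rec ), " bytes\n";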

    So figure out whether you really need all your file data to be (virtual) memory resident at the same time in that one huge AoHoH structure. Maybe only some of the data from each csv row needs to be kept, or maybe the processing can be done serially (i.e. while reading each file)?
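
    A sketch of that serial approach (handle_record() is a hypothetical stand-in for whatever per-record output the real program produces):

        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new;
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while ( my $row = $csv->getline($io) ) {
            # Keep only the fields that are actually needed.
            my %rec = ( type => $row->[0], id => $row->[1] );
            handle_record( \%rec );    # %rec goes out of scope each pass,
        }                              # so memory use stays bounded
        close $io;

        sub handle_record {
            my ($rec) = @_;
            print "$rec->{type}: $rec->{id}\n";
        }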

Re: need to optimize my sub routine
by Tux (Canon) on Feb 20, 2008 at 08:55 UTC

    Simplest speedup should be to move from reading line-by-line with parse() to using getline(). This will pay off even more when you allow binary and/or embedded newlines. I did a small benchmark on my machine:

    /home/merijn> cat test.pl
    #!/pro/bin/perl

    use strict;
    use warnings;

    use Benchmark qw( cmpthese );
    use Text::CSV_XS;
    use IO::Handle;

    my $csv = Text::CSV_XS->new;
    my @f;

    sub diamond {
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while (<$io>) {
            $csv->parse($_);
            @f = $csv->fields;
        }
    } # diamond

    sub intern {
        open my $io, "<", "test.csv" or die "test.csv: $!";
        while (my $row = $csv->getline($io)) {
            @f = @$row;
        }
    } # intern

    cmpthese( -5, {
        "diamond" => \&diamond,
        "getline" => \&intern,
    });

    /home/merijn> wc -l test.csv
    12000 test.csv
    /home/merijn> perl test.pl
              Rate diamond getline
    diamond 6.89/s      --    -39%
    getline 11.3/s     64%      --
    /home/merijn>

    You can then do your /^STOP/-style matches against the first field of the row that getline() returns.


    Enjoy, Have FUN! H.Merijn
      hey, thanks, this is another challenge for me
      I will look at your code, figure out how to incorporate it into mine, and let you know the results
      This is fantastic!!!!
      This has definitely improved the time, but I lost some accuracy.
      Perhaps that's due to my inaccurate calculation in another sub, but this cut my time down to 115 sec. Amazing
      sub main {
          my $csv = Text::CSV_XS->new;
          for (@files) {
              open my $io, "<", "$directory/$_" or die "you suck\n";
              DOIT: while (my $it = $csv->getline($io)) {
                  my ( %rec, %HoH );
                  my $p;
                  chomp;
                  $t_counter++;
                  my @fields = @$it;
                  if ( $fields[0] =~ /^STOP/ ) {
                      @rec{@attrs_sto} = @fields[ 0, 1, 13, 14, 16, 20, 33, 34, 36, 67 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  elsif ( $fields[0] =~ /^START/ ) {
                      @rec{@attrs_sta} = @fields[ 0, 1, 11, 15, 28, 29, 31, 53 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  elsif ( $fields[0] =~ /^ATTEMPT/ ) {
                      @rec{@attrs_att} = @fields[ 0, 1, 11, 13, 17, 30, 31, 33, 57 ];
                      if ( $rec{_i_pstn_trunk} ) {
                          $p = extract( $rec{_i_pstn_circuit}, $rec{_i_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      elsif ( $rec{_e_pstn_trunk} ) {
                          $p = extract( $rec{_e_pstn_circuit}, $rec{_e_pstn_trunk} );
                          $HoH{$p} = \%rec;
                      }
                      else {
                          next DOIT;
                      }
                  }
                  else {
                      next DOIT;
                  }
                  push @data, \%HoH;
              }
          }
      }

        I don't know how well the optimizer works, but if you really need the last millisecond, don't copy the fields out; keep working with the reference. Also drop the chomp; you're not working with $_ anymore.

        DOIT: while (my $it = $csv->getline($io)) {
            my ( %rec, %HoH, $p );
          # chomp;                 # No need to chomp anymore!
            $t_counter++;
          # my @fields = @$it;     # Don't make a copy
            if ( $it->[0] =~ /^STOP/ ) {    # use the reference
                :
            }
            elsif ( $it->[0] ...

        Enjoy, Have FUN! H.Merijn
