http://qs321.pair.com?node_id=961493

perl.j has asked for the wisdom of the Perl Monks concerning the following question:

So I have this text file:

Bob Joe Steve Joe Bob Mark Joe Steve Fred Fred Fred Joe

And I am reading and counting the names with this program:

use 5.12.4; use strict; use warnings; my %wordcount; open(FILE, "names.txt") or die "Could not open file: $!"; while(my $line = <FILE>) { chomp($line); my @words = split(' ', $line); foreach my $word(@words) { $wordcount{$word} += 1; } } sub hashDescend { $wordcount{$b} <=> $wordcount{$a}; } foreach my $key (sort hashDescend (keys(%wordcount)) { print "Word: $key \t\t\t\t\t\t Count: " . ($wordcount{$key}) . "\n +"; }

But now I want to change it up. I am going to be running this program in a huge repeating loop, so speed counts. So I want to make this faster.

I was thinking about making one big @globalarray and a @temparray and pushing from the @temparray into the global one. I am thinking this will make it a very tiny bit faster, but anything will do at this point. So how can I change my program as described, but still have the same output?

--perl.j

Replies are listed 'Best First'.
Re: Rewrite Program Using Arrays
by JavaFan (Canon) on Mar 25, 2012 at 16:39 UTC
    So I want to make this faster.
    Then don't sort. Really. The rest of your program is essentially linear. Sorting is N log N. That's the bottleneck of your program. Any speedup you'll do in the rest of the program are pointless if your data size increases: the sort will dominate.

      If he needs his output sorted though? Certainly the built-in sort won't be slower than using some external sort later?

      Then again, the n in sort's O(n) is the number of word forms which doesn't increase linearly with corpus size. Some law for it can probably be derived from Zipf's Law, I suspect it's something like log(corpus_size). So the sort should take an average of log(n)*log(log(n)) over corpus size which is not that bad.

      Thanks for the advice, but I just want to clear this up. I am not comfortable using the push and pop functions, so this is also going to be a learning experience. I understand that there are a lot of things I can do to speed the process, but I want to see what this will do still... The problem is, I don't know how to do it.
      --perl.j

        There is no use for push/pop here, though. This is a pretty straightforward problem: counting the occurrences of words in a file. That means putting the words in a hash as the keys and incrementing the value when a word is seen. That's what you've done, so you're already using the best algorithm. You can gain some speed by cutting out any unnecessary assignments, as I showed with the benchmarks in my other post; but you certainly won't gain anything by using a less straightforward algorithm or by shuffling your words around through extra arrays. There really aren't "a lot of things you can do to speed the process." Building a hash and sorting on the values is as fast as it's gonna get, unless you want to rewrite it in C.

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.

Re: Rewrite Program Using Arrays
by mbethke (Hermit) on Mar 25, 2012 at 16:28 UTC
    Pushing data around from one array to the other is very unlikely to make things faster, probably the opposite. One thing you could so is write foreach my $word (split $line) to avoid the extra array, but this is something I suspect the Perl compiler might optimize all by itself already, you have top try whether it makes a difference. hashDescend() as a simple code block would probably be slightly faster, but other than that there's little room for optimization as far as I can see. Is this really too slow for your job?
      I know it might not make it faster, but it's still a learning experience... Can you tell me how to do it?
      --perl.j

        Well, you could get rid of your loop by using map, but my benchmark shows no gain there. Eliminating some intermediate variables does speed things up, though. If you get rid of @words and $word completely, you'll gain some time, regardless of which looping method you use:

        #!/usr/bin/env perl use Modern::Perl; use Benchmark qw(:all); # create a line with words and spaces to split on my $line = ''; $line .= ('a', 'b', 'c', 'd', ' ')[rand(5)] for (1..1000); cmpthese( 100000, { 'loop with vars' => \&loopwithvars, 'loop sans vars' => \&loopsansvars, 'with map' => \&withmap, }); sub loopwithvars { my %wordcount; my @words = split ' ', $line; for my $word (@words){ $wordcount{$word}++; } } sub loopsansvars { my %wordcount; for (split ' ', $line){ $wordcount{$_}++; } } sub withmap { my %wordcount; map { $wordcount{$_}++ } split ' ', $line; } ###### results ###### Rate loop with vars with map loop sans vars loop with vars 11338/s -- -15% -17% with map 13369/s 18% -- -2% loop sans vars 13624/s 20% 2% --

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.

        Hm, I suppose you meant something like this:
        my @globalarray; foreach my $filename (@ARGV) { open my $fh, '<', $filename or die "$filename: $!"; my @temparray = <$fh>; push @globalarray, @temparray; close $fh; }
        Guess you could learn the use of "push", that's about the only thing. It's used to append a scalar or a whole list to the end of an array. "pop" does the opposite and takes one scalar off. shift and unshift work the same, just on the start of the array.
Re: Rewrite Program Using Arrays
by Anonymous Monk on Mar 25, 2012 at 16:26 UTC

    But now I want to change it up. I am going to be running this program in a huge repeating loop, so speed counts. So I want to make this faster.

    And do what, what will this new program do?

Re: Rewrite Program Using Arrays
by educated_foo (Vicar) on Mar 25, 2012 at 18:05 UTC
    Why did you put this useless bit of superstition at the top?
    use 5.12.4;
    One way to likely make your program faster would be to use an older Perl, like 5.6. But your superstition prevents that.

      How about to document the version of Perl on which he tested it?

        In that case I'd prefer to see that in the documentation or a comment. By "using" a version number you are not merely saying "I've tested this only with this particular version." What you are effectively saying is more like "I believe this will not work with anything older than X."

        If I see "use 5.12.4;" I assume there was something wrong with even 5.12.3, something that prevents the code to work correctly. So probably I would not bother trying to run it under 5.10.x. On the other hand with a comment about the version tested under, I would try it and only if I end up getting an error I'd take the possible version differences into account.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

      say/state/switch, duh

        I don't think the OP is using any of them.