Re: Counting and Filtering Words From File

by tybalt89 (Monsignor)
on May 10, 2020 at 00:34 UTC ( [id://11116631]=note )


in reply to Counting and Filtering Words From File

Try this. It slurps the input and does the lc() and the tr/// only once, on the whole text.

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11116620
use warnings;

my @excluded = qw( a about although also an and another are as at be been
  before between but by can do during for from has how however in into is it
  many may more most etc );

local $/;
my %count;
$count{$_}++ for split ' ', (lc <>) =~ tr!-'@~,.()?*%/[]="!!dr;
delete @count{@excluded};
print "$count{$_} $_\n" for
  sort { $count{ $b } <=> $count{ $a } || $a cmp $b } keys %count;
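
For anyone unfamiliar with the two idioms used here, this is a minimal sketch on an inline string rather than a file. The helper name count_words, the sample sentence, and the short exclusion list are made up for illustration and are not part of the original post. It shows tr///dr returning a cleaned copy of the lowercased text, and a hash-slice delete dropping all excluded words in one statement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration only (not from the original post).
sub count_words {
    my ($text, @excluded) = @_;
    my %count;

    # /d deletes the listed characters, /r returns the modified copy
    # instead of changing its operand (needs Perl 5.14 or later).
    $count{$_}++ for split ' ', lc($text) =~ tr/,.!?//dr;

    # Hash slice: removes every excluded key in a single delete.
    delete @count{@excluded};

    return \%count;
}

my $count = count_words("The cat and the hat. A cat!", qw(a and));

# Highest count first, ties broken alphabetically.
print "$count->{$_} $_\n"
  for sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
```

Counting first and deleting afterwards means the exclusion list is consulted once per excluded word, not once per word of input.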

Re^2: Counting and Filtering Words From File
by hippo (Bishop) on May 10, 2020 at 10:42 UTC

    Independently I have arrived at a similar solution.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %excluded = map { $_ => 1 } qw(
        a about although also an and another are as at be been before
        between but by can do during for from has how however in into is it
        many may more most etc
    );
    my %count;

    {
        local $/ = "";
        while (<>) {
            tr {A-Z':@~,.()?*%/[]="-}{a-z}d;
            foreach (split) {
                $count{$_}++ unless $excluded{$_};
            }
        }
    }

    foreach my $word (sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
        print "$count{$word} $word\n";
    }

    Since the requirement is to lowercase only the ASCII letters, I've folded the lowercasing into the tr///, and I've gone for paragraph mode instead of a single slurp, just in case :-)
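
    A minimal sketch of those two points, run against an in-memory filehandle so it needs no input file. The helper name paragraphs_cleaned and the sample text are hypothetical, and the character list is shortened for readability. With /d, the A-Z range maps onto a-z while the remaining characters in the search list have no counterpart and are deleted; setting $/ to the empty string makes each read return one blank-line-separated paragraph.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration only.
sub paragraphs_cleaned {
    my ($text) = @_;

    # Read from a string as if it were a file.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    local $/ = "";    # paragraph mode: records end at one or more blank lines

    my @out;
    while (my $para = <$fh>) {
        # A-Z is translated to a-z; ",.!?" has no replacement, so /d deletes it.
        $para =~ tr{A-Z,.!?}{a-z}d;

        # Collapse all whitespace (including newlines) to single spaces.
        push @out, join ' ', split ' ', $para;
    }
    close $fh;
    return @out;
}

my @paras = paragraphs_cleaned("Hello, World!\nSecond line.\n\n\nNew Paragraph?\n");
print "$_\n" for @paras;
```

    Paragraph mode keeps only one paragraph in memory at a time, which is the trade-off against slurping the whole file.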

    Both solutions run in similar times, roughly 100x faster than the original code:

    $ time ./11116620.pl < Frankenstein.txt > orig.out

    real    0m9.381s
    user    0m9.366s
    sys     0m0.007s
    $ time ./wordcount.pl < Frankenstein.txt > hippo.out

    real    0m0.089s
    user    0m0.081s
    sys     0m0.008s
    $ time ./tybalt.pl < Frankenstein.txt > tybalt.out

    real    0m0.090s
    user    0m0.084s
    sys     0m0.005s

    There are some minor differences between all three outputs but without a tighter spec these aren't overly concerning.
