Re^2: Rosetta Code: Long List is Long (Updated Solutions)


Re^3: Rosetta Code: Long List is Long (Updated Solutions - dualvar)
by eyepopslikeamosquito (Archbishop) on Dec 05, 2022 at 22:28 UTC

I noticed it, but (wrongly) assumed that applying its remarkable two-sort trick to my original solution would have the same effect.

For completeness, I created llil2d.pl (shown below) by applying your dualvar array trick to my original llil2.pl two-sort solution, with minimal changes. I can confirm that it is indeed about 3 seconds faster and uses slightly less memory. Despite using Perl for 20 years, I'd never heard of dualvar before (update: oops, turns out I had :). Huge kudos to marioroy for unearthing this!

llil2d start
get_properties : 11 secs
sort + output  : 22 secs
total          : 33 secs
Memory use (Windows Private Bytes): 2,824,184K
(slightly lower than 2,896,104K for llil2.pl)

Here is my adjusted llil2d.pl:

# llil2d.pl. Remarkable dualvar version based on [marioroy]'s concoction.
# Example run: perl llil2d.pl tt1.txt tt2.txt tt3.txt >out.txt

use strict;
use warnings;
use feature qw{say};
use Scalar::Util qw{dualvar};

# ----------------------------------------------------------------------
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name, a TAB character, and a non-negative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
#   camel\t42
#   pearl\t94
#   dromedary\t69
# and tt2.txt containing:
#   camel\t8
#   hello\t12345
#   dromedary\t1
# returns this hashref:
#   $hash_ret{"camel"}     = 50
#   $hash_ret{"dromedary"} = 70
#   $hash_ret{"hello"}     = 12345
#   $hash_ret{"pearl"}     = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
#   hello\t12345
#   pearl\t94
#   dromedary\t70
#   camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# ----------------------------------------------------------------------

# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties
{
   my $files = shift;    # in:  reference to a list of LLiL-format files
   my %hash_ret;         # out: reference to a hash of properties
   for my $fname ( @{$files} ) {
      open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
      while (<$fh>) {
         chomp;
         my ($word, $count) = split /\t/;
         $hash_ret{$word} += $count;
      }
      close($fh) or die "error: close '$fname': $!";
   }
   return \%hash_ret;
}

# ----------------- mainline -------------------------------------------

@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;

warn "llil2d start\n";
my $tstart1 = time;
my $href    = get_properties( \@llil_files );
my $tend1   = time;
my $taken1  = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";

my $tstart2 = time;

my @data;
while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) }

# Using two sorts is waaay faster than one! (see [id://11148545])
for my $key ( sort { $b <=> $a } sort @data ) {
   say "$key\t" . (0 + $key);
}
my $tend2  = time;
my $taken2 = $tend2 - $tstart2;
my $taken  = $tend2 - $tstart1;

warn "sort + output  : $taken2 secs\n";
warn "total          : $taken secs\n";
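To see the two-sort idiom from the listing above in isolation, here is a minimal sketch run on the small example from the LLiL specification comment, plus one extra name (llama, my addition) to exercise the tie-break. The helper name sorted_lines is hypothetical and not part of llil2d.pl. Note that the idiom relies on the sort being stable, so that the name order produced by the inner sort survives the outer numeric sort; Perl's merge sort behaves that way in practice.

```perl
use strict;
use warnings;
use Scalar::Util qw{dualvar};

# Hypothetical helper (not part of llil2d.pl): turn a name => count hash
# into lines sorted by count descending, then by name ascending.
sub sorted_lines {
   my $href = shift;

   # Each element is the name in string context and the count in numeric context.
   my @data = map { dualvar( $href->{$_}, $_ ) } keys %{$href};

   # Inner sort: default string comparison, so names ascending.
   # Outer sort: numeric comparison, so counts descending.
   # Assumes a stable sort, so equal counts keep their name order.
   my @sorted = sort { $b <=> $a } sort @data;

   return map { "$_\t" . ( 0 + $_ ) } @sorted;
}

my %counts = ( camel => 50, dromedary => 70, hello => 12345, pearl => 94, llama => 50 );
print "$_\n" for sorted_lines( \%counts );
```

This should print hello, pearl, dromedary, camel and llama in that order, with camel ahead of llama because both have a count of 50.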

Update: llil2grt.pl is about three seconds faster than llil2d.pl above, while using slightly less memory.

References Added Later

Dualvar:

Schizophrenic var by bliako
ListUtil.xs (including dualvar source code)
Create your own dualvars by brian_d_foy at The Effective Perler

Some ideas to try in the future:

salva's super-fast CPAN Sort modules: Sort::Packed, Sort::Key, Sort::Key::Radix, Sort::Key::Multi
Judy arrays (e.g. see Re^3: Unsigned 64-bit integer as Judy key and Re^2: Rosetta Code: Long List is Long)
Simple Perl versions using built-in Perl sort idioms, especially GRT (this is now done)
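For anyone unfamiliar with GRT (the Guttman-Rosler Transform), here is a rough sketch of the general idea. It is not the code from llil2grt.pl, which is not shown in this node. The helper name grt_sorted_lines is hypothetical, and the sketch assumes every count fits in an unsigned 32-bit integer. Each record is packed into a single string whose plain string order matches the desired order, so the default sort can run without a comparison block.

```perl
use strict;
use warnings;

# Hypothetical helper: GRT-style sort of a name => count hash.
# Assumption: all counts fit in an unsigned 32-bit integer.
sub grt_sorted_lines {
   my $href = shift;

   # Pack (0xFFFFFFFF - count) as a big-endian 32-bit integer followed by the
   # name. Inverting the count makes an ascending string sort yield counts in
   # descending order; the trailing name breaks ties in ascending order.
   my @packed = map { pack( 'N a*', 0xFFFFFFFF - $href->{$_}, $_ ) } keys %{$href};

   # Default sort compares the packed strings directly, no sort block needed.
   my @sorted = sort @packed;

   # Unpack each record and undo the inversion to recover the count.
   return map {
      my ( $inv, $name ) = unpack( 'N a*', $_ );
      "$name\t" . ( 0xFFFFFFFF - $inv );
   } @sorted;
}

my %counts = ( camel => 50, dromedary => 70, hello => 12345, pearl => 94, llama => 50 );
print "$_\n" for grt_sorted_lines( \%counts );
```

The trade-off is extra packing and unpacking work in exchange for a sort that never calls back into Perl code for comparisons.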

