As mentioned in the update to the root node, I found a fatal flaw in the algorithm: the necessary size of @queue does not scale as I expected. For some pathological cases, you need to store nearly the entire result array in order to maintain a correct result. I've created a version that produces correct output by traversing contours of i + j = constant and caching N*M/2 results. Unfortunately, because the real speed benefit came from using an insertion sort on a fixed-length queue, this also kills my great performance. The code (with 1 print per sum):
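sub solution_1 {
    # queue solution
    # O(N*M) memory: caches up to N*M/2 sums in @queue.
    # Worst-case O((N*M)^2) time from the linear-scan insertion.
    my ($list_ref1, $list_ref2) = @_;
    my @list1;
    my @list2;
    # Make @list1 the shorter of the two lists.
    if (@$list_ref1 <= @$list_ref2) {
        @list1 = @$list_ref1;
        @list2 = @$list_ref2;
    } else {
        @list1 = @$list_ref2;
        @list2 = @$list_ref1;
    }
    # Sentinel: the largest possible sum, so the insertion scan always stops.
    my @queue = ( $list1[-1]+$list2[-1] );
    # Walk every contour i + j = k, up to the last valid pair of indices.
    for my $k (0 .. $#list1 + $#list2) {
        for my $i (0 .. $k) {
            last if $i >= @list1;   # i only grows, so the contour is finished
            my $j = $k - $i;
            next if $j >= @list2;   # j shrinks as i grows, so keep going
            # Emit the smallest cached sum once half of all sums are cached.
            print OUT (shift(@queue),"\n") if @queue >= 0.5*@list1*@list2;
            my $sum = $list1[$i]+$list2[$j];
            my $count = 0;
            $count++ until $sum <= $queue[$count];
            splice @queue, $count, 0, $sum;
        }
    }
    pop @queue;   # drop the sentinel
    print OUT "$_\n" for @queue;
}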
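To make the traversal order concrete, here is a minimal sketch that only enumerates the index pairs visited along contours of i + j = constant, without computing any sums. The helper name contour_pairs and the clamped loop bounds are illustrative assumptions, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the index pairs
# [i, j] in contour order for a list of $n elements and a list of $m elements.
sub contour_pairs {
    my ($n, $m) = @_;
    my @pairs;
    return @pairs if $n < 1 || $m < 1;   # an empty list yields no pairs
    for my $k (0 .. $n + $m - 2) {
        # Clamp i so that both i and j = k - i stay inside their lists.
        my $lo = $k - ($m - 1) > 0 ? $k - ($m - 1) : 0;
        my $hi = $k < $n - 1 ? $k : $n - 1;
        for my $i ($lo .. $hi) {
            push @pairs, [ $i, $k - $i ];
        }
    }
    return @pairs;
}

# Show the visiting order for a 2-element and a 3-element list.
print join(' ', map { "($_->[0],$_->[1])" } contour_pairs(2, 3)), "\n";
```

Each contour mixes small and large indices, which is why a sum on a later contour can still be smaller than sums already cached, and why the cache has to be so large.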
And the benchmarks:
Benchmark: timing 100 iterations of Baseline, LR_1, LR_2, Queue...
  Baseline: 0.555567 wallclock secs ( 0.55 usr + 0.00 sys = 0.55 CPU) @ 181.82/s (n=100)
            (warning: too few iterations for a reliable count)
      LR_1: 18.9476 wallclock secs (18.94 usr + 0.00 sys = 18.94 CPU) @ 5.28/s (n=100)
      LR_2: 70.0044 wallclock secs (70.00 usr + 0.00 sys = 70.00 CPU) @ 1.43/s (n=100)
     Queue: 132.26 wallclock secs (132.25 usr + 0.00 sys = 132.25 CPU) @ 0.76/s (n=100)
            Rate   Queue    LR_2    LR_1 Baseline
Queue    0.756/s      --    -47%    -86%    -100%
LR_2      1.43/s     89%      --    -73%     -99%
LR_1      5.28/s    598%    270%      --     -97%
Baseline   182/s  23945%  12627%   3344%       --
Benchmark: timing 100000 iterations of Baseline, LR_1, LR_2, Queue...
  Baseline: 1.61376 wallclock secs ( 1.60 usr + 0.01 sys = 1.61 CPU) @ 62111.80/s (n=100000)
      LR_1: 7.19492 wallclock secs ( 7.19 usr + 0.01 sys = 7.20 CPU) @ 13888.89/s (n=100000)
      LR_2: 8.1213 wallclock secs ( 8.12 usr + 0.00 sys = 8.12 CPU) @ 12315.27/s (n=100000)
     Queue: 4.26218 wallclock secs ( 4.26 usr + 0.00 sys = 4.26 CPU) @ 23474.18/s (n=100000)
            Rate    LR_2    LR_1   Queue Baseline
LR_2     12315/s      --    -11%    -48%     -80%
LR_1     13889/s     13%      --    -41%     -78%
Queue    23474/s     91%     69%      --     -62%
Baseline 62112/s    404%    347%    165%       --