Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

sort_top

by dada (Chaplain)
on Mar 08, 2004 at 12:07 UTC ( [id://334775]=CUFP: print w/replies, xml ) Need Help??

this is a small sub that can be used in situations where you have a large array to be sorted, but you really need only the first N elements of the sorted array.

you can think of it as something like the SQL SELECT * FROM foo ORDER BY bar LIMIT 10 (MySQL, Postgres) or SELECT TOP 10 * FROM foo ORDER BY bar (MSSQL).

of course, this has been thought of as an optimization, but to deserve the name of optimization it should really be implemented in C (XS/Inline/whatever you want). it could still perform quickier than a full sort, I suspect, particularly if your comparison function is very expensive.

SYNOPSIS

my @topten = sort_top(10, sub { $a <=> $b }, @huge_array);
sub sort_top { my($top, $func, @arr) = @_; my @top = sort $func @arr[0..$top-1]; for my $i ($top..$#arr) { my $x = -1; for (my $t = $#top; $t >= 0; $t--) { local ($a, $b) = ($arr[$i], $top[$t]); last if($func->() == 1); $x = $t; } if($x != -1) { splice(@top, $x, 0, $arr[$i]); pop(@top); } } return @top; }

Replies are listed 'Best First'.
sort_top explained
by dada (Chaplain) on Mar 09, 2004 at 11:51 UTC
    ok, I will do a little dissection of the code (which is rather terse as it is). first, let's put the line numbers for future reference:
    1: sub sort_top { 2: my($top, $func, @arr) = @_; 3: my @top = sort $func @arr[0..$top-1]; 4: for my $i ($top..$#arr) { 5: my $x = -1; 6: for (my $t = $#top; $t >= 0; $t--) { 7: local ($a, $b) = ($arr[$i], $top[$t]); 8: last if($func->() == 1); 9: $x = $t; 10: } 11: if($x != -1) { 12: splice(@top, $x, 0, $arr[$i]); 13: pop(@top); 14: } 15: } 16: return @top; 17: }
    the basic idea is very, very simple. I'll use actual numbers instead of variables, in the discussion below, to help focus on things concretely.

    let's say that I need the top 3 elements of a 10 elements array. if I sort 3 elements from the array, whatever elements they are, I have a "tentative" top list. now, I can do this optimization: if an element is greater (in sort terms) than my third one, it will never appear in my top 3 list, because it is at least fourth in sort order of the array (and this is the key: I don't care where that element goes, once I know that it doesn't show up in my top 3). so I need to compare each element with the third of my tentative list, and only if this comparation is in favour of the new element, compare it with the second one, and repeat for the first one.

    this is, more or less, the theory. and now let's see what the code does:

    line 3 is where the sub itself starts, and we start by sorting (traditionally :-) the first $top elements of the array.

    on line 4, we walk the rest of the array.

    line 5 initializes a flag that we will need to check if the element of the array need to be added to our @top list.

    line 6 setup a walking of the @top array in reverse order (from the last element to the first one).

    line 7 setup $a and $b for the comparison function (that's what sort does), by taking the current element from @arr and the current (in reverse order, remember) element from @top.

    line 8 calls the comparison function and exits the loop of line 6 (last) if the element from @arr comes after the element in @top (the return value of the comparison function is 1 if $a > $b, 0 if $a == $b or -1 if $a < $b). the 1 means that $a (the element from @arr) is greater (comes after) $b (the element from @top).

    line 9 saves the current index in @top in our flag variable, $x.

    line 11 checks the value of the flag: if it contains an index in @top, the substitution is made, otherwise the value is simply ignored.

    lines 12-13 performs the actual substitution: the value from @arr is inserted into @top at position $x (splice), and then the last element from @top (which now doesn't belong anymore to our top-list) is discarded (pop).

    finally, line 16 returns the top-list to our caller.

    I will also make a little example and try to follow, step by step, what it does. we start with this array:

    my @array = (8, 7, 1, 5, 4, 2, 10, 6, 9, 3);
    and we want the top 3 elements sorted numerically. so we grab the first 3 ones, and sort them:
    my @top = (1, 7, 8);
    now, the fourth element in the array is 5. we compare it with 8, and see that it is lower. we save the position (3) and compare it with 7. it is still lower, so we save the new position (2) and compare it with 1. now it is greater, but we have a position saved, which is 2, so we insert it in the second place: (1, 5, 7, 8) and we pop off the last element: (1, 5, 7).

    after 4 (which, with the same reasoning as above, goes in position 2), we have: (1, 4, 5, 7) which gets reduced to: (1, 4, 5).

    after 2 (same as above), we have: (1, 2, 4).

    now see what happens with 10: it is greater than our current last element (4), so we don't save any position, and simply move on to the next element.

    the same happens with 6 and 9. the last one (3), is lower than 4 (so we save position number 3) but greater than 2, so it becomes our new third position.

    and that's it, our top-list now contains: (1, 2, 3). this is what we return, and the world is happy :-)

    cheers,
    Aldo

    King of Laziness, Wizard of Impatience, Lord of Hubris

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://334775]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (5)
As of 2024-03-29 15:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found