PerlMonks  

Re^3: sort behavior explanation required

by AnomalousMonk (Archbishop)
on Apr 20, 2014 at 18:35 UTC ( [id://1082961] )


in reply to Re^2: sort behavior explanation required
in thread sort behavior explanation required

As always, the Devil lurks in the details. Can we see bench_sort.pl?

Replies are listed 'Best First'.
Re^4: sort behavior explanation required
by Laurent_R (Canon) on Apr 20, 2014 at 19:00 UTC
Re^4: sort behavior explanation required
by Laurent_R (Canon) on Apr 21, 2014 at 22:42 UTC
    So did you see the Devil? Any error somewhere? The code I posted is not optimal; it is part of a much larger project I had two years ago, with many other sorting algorithms being tested, hence the double function calls which are not really needed here. But, aside from that, do you see anything wrong in what I am saying?

      I'm embarrassed to say that, after asking you to post your code and after you went to the trouble to do so, I haven't had a chance to give it the long, careful consideration it deserves.

      I have, however, given it a quick look. I haven't seen the Devil yet, but he is wily. A few things strike my eye.

      • There's no test against the built-in sort to check that your sorts are correct.
      • I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using. See the sort pragma.
      • Are there any other pathological data sets to throw at these two sorts?
      I hope I will shortly be able to give this the time it merits.

        I have just tried the case where you have a lot of duplicates. For this, I populated the array to be sorted this way (with $max set to 4999):
        $w[$_] = int rand(20) foreach (0..$max);
        This is the result on two runs:
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.78 usr +  0.00 sys =  3.78 CPU) @ 21.19/s (n=80)
             quick: 11 wallclock secs (10.70 usr +  0.00 sys = 10.70 CPU) @  7.48/s (n=80)
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.76 usr +  0.00 sys =  3.76 CPU) @ 21.28/s (n=80)
             quick: 10 wallclock secs (10.65 usr +  0.00 sys = 10.65 CPU) @  7.51/s (n=80)
        And now, the same using the following array initialization:
        $w[$_] = int rand(10) foreach (0..$max);
        And the results:
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.76 usr +  0.00 sys =  3.76 CPU) @ 21.28/s (n=80)
             quick: 22 wallclock secs (22.48 usr +  0.00 sys = 22.48 CPU) @  3.56/s (n=80)
        So, this seems to be another pathological case for quick sort, but much less significant than the two I previously tested.

        BTW, testing this, I noticed that there was a small "off by one" mistake in the code I posted previously. To work really properly, the code needs to be changed this way:

        my $max = 4999;
        $w[$_] = int rand(10) foreach (0..$max);
        That led to an occasional "Use of uninitialized value" warning, but it does not invalidate any of the benchmark conclusions.

        There's no test against the built-in sort to check that your sorts are correct.

        True, but for this post, I reduced a much larger testing program to what was needed to make my case. If you look at my code, you will see commented-out statements printing the result of the sort for each algorithm. I have tested many cases and haven't seen anything incorrect. I do believe that my sort implementations are correct: possibly not optimal, but correct.

        I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using.

        You are right again. I was thinking of doing it, but did not take the time. I might come up with that if I have the energy to do it. It is almost midnight in my timezone now; I might give it a try, but only if it does not take too long.

        Are there any other pathological data sets to throw at these two sorts?

        I do not know. The two cases I have shown are the best-known ones and the only ones I have really tested (but, as I said in one of my posts above, they are much more common than one might think, because you quite often have to sort data that is already almost sorted). I *think* that an array with very many duplicates also shows pathological behavior for quick sort, but I don't think I have checked.

        I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using. See the sort pragma.

        Now I have done it. First, this is the code (for the quick sort case):

        use warnings;
        use sort '_qsort';
        use Benchmark qw(timethese);

        my @x;
        my $max = 40000;
        $x[$_] = int rand(2e5) foreach (0..$max);
        my @w = 1..$max;
        my @z = reverse 1..$max;

        sub sort_random   { my @v = sort {$a <=> $b} (@x); }
        sub sort_sorted   { my @v = sort {$a <=> $b} (@w); }
        sub sort_reversed { my @v = sort {$a <=> $b} (@z); }

        timethese (800, {
            'random'  => \&sort_random,
            'sorted'  => \&sort_sorted,
            'reverse' => \&sort_reversed,
        } );
        Of course, to benchmark the much more efficient built-in sort functions meaningfully, I used a larger array and more iterations, so don't compare the benchmark figures produced here with the previous ones. OK, here is the result with quick sort activated:
        $ perl bench_sort2.pl
        Benchmark: timing 800 iterations of random, reverse, sorted...
            random: 14 wallclock secs (14.48 usr +  0.00 sys = 14.48 CPU) @ 55.26/s (n=800)
           reverse: 14 wallclock secs (13.57 usr +  0.00 sys = 13.57 CPU) @ 58.94/s (n=800)
            sorted: 14 wallclock secs (13.57 usr +  0.00 sys = 13.57 CPU) @ 58.94/s (n=800)
        So no significant difference between the three cases. I think that the quick sort implementation now shuffles the input data before actually starting the sort, in order to defuse most pathological behaviors, so that there is no significant difference between sorted and unsorted arrays. I am almost sure I've read that somewhere, but I can't remember where. I wish I had a Perl 5.6 installation running somewhere; we could probably see the difference, since I think this randomization of the input order was introduced with Perl 5.8.

        Now the same test with merge sort (commenting out the "use sort '_qsort';" line, no other change to the code).

        $ perl bench_sort2.pl
        Benchmark: timing 800 iterations of random, reverse, sorted...
            random: 11 wallclock secs (11.31 usr +  0.00 sys = 11.31 CPU) @  70.73/s (n=800)
           reverse:  2 wallclock secs ( 1.98 usr +  0.00 sys =  1.98 CPU) @ 403.84/s (n=800)
            sorted:  2 wallclock secs ( 1.94 usr +  0.00 sys =  1.94 CPU) @ 413.44/s (n=800)
        On sorted and reverse-sorted data, merge sort does much better than quick sort, and also much better than merge sort on random data. All right, I will not try to draw any further conclusions from these tests; I do not feel like digging into the Perl interpreter's C code to check what is really going on.

        Update Apr 25, 2014, 09:45 UTC: I knew I had read it somewhere. The official documentation on the sort pragma (sort) states: "In Perl 5.8 and later, quicksort defends against quadratic behaviour by shuffling large arrays before sorting."
