PerlMonks  

Re^3: sort behavior explanation required

by AnomalousMonk (Archbishop)
on Apr 20, 2014 at 18:35 UTC ( [id://1082961] )


in reply to Re^2: sort behavior explanation required
in thread sort behavior explanation required

As always, the Devil lurks in the details. Can we see bench_sort.pl?

Replies are listed 'Best First'.
Re^4: sort behavior explanation required
by Laurent_R (Canon) on Apr 20, 2014 at 19:00 UTC
Re^4: sort behavior explanation required
by Laurent_R (Canon) on Apr 21, 2014 at 22:42 UTC
    So did you see the Devil? Any error somewhere? The code I posted is not optimal; it is part of a much larger project I had two years ago, with many other sorting algorithms being tested, hence the double function calls which are not really needed here. But, aside from that, do you see anything wrong in what I am saying?

      I'm embarrassed to say that, after asking you to post your code and after you went to the trouble to do so, I haven't had a chance to give it the long, careful consideration it deserves.

      I have, however, given it a quick look. I haven't seen the Devil yet, but he is wily. A few things strike my eye.

      • There's no test against the built-in sort to check that your sorts are correct.
      • I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using. See the sort pragma.
      • Are there any other pathological data sets to throw at these two sorts?
      I hope I will shortly be able to give this the time it merits.

        I have just tried the case where you have a lot of duplicates. For this, I populated the array to be sorted this way (with $max set to 4999):
        $w[$_] = int rand(20) foreach (0..$max);
        This is the result on two runs:
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.78 usr +  0.00 sys =  3.78 CPU) @ 21.19/s (n=80)
             quick: 11 wallclock secs (10.70 usr +  0.00 sys = 10.70 CPU) @  7.48/s (n=80)
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.76 usr +  0.00 sys =  3.76 CPU) @ 21.28/s (n=80)
             quick: 10 wallclock secs (10.65 usr +  0.00 sys = 10.65 CPU) @  7.51/s (n=80)
        And now, the same using the following array initialization:
        $w[$_] = int rand(10) foreach (0..$max);
        And the results:
        $ perl bench_sort.pl
        Benchmark: timing 80 iterations of merge, quick...
             merge:  4 wallclock secs ( 3.76 usr +  0.00 sys =  3.76 CPU) @ 21.28/s (n=80)
             quick: 22 wallclock secs (22.48 usr +  0.00 sys = 22.48 CPU) @  3.56/s (n=80)
        So, this seems to be another pathological case for quick sort, but much less significant than the two I previously tested.

        BTW, testing this, I noticed that there was a small "off by one" mistake in the code I posted previously. To work really properly, the code needs to be changed this way:

        my $max = 4999;
        $w[$_] = int rand(10) foreach (0..$max);
        That led to an occasional "Use of uninitialized value" warning, but it does not invalidate any of the benchmark conclusions.

        There's no test against the built-in sort to check that your sorts are correct.

        True, but for this post, I reduced a much larger testing program to what was needed to make my case. If you look at my code, you will see commented-out statements printing the result of the sort for each algorithm. I have tested many cases and haven't seen anything incorrect. I do believe that my sort implementations are correct: possibly not optimal, but correct.

        I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using.

        You are right again. I was thinking of doing it, but did not take the time. I might come up with that if I have the energy to do it. It is almost midnight in my timezone now; I might give it a try, but only if it does not take too long.

        Are there any other pathological data sets to throw at these two sorts?

        I do not know. The two cases I have shown are the best-known ones and the only ones I have really tested (but, as I said in one of my posts above, they are much more common than one might think, because you quite often have to sort data that is already almost sorted). I *think* that an array with very many duplicates also shows pathological behavior for quick sort, but I don't think I have checked.

        I would have been interested to see timing comparisons of the two different built-in sorts that are available, quick and merge, versus the random and pathological data sets you're using. See the sort pragma.

        Now I have done it. First, this is the code (for the quick sort case):

        use warnings;
        use sort '_qsort';
        use Benchmark qw(timethese);

        my @x;
        my $max = 40000;
        $x[$_] = int rand(2e5) foreach (0..$max);
        my @w = 1..$max;
        my @z = reverse 1..$max;

        sub sort_random   { my @v = sort {$a <=> $b} (@x); }
        sub sort_sorted   { my @v = sort {$a <=> $b} (@w); }
        sub sort_reversed { my @v = sort {$a <=> $b} (@z); }

        timethese (800, {
            'random'  => \&sort_random,
            'sorted'  => \&sort_sorted,
            'reverse' => \&sort_reversed,
        } );
        Of course, to benchmark the much more efficient built-in sort functions meaningfully, I used a larger array and more iterations, so don't compare the benchmark figures produced here with the previous ones. OK, here is the result with quick sort activated:
        $ perl bench_sort2.pl
        Benchmark: timing 800 iterations of random, reverse, sorted...
            random: 14 wallclock secs (14.48 usr +  0.00 sys = 14.48 CPU) @ 55.26/s (n=800)
           reverse: 14 wallclock secs (13.57 usr +  0.00 sys = 13.57 CPU) @ 58.94/s (n=800)
            sorted: 14 wallclock secs (13.57 usr +  0.00 sys = 13.57 CPU) @ 58.94/s (n=800)
        So no significant difference between the three cases. I think that the quick sort implementation now shuffles the input data before actually starting the sort, in order to defuse most pathological behaviors, so that there is no significant difference between sorted and unsorted arrays. I am almost sure I've read that somewhere, but I can't remember where. I wish I had a Perl 5.6 installation running somewhere; we could probably see the difference, since I think this randomization of the input order was introduced with Perl 5.8.

        Now the same test with merge sort (commenting out the "use sort '_qsort';" line, no other change to the code).

        $ perl bench_sort2.pl
        Benchmark: timing 800 iterations of random, reverse, sorted...
            random: 11 wallclock secs (11.31 usr +  0.00 sys = 11.31 CPU) @  70.73/s (n=800)
           reverse:  2 wallclock secs ( 1.98 usr +  0.00 sys =  1.98 CPU) @ 403.84/s (n=800)
            sorted:  2 wallclock secs ( 1.94 usr +  0.00 sys =  1.94 CPU) @ 413.44/s (n=800)
        On sorted and reverse-sorted data, merge sort does much better than quick sort, and also much better than merge sort on random data. All right, I will not try to draw any further conclusions from these tests; I do not feel like digging into the Perl interpreter's C code to check what is really going on.

        Update Apr 25, 2014, 09:45 UTC: I knew I had read it somewhere. The official documentation on the sort pragma (sort) states: "In Perl 5.8 and later, quicksort defends against quadratic behaviour by shuffling large arrays before sorting."
