Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

sort behavior explanation required

by ghosh123 (Monk)
on Apr 20, 2014 at 15:02 UTC ( #1082944=perlquestion: print w/replies, xml ) Need Help??

ghosh123 has asked for the wisdom of the Perl Monks concerning the following question:

When I have an array which comprises both strings and integers , then applying the sort function in below mentioned ways , gives different output. Could anyone please explain me on what basis the sorting is happening in the following two cases :

@arr = ("jack", 80, "martin", 3, "allan", 'george'); @sort = sort { $a cmp $b } @arr; print "@sort \n";

output :
3 80 allan george jack martin

If the sort function is changed to

@sort = sort { $a <=> $b } @arr ;

then the output changes to
jack martin allan george 3 80

Replies are listed 'Best First'.
Re: sort behavior explanation required
by davido (Cardinal) on Apr 20, 2014 at 16:13 UTC

    In the first case you're doing stringwise comparisons (cmp does this), and the default behavior for those types of comparisons is ASCII or Code Point order.

    In the second case you are doing numeric comparisons. Perl happily obliges by trying to convert every string to a number. So you need to come to an understanding of how Perl converts strings to numbers. This is described in detail in perldata. But the simplest way to think about it is that a string that starts with numeric digits optionally preceded with a sign or decimal point will be converted to a non-zero number. Any other string will be converted to 0 (a small oversimplification, but reasonably close).

    So in your case, "jack", "martin", "allan", and "george" are all converted to the numeric value "0", while 80 and 3 are taken for their inherent values. That leaves only one question; what happens when several elements that all evaluate to numeric zero are sorted? Internally, modern Perls use a Merge Sort algorithm, which is a "stable stort." Stable sorts maintain original order of elements that have equal values. So your strings "jack", "martin", "allan", and "george" retain their original order. And since they all evaluate to zero, the actual non-zero numbers come after them.

    Perl hasn't always used the merge sort internally. Older Perl versions (5.6 and earlier, perhaps, but I don't remember for sure) used Quicksort. Quicksort had two disadvantages though. The first, and most commonly discussed is its propensity for going quadratic in computational time given some pathological inputs (O(n^2)), while still exhibiting an average case of O(n log n). Merge sort is guaranteed of a worst-case of O(n log n). But the other important difference (and disadvantage to QuickSort) is that it is not "stable", meaning that the order of equal-valued elements is not preserved. "jack", "martin", "allan", and "george" could end up in some other order, which is very difficult to predict ahead of time. The one advantage to Quicksort is that there is a little less work done on each iteration, so runtime on average can be a few cycles less. But that is usually so insignificant as to not really matter, and certainly not worth dealing with the downsides of possible O(n^2) time complexity for some data sets, and non-stable sorting of equivalent values.


    Dave

      To illustrate how the difference between quick sort and merge sort can be huge, this is a small benchmark of the two algorithms implemented in pure Perl, sorting 5000 integers 80 times.

      With an array in which the numbers have been generated with the random function:

      $ perl bench_sort.pl Benchmark: timing 80 iterations of merge, quick... merge: 3 wallclock secs ( 3.76 usr + 0.00 sys = 3.76 CPU) @ 21 +.28/s (n=80) quick: 2 wallclock secs ( 1.53 usr + 0.00 sys = 1.53 CPU) @ 52 +.32/s (n=80)
      So, with random input, my own pure Perl implementation of quick sort is more than twice faster than my implementation of merge sort.

      Now, lo and behold, see what happens if I run the same algorithms on an already sorted array:

      $ perl bench_sort.pl Benchmark: timing 80 iterations of merge, quick... merge: 3 wallclock secs ( 3.40 usr + 0.00 sys = 3.40 CPU) @ 23 +.53/s (n=80) quick: 199 wallclock secs (198.93 usr + 0.12 sys = 199.06 CPU) @ + 0.40/s (n=80)
      Quick sort is now 58 times slower than merge sort, and about 130 times slower than on randomly generated data.

      The same happens if the data is in reverse order:

      $ perl bench_sort.pl Benchmark: timing 80 iterations of merge, quick... merge: 3 wallclock secs ( 3.56 usr + 0.00 sys = 3.56 CPU) @ 22 +.50/s (n=80) quick: 199 wallclock secs (197.97 usr + 0.06 sys = 198.03 CPU) @ + 0.40/s (n=80)

      Admittedly, such pathological behavior of quick sort is very rare with random data, but having to sort data that is almost sorted is a very common task in real life. A typical example is the Association of Tennis Professionals (ATP) rankings: new rankings are published once a week or so. From one week to the next there will usually be relatively small differences. So that if you take the ranking of last week, update the points of each player and do the sorting again, the data is almost sorted. Quick sort would be very bad for that (actually bubble sort would do much better in such a situation). The same would happen when recalculating the index of a database table after having done a few additions or deletions. It has been suggested that, for quick sort to be efficient, one need to have a first phase of randomization of the array to be sorted.

        As always, the Devil lurks in the details. Can we see bench_sort.pl?

      And if you really want Quicksort, you can have it. See core pragma sort.

Re: sort behavior explanation required
by lidden (Curate) on Apr 20, 2014 at 15:06 UTC
    The second sort is numerical and when sorted numerical all those names are considered by perl to be zero.

    Edit: The first sort is sorted by the ascii value of the first char in the string and if that one is the same it looks at the next one.

Re: sort behavior explanation required
by AnomalousMonk (Bishop) on Apr 20, 2014 at 16:03 UTC

    Further to lidden's reply:
    Perl would have given you a clue to what was going on had you allowed it to by enabling warnings.

    c:\@Work\Perl\monks>perl -wMstrict -le "my @arr = ('jack', 80, 'martin', 3, 'allan', 'george'); ;; my @sort = sort { $a <=> $b } @arr; print qq{@sort}; " Argument "jack" isn't numeric in sort at -e line 1. Argument "martin" isn't numeric in sort at -e line 1. Argument "allan" isn't numeric in sort at -e line 1. Argument "george" isn't numeric in sort at -e line 1. jack martin allan george 3 80

    See also Lexicographical order for more info on the  cmp sort.

    Update: See also Scalar values in perldata. Also, consider the following:

    c:\@Work\Perl\monks>perl -wMstrict -le "my @ra = qw(foo 2bar 123baz quux999 x999x); for my $s ('', @ra) { no warnings 'numeric'; print qq{'$s' == }, 0+$s; } " '' == 0 'foo' == 0 '2bar' == 2 '123baz' == 123 'quux999' == 0 'x999x' == 0

Re: sort behavior explanation required
by Anonymous Monk on Apr 21, 2014 at 06:24 UTC

    Could anyone please explain me on what basis the sorting is happening in the following two cases

    The documentation can explain it if you read it :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1082944]
Approved by lidden
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (1)
As of 2022-05-19 03:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (71 votes). Check out past polls.

    Notices?