Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Substring Sort

by Anonymous Monk
on Jun 01, 2001 at 00:29 UTC ( [id://84726]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an array of strings:
@origarray=('1|:|Test','2|:|Hash','3|:|Stuff','4|:|India');
I want to sort these strings by the names, ie. Test Hash India Stuff.
Is there an efficient way of doing this without creating an extra array to store these values in?
Dan

Replies are listed 'Best First'.
(tye)Re: Substring Sort
by tye (Sage) on Jun 01, 2001 at 01:26 UTC

    Well, since you said "efficient", I'll offer my favorite:

    @origarray= @origarray[ do { my @by= map {(split(/\|/,$_,3))[2]} @origarray; sort {$by[$a] cmp $by[$b]} 0..$#origarray } ];
    The above should be about as fast as you can get (though none of that matters unless you are sorting a huge number of items).

    The "obvious" other two choices involve repeatedly extracting the substring to be compared (can be much slower but requires the least amount of memory), or using the Schwartzian Transform which does something similar to my method except it makes a ton of very small arrays w/o names instead of one array with a name (falls between the other two methods in speed but requires the most memory).

    Update 2: Thanks to arhuman for reminding me of "my other favorite sorting scheme". His method is harder for me to compare. I suspect it may be faster and require only slightly more memory than mine above, at least for some cases. In any case, it is at least close.

    I tend to use it when sorting on multiple fields, especially if they constitute most of the data. I use my method above to simultaneously sort multiple related lists (since none of the other methods can do that) -- though you have to save the list returned by sort to do that.

    Updated to add missing ')'.

            - tye (but my friends call me "Tye")
(jeffa) Re: Substring Sort
by jeffa (Bishop) on Jun 01, 2001 at 00:40 UTC
    Well, this doens't pass the 'extra array' requirement, but if your data set is small enough, use a Schwartzian Transform:
    my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { my ($n,$t) = split('|:|',$_,2); [$_,$t] } @origarray; # change this to sort by lower case: [$_,lc($t)]

    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
      I can't test it nor benchmark it* but isn't the following code faster ?

      my @sorted = map { my ($n,$t) = split('|:|',$_,2); $t.'|:|'.$n } sort map { my ($n,$t) = split('|:|',$_,2); $t.'|:|'.$n } @origarray;

      The idea behind is that default sort is always faster according to this excellent article about Efficient Perl sorting
      (I couldn't cite this jewel too much...)

      * I'm at home and my Perl environment is curently down

      "Only Bad Coders Code Badly In Perl" (OBC2BIP)
        Careful there... When you use the default sort and have extra data appended to each string, you have to make sure the extra data doesn't affect the sort order.

        In this case, '|' (ASCII value 124) sorts after most other characters, so that code would sort a string before its prefixes. For example, 'string|extra data' comes after 'string plus|extra data', even though 'string' comes before 'string plus'.

        I think the easiest way to solve this problem is by sticking a null character between the sort string and the extra data. (If the data contains null characters already, it takes a bit more work.)

        @origarray = qw/ third|:|bb second|:|b fourth|:|c first|:|a /; my @sorted = map { /\0(.*)/s } sort map { (split /\|:\|/)[-1] . "\0$_" } @origarray; print "@sorted\n";
        That outer map could be done with substr() and index() instead of a regex.
      Eek, don't forget to escape the pipes in your split. Remember, just because there's quotes around it doesn't mean that it's not getting compiled as a regex.
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
Re: Substring Sort
by MeowChow (Vicar) on Jun 01, 2001 at 02:24 UTC
    Boredom == Benchmarks.

    tye's routine wins overall, however, arhuman's does scale better with huge lists. I also tossed in a method of my own, based on tye's approach, which I was curious about. Not too shabby, but it scales rather poorly.

      When I see such a huge disparity in benchmarks, the first thing I think is "oops, implemented one of them wrong".

      I think the problem here is that myocom's sort is the only one where the return value from the subroutine is also the return value from sort, which means that when that function is called in a scalar context, sort ends up in a scalar context which causes it to just return undef and do no work.

      I added @_= to the subs passed to Benchmark and got these results:

        Thanks for pointing that out! I knew something was wrong with those numbers, but for the life of me, couldn't figure out what. Tossing in a print statement to check the return value obviously didn't help things :-)

        Regarding my sort routine, if you run my updated version (your scores look like the original) on a Perl 5.6.1, you should find that it's the fastest until about 10,000 elements, at which point arhuman's begins to take the lead:

        Comparing for 1000 elements Rate myo st tye ah mc myo 18.4/s -- -61% -72% -78% -83% st 46.9/s 154% -- -29% -43% -58% tye 66.5/s 261% 42% -- -20% -40% ah 82.7/s 348% 76% 24% -- -25% mc 111/s 500% 136% 66% 34% -- Comparing for 2500 elements Rate myo st tye ah mc myo 7.46/s -- -54% -69% -76% -78% st 16.3/s 118% -- -32% -47% -51% tye 24.0/s 222% 48% -- -22% -28% ah 31.0/s 315% 90% 29% -- -8% mc 33.5/s 349% 106% 39% 8% -- Comparing for 10000 elements Rate myo st tye mc ah myo 1.36/s -- -53% -68% -76% -78% st 2.88/s 112% -- -33% -49% -55% tye 4.29/s 214% 49% -- -25% -32% mc 5.69/s 317% 97% 33% -- -10% ah 6.34/s 365% 120% 48% 12% --
        Perl 5.6 gives rather different results though...
           MeowChow                                   
                       s aamecha.s a..a\u$&owag.print

      Correct me if I'm wrong, but I believe the original poster wanted to sort those strings by a substring (rather than just sorting the substring)? So while you folks are going about your fancy, big city mapping, you're throwing away part of the data he wanted. :-)

      Update: Say, they do get reconstructed, don't they. Still seems like an awful lot of code to get what he wanted, though...

      Update part deux: Woohoo! Thanks for adding me into the benchmark. :-)

        If you look at the code, you'll see that the original strings get reconstructed or re-referenced in each case.
           MeowChow                                   
                       s aamecha.s a..a\u$&owag.print
Re: Substring Sort
by Albannach (Monsignor) on Jun 01, 2001 at 00:50 UTC
    print join',',sort {substr($a,rindex($a,'|:|')) cmp substr($b,rindex($b,'|:|'))} @origarray" produces

    2|:|Hash,4|:|India,3|:|Stuff,1|:|Test

    --
    I'd like to be able to assign to an luser

Re: Substring Sort
by myocom (Deacon) on Jun 01, 2001 at 00:53 UTC

    Certainly. You can feed sort a block that tells it how to do the sorting:

    use strict; my @origarray=('1|:|Test','2|:|Hash','3|:|Stuff','4|:|India'); print join("\n",sort {(substr $a,4) cmp (substr $b,4)} @origarray);
Re: Substring Sort
by tachyon (Chancellor) on Jun 01, 2001 at 07:09 UTC

    All this complexity. This works fine, assign instead of print if you want.

    print sort map{/\|(\w+)$/}@origarray;

    tachyon

      That throws away the parts of the strings that we aren't comparing. The point is usually to sort the original strings, just using the substring for the purposes of comparing.

              - tye (but my friends call me "Tye")

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://84726]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (9)
As of 2024-04-18 11:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found