Substring Sort

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(tye)Re: Substring Sort by tye (Sage) on Jun 01, 2001 at 01:26 UTC
Well, since you said "efficient", I'll offer my favorite: `@origarray= @origarray[ do { my @by= map {(split(/\\|/,$_,3))[2]} @origarray; sort {$by[$a] cmp $by[$b]} 0..$#origarray } ];` [download] The above should be about as fast as you can get (though none of that matters unless you are sorting a huge number of items). The "obvious" other two choices involve repeatedly extracting the substring to be compared (can be much slower but requires the least amount of memory), or using the Schwartzian Transform which does something similar to my method except it makes a ton of very small arrays w/o names instead of one array with a name (falls between the other two methods in speed but requires the most memory). Update 2: Thanks to arhuman for reminding me of "my other favorite sorting scheme". His method is harder for me to compare. I suspect it may be faster and require only slightly more memory than mine above, at least for some cases. In any case, it is at least close. I tend to use it when sorting on multiple fields, especially if they constitute most of the data. I use my method above to simultaneously sort multiple related lists (since none of the other methods can do that) -- though you have to save the list returned by sort to do that. Updated to add missing ')'. - tye (but my friends call me "Tye")	[reply] [d/l]
(jeffa) Re: Substring Sort by jeffa (Bishop) on Jun 01, 2001 at 00:40 UTC
Well, this doens't pass the 'extra array' requirement, but if your data set is small enough, use a Schwartzian Transform: `my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { my ($n,$t) = split('\|:\|',$_,2); [$_,$t] } @origarray; # change this to sort by lower case: [$_,lc($t)]` [download] Jeff R-R-R--R-R-R--R-R-R--R-R-R--R-R-R-- L-L--L-L--L-L--L-L--L-L--L-L--L-L--	[reply] [d/l]
Re: (jeffa) Re: Substring Sort by arhuman (Vicar) on Jun 01, 2001 at 01:52 UTC
I can't test it nor benchmark it* but isn't the following code faster ? `my @sorted = map { my ($n,$t) = split('\|:\|',$_,2); $t.'\|:\|'.$n } sort map { my ($n,$t) = split('\|:\|',$_,2); $t.'\|:\|'.$n } @origarray;` [download] The idea behind is that default sort is always faster according to this excellent article about Efficient Perl sorting (I couldn't cite this jewel too much...) * I'm at home and my Perl environment is curently down "Only Bad Coders Code Badly In Perl" (OBC2BIP)	[reply] [d/l]
Re: Re: (jeffa) Re: Substring Sort by chipmunk (Parson) on Jun 01, 2001 at 05:21 UTC
Careful there... When you use the default sort and have extra data appended to each string, you have to make sure the extra data doesn't affect the sort order. In this case, '\|' (ASCII value 124) sorts after most other characters, so that code would sort a string before its prefixes. For example, 'string\|extra data' comes after 'string plus\|extra data', even though 'string' comes before 'string plus'. I think the easiest way to solve this problem is by sticking a null character between the sort string and the extra data. (If the data contains null characters already, it takes a bit more work.) `@origarray = qw/ third\|:\|bb second\|:\|b fourth\|:\|c first\|:\|a /; my @sorted = map { /\0(.*)/s } sort map { (split /\\|:\\|/)[-1] . "\0$_" } @origarray; print "@sorted\n";` [download] That outer map could be done with substr() and index() instead of a regex.	[reply] [d/l]
Re: (jeffa) Re: Substring Sort by MeowChow (Vicar) on Jun 01, 2001 at 02:08 UTC
Eek, don't forget to escape the pipes in your `split`. Remember, just because there's quotes around it doesn't mean that it's not getting compiled as a regex. MeowChow s aamecha.s a..a\u$&owag.print	[reply]
Re: Substring Sort by MeowChow (Vicar) on Jun 01, 2001 at 02:24 UTC
Boredom == Benchmarks. tye's routine wins overall, however, arhuman's does scale better with huge lists. I also tossed in a method of my own, based on tye's approach, which I was curious about. Not too shabby, but it scales rather poorly. Read more... (6 kB)	[reply] [d/l] [select]
(tye)Re3: Substring Sort by tye (Sage) on Jun 01, 2001 at 07:30 UTC
When I see such a huge disparity in benchmarks, the first thing I think is "oops, implemented one of them wrong". I think the problem here is that myocom's sort is the only one where the return value from the subroutine is also the return value from sort, which means that when that function is called in a scalar context, sort ends up in a scalar context which causes it to just return undef and do no work. I added `@_=` to the subs passed to Benchmark and got these results: Read more... (3 kB)	[reply] [d/l] [select]
Re: (tye)Re3: Substring Sort by MeowChow (Vicar) on Jun 01, 2001 at 08:58 UTC
Thanks for pointing that out! I knew something was wrong with those numbers, but for the life of me, couldn't figure out what. Tossing in a `print` statement to check the return value obviously didn't help things :-) Regarding my sort routine, if you run my updated version (your scores look like the original) on a Perl 5.6.1, you should find that it's the fastest until about 10,000 elements, at which point arhuman's begins to take the lead: Comparing for 1000 elements Rate myo st tye ah mc myo 18.4/s -- -61% -72% -78% -83% st 46.9/s 154% -- -29% -43% -58% tye 66.5/s 261% 42% -- -20% -40% ah 82.7/s 348% 76% 24% -- -25% mc 111/s 500% 136% 66% 34% -- Comparing for 2500 elements Rate myo st tye ah mc myo 7.46/s -- -54% -69% -76% -78% st 16.3/s 118% -- -32% -47% -51% tye 24.0/s 222% 48% -- -22% -28% ah 31.0/s 315% 90% 29% -- -8% mc 33.5/s 349% 106% 39% 8% -- Comparing for 10000 elements Rate myo st tye mc ah myo 1.36/s -- -53% -68% -76% -78% st 2.88/s 112% -- -33% -49% -55% tye 4.29/s 214% 49% -- -25% -32% mc 5.69/s 317% 97% 33% -- -10% ah 6.34/s 365% 120% 48% 12% -- [download] Perl 5.6 gives rather different results though... MeowChow s aamecha.s a..a\u$&owag.print	[reply] [d/l]
Re: Re: Substring Sort by myocom (Deacon) on Jun 01, 2001 at 03:22 UTC
Correct me if I'm wrong, but I believe the original poster wanted to sort those strings by a substring (rather than just sorting the substring)? So while you folks are going about your fancy, big city mapping, you're throwing away part of the data he wanted. :-) Update: Say, they do get reconstructed, don't they. Still seems like an awful lot of code to get what he wanted, though... Update part deux: Woohoo! Thanks for adding me into the benchmark. :-)	[reply]
Re: Re: Re: Substring Sort by MeowChow (Vicar) on Jun 01, 2001 at 03:25 UTC
If you look at the code, you'll see that the original strings get reconstructed or re-referenced in each case. MeowChow s aamecha.s a..a\u$&owag.print	[reply]
Re: Substring Sort by Albannach (Monsignor) on Jun 01, 2001 at 00:50 UTC
`print join',',sort {substr($a,rindex($a,'\|:\|')) cmp substr($b,rindex($b,'\|:\|'))} @origarray"` produces `2\|:\|Hash,4\|:\|India,3\|:\|Stuff,1\|:\|Test` -- I'd like to be able to assign to an luser	[reply] [d/l] [select]
Re: Substring Sort by myocom (Deacon) on Jun 01, 2001 at 00:53 UTC
Certainly. You can feed sort a block that tells it how to do the sorting: `use strict; my @origarray=('1\|:\|Test','2\|:\|Hash','3\|:\|Stuff','4\|:\|India'); print join("\n",sort {(substr $a,4) cmp (substr $b,4)} @origarray);` [download]	[reply] [d/l]
Re: Substring Sort by tachyon (Chancellor) on Jun 01, 2001 at 07:09 UTC
All this complexity. This works fine, assign instead of print if you want. `print sort map{/\\|(\w+)$/}@origarray;` [download] tachyon	[reply] [d/l]
(tye)Re4: Substring Sort by tye (Sage) on Jun 01, 2001 at 18:31 UTC
That throws away the parts of the strings that we aren't comparing. The point is usually to sort the original strings, just using the substring for the purposes of comparing. - tye (but my friends call me "Tye")	[reply]