http://qs321.pair.com?node_id=212830

This should perhaps go ino a different section - and if a sufficiently blessed personage decides to move if then I have no problem with that

I have been thinking about the famed Schwartzian Transform as it is coming in extremely handy for some stuff I want to do. However, as seems to have been noted elsewhere*, one problem with the ST is that it is not exactly memory efficient in that you end up with a number of temporary copies of the array floating around requiring garbage collection. This is merely inconvenient for small arrays and small elements - but if you are sorting (as I am) large arrays of large complex structures (the results of XML::Simpled:XMLin() if you must know), then this can cause exciting disk swaps (and since I'm testing this on a windows PC even more exciting crashes when something fails to handle failed malloc()s properly) which rather defeats the point of the ST.

Once upon a time, I learned APL - a language which could perhaps be described as perl-like only terser - and APL has a built in array sort mchanism. The interesting thing about APL's sort routine (⍋ or ⍒ aka Del(ta) Stile) is that it returns you an array of indices into the original array rather than the sorted array itself. Thus, instead of doing (in perl notation)

my @sortedarray = sort @unsortedarray;
you actually do
my @sortedarray = @unsortedarray[sort @unsortedarray];
It occured to me that this idea could be used in the ST so that what gets passed around in the map{} sort{} map{} is in fact [$sort_target, $index] rather than [$sort_target, $entire_element] as this is much much smaller.

The code to do this is below and seems to both use less memory and be slightly quicker although I have yet to run this on a truly huge set of data

my @sorted = @unsorted[ map {$$_[1]} sort {$$b[0]<=>$$a[0]} map [expensive($unsorted[$_]), $_]; } 0..$#unsorted ];
Additional thought: (Inspired by diotalevi) There are probably a number of cases where you actually want to make further manipulations to the index to the data rather than the data itself. In such cases, the  @unsorted[ ] may be omitted.

Comments, (constructive) abuse, praise etc. welcome...

Dingus
*There is almost certainly a better discussion - (update) some others are noted in the replies below
(update misc typos fixed)


Enter any 47-digit prime number to continue.

Replies are listed 'Best First'.
Re: An APL trick for the Schwartzian Transform
by diotalevi (Canon) on Nov 14, 2002 at 14:39 UTC

    Neat. Only very rarely have I actually cared about perl's memory usage and probably all five times it's been for a very large sorting job. (I define large as being four times the size of memory, in this case two one gig text files). I've often found Knuth's The Art of Computer Programming vol 2, Sorting and Searching to be an excellent source of ideas when the common sort() isn't good enough. I recommend it as an excellent resource for ideas.

    For actual solutions I normally look for things like a disk-base db file (usually BerkeleyDB) for an external sort or some variation on partitioning and/or radix sort. My favorite and least effort trick when sorting voting records from the Secretary of State (by precinct and by voter id) just involves doing a combination of a radix and external sort: find a key field (I choose voting precinct) that can partition the data into appropriate sized bins and do just insert into the appropriate files (one per precinct).

    Once all the files are complete I do an in-memory plain-jane sort() on the data and write it back out over the previously unsorted partition. At this point I can either stop and decide to access the data from the individual partitions as-is or just cat(1) the files together and recreate a full (approximately 1 gig) text file. That's only a scetch but there are plenty of other great ideas in just the sorting section of TAoCP. It's worth checking out.

    Pre-publication update: broquaint pointed me to the excellent node resorting to sorting which does a nice overview of some common sorts.

    __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
Re: An APL trick for the Schwartzian Transform
by VSarkiss (Monsignor) on Nov 14, 2002 at 16:18 UTC

    No discussion of making ST more efficient would be complete without mentioning the Guttman-Rosler Transform. GRT is similar to your idea, except it modifies the representation of the data (using pack and unpack) to use less memory or to make the sort simpler.

Re: An APL trick for the Schwartzian Transform
by petral (Curate) on Nov 14, 2002 at 19:30 UTC
    Here's another discussion: Re: Substring Sort.  (Even though it's about speed, it still suggests alternatives similar to yours (which I like, btw).)

      p
Re: An APL trick for the Schwartzian Transform
by John M. Dlugosz (Monsignor) on Nov 14, 2002 at 19:46 UTC
    I wonder if in Perl6 it will be more practical to attach the cached sort key to the item using a property, instead of making an array of tuples. That would prevent making copies of the data.

    I also think that the behavior of ⍋ (\N{APL FUNCTIONAL SYMBOL DEL STYLE}, if that's not in your font) is somehow more primitive, and this ought to be available directly. More variations of sorting could be made using this than around our existing sort function.

    —John

Re: An APL trick for the Schwartzian Transform
by Aristotle (Chancellor) on Nov 16, 2002 at 02:28 UTC

    Actually, if you’re at that point, you can make things a whole lot more efficient:

    my @sorted = @unsorted[ do { my @cache = map key_from( $_ ), @unsorted; sort { $cache[$b] <=> $cache[$a] } 0 .. $#cache } ];

    The Schwartzian Transform suffers severely from creating a huge number of tiny anonymous arrays. This form creates only one extra array and one extra list. The larger the array and the less expensive the key calculation routine, the larger the impact this will have.

    Makeshifts last the longest.