Sorting characters within a string

robsv has asked for the wisdom of the Perl Monks concerning the following question:

Re: Sorting characters within a string by kjherron (Pilgrim) on Aug 24, 2001 at 04:13 UTC
I can think of a couple of other ways to do it, but they're both worse than yours unless you're having performance problems: 1) Generate every possible string and its sorted version, storing them in a hash with the unsorted string as the key and the sorted string as the value. There's only, what, 45 possible strings? That's doable. 2) Split the string into characters, count the number of each character, then output the characters in order based on the counts. This is O(n), so it'd be a win if your strings were really long, but it's overkill for these short strings. If performance is a problem, a fairly painless thing to do is cache the sorted strings as you calculate them: `if (!exists $sort_cache{$bases}) { $sort_cache{$bases} = join('', sort split('', $bases)); } return $sort_cache{$bases};` This is of course just a lazy variant on #1 above.
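Option 2 above (count the characters, then emit them in order) could be sketched roughly like this. The sub name `count_sort` and the `A C G N T` alphabet are assumptions for illustration; the alphabet is borrowed from later replies in this thread, not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: sort the characters of $str by counting them.
# @alphabet is assumed to be the allowed characters, already in sorted order.
# Characters that are not in @alphabet are silently dropped.
sub count_sort {
    my ($str, @alphabet) = @_;
    my %count;
    $count{$_}++ for split //, $str;    # one pass over the input
    # Emit each alphabet character as many times as it was seen.
    return join '', map { $_ x ($count{$_} || 0) } @alphabet;
}

print count_sort('GATTACA', qw(A C G N T)), "\n";    # prints AAACGTT
```

The work is one pass over the string plus one pass over the alphabet, which is where the O(n) comes from. For strings of 2 to 4 characters the plain `sort split` is likely just as good.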
Re: Re: Sorting characters within a string by jlongino (Parson) on Aug 24, 2001 at 05:05 UTC
I think you're right that it would be more work initially for the "all possibilities" hash. I'm no mathematician/statistician, but I think there are more like 5! = 120 possibilities (and I certainly wouldn't want to build that hash by hand). Tilly, you're a mathematician. What is the correct number of possibilities? Building the hash programmatically would be an interesting brain teaser. Update: This was assuming string lengths of up to 5. If the code and the comments disagree, then both are probably wrong. -- Norm Schryer
Re (tilly) 3: Sorting characters within a string by tilly (Archbishop) on Aug 24, 2001 at 06:02 UTC
Are duplicates allowed? If so, then the correct number for length 1 is 5, for 2 is 5*5=25, for 3 is 5*5*5=125, and for 4 is 5*5*5*5=625. For all strings of length 2-4 that comes out to a grand total of 775. Were I autogenerating, my approach might be as follows (untested): `{ my @c = qw(A T C G N); my @strings = @c; foreach (1..5) { foreach (@strings) { $sorted_str{$_} = join '', sort split //; } @strings = map { my $string = $_; map $string.$_, @c; } @strings; } }` Note that the nested map will be much slower than you think if you are pre 5.6.1. Personally I would be inclined to use the Orcish (for "Or Cache") maneuver for this: `$bases = $sorted{$bases} ||= join '', sort split //, $bases;`
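As a sanity check on those counts, here is a minimal sketch that enumerates every string of length 2-4 and tallies the table. It assumes the five-letter alphabet `A C G N T` used in this thread; the variable names are made up for the example.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G N T);    # assumed alphabet, taken from the replies
my %sorted_for;                  # unsorted string => sorted string
my @strings = @alphabet;         # all strings of the current length

for my $len (1 .. 4) {
    if ($len >= 2) {
        # Record the sorted form of every string of this length.
        $sorted_for{$_} = join '', sort split //, $_ for @strings;
    }
    # Extend every string by one character to get the next length.
    @strings = map { my $s = $_; map { $s . $_ } @alphabet } @strings;
}

my %distinct = map { $_ => 1 } values %sorted_for;
printf "%d inputs, %d distinct sorted outputs\n",
    scalar(keys %sorted_for), scalar(keys %distinct);
# prints: 775 inputs, 120 distinct sorted outputs
```

So the lookup table has 775 keys, but they fold down to only 120 different sorted values (15 + 35 + 70 for lengths 2, 3 and 4).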
Re: Re: Re: Sorting characters within a string by guillaume (Pilgrim) on Aug 24, 2001 at 05:45 UTC
"Building the hash programmatically would be an interesting brain teaser." Here is the worst way to do it: `my @strings = ((grep { /^[acgmt]{2}$/ } 'aa' .. 'tt'), (grep { /^[acgmt]{3}$/ } 'aaa' .. 'ttt'), (grep { /^[acgmt]{4}$/ } 'aaaa' .. 'tttt')); my %sort_cache; for my $key (@strings) { $sort_cache{$key} = join '', sort split('', $key); }` Hey, don't take this seriously ;-) It does the job, but it's so inefficient it's scary. Guillaume
Re: Re: Re: Re: Sorting characters within a string by jlongino (Parson) on Aug 24, 2001 at 05:55 UTC
Re: Re: Re: Sorting characters within a string by jlongino (Parson) on Aug 24, 2001 at 05:33 UTC
Boy, I really suck at this. One more try, assuming strings of length 2-4: length of 2: 5 * 4 = 20; length of 3: 5 * 4 * 3 = 60; length of 4: 5 * 4 * 3 * 2 = 120; for a total of 20 + 60 + 120 = 200 possibilities. If the code and the comments disagree, then both are probably wrong. -- Norm Schryer
Re: Re: Sorting characters within a string by dga (Hermit) on Aug 24, 2001 at 23:42 UTC
Nice idea to precompute the values. I got 3901 which represents the entire set of 2-4 letter long unsorted inputs in this alphabet. This of course folds to a very small number of sorted outcomes. Here is the code. `#!/usr/bin/perl use strict; use warnings; my(%pp); my(@acgnt)=( ' ', 'A', 'C', 'G', 'N', 'T' ); my($i); for($i=11;$i<100000;$i++) { my($s, $o, @s); while($i =~ /6/) { $o=index(reverse($i),'6'); $i+=5*10**$o; } $s=sprintf "%04d", $i; @s=split('',$s); @s = map { $acgnt[$_] } @s; $s=join('', @s); $s =~ y/ //d; $pp{$s}=join('', sort(split('', $s))); } #print out the lookup table (not really part of the initializer) my($k, $v); while(($k,$v)=each %pp) { print "$k = $v\n"; }` This creates a complete list of inputs you could obtain and builds a hash with the outputs you want to display. It does this fairly quickly and would only have to be done at startup time, and then your print statement would basically be `print "$pp{$_}\n";` This could be made into an initializer function, or the values could be computed and saved out and then read in for execution of the real program.
Re: Sorting characters within a string by clintp (Curate) on Aug 24, 2001 at 05:06 UTC
If we're going for raw speed: don't use Perl. :) If I were doing this in assembly, and I wanted raw speed, I'd: Generate all of the possible combinations and their sorted values, like so: AA => AA, AB => AB, BA => AB, AC => AC, CA => AC. Generate code (don't write it by hand!) that does something along the lines of this pseudocode, which works for aa, ab, ba, and bb: `if (substr($base,0,1) eq 'a') { if (substr($base,1,1) eq 'a') { return 'aa'; } if (substr($base,1,1) eq 'b') { return 'ab'; } } if (substr($base,0,1) eq 'b') { if (substr($base,1,1) eq 'a') { return 'ab'; } if (substr($base,1,1) eq 'b') { return 'bb'; } }` Which means that for any possible code of length n and an alphabet of size q there are at most n*q comparisons/jumps to be made in the worst case. (AGCA would be translated to AACG using only 7 comparisons and jumps total, for example.) I'm fairly confident that this would outperform any solution using a hash or a split/join/sort. At least, in assembler. I'm just a little too harried to write code to prove that it might be faster in Perl.
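For the "generate code, don't write it by hand" step, one possible sketch in Perl is below. The generator `gen_tests`, the alphabet `A C G N T`, and the choice to compile the result with string `eval` are all assumptions for illustration; the post itself only gives the hand-written pseudocode above, and nothing here says this is faster than a hash in Perl.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G N T);    # assumed alphabet, in sorted order

# Hypothetical generator: returns Perl source for the nested tests that
# handle every string of length $len beginning with $prefix.
sub gen_tests {
    my ($prefix, $len) = @_;
    my $pos = length $prefix;
    if ($pos == $len) {
        # All characters are known here, so the answer is a constant.
        my $sorted = join '', sort split //, $prefix;
        return "return '$sorted';\n";
    }
    my $src = '';
    for my $ch (@alphabet) {
        $src .= "if (substr(\$base, $pos, 1) eq '$ch') {\n"
              . gen_tests($prefix . $ch, $len)
              . "}\n";
    }
    return $src;
}

# Build and compile a sorter for 2-character strings.
my $source = "sub { my (\$base) = \@_;\n" . gen_tests('', 2) . "return undef;\n}";
my $sort2  = eval $source or die "generated code failed to compile: $@";

print $sort2->('CA'), "\n";    # prints AC
```

Passing 4 instead of 2 produces the full nested test for 4-character strings (625 leaf returns). Input containing a character outside the alphabet falls through every test and yields undef.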

