robsv has asked for the wisdom of the Perl Monks concerning the following question:
I am calling an external routine which returns a string containing 2, 3, or 4 letters in the order in which they are read. I need to sort this string before outputting it. For example, if $bases = 'GCT', I need to change $bases to 'CGT' (the fine print: I'm playing with DNA, so the alphabet is 'ATCGN').
I'm currently doing this: $bases = join '',sort split('',$bases);
...which seems like a bit of overkill if the string will always be 2-4 characters. Since There's More Than One Way To Do It, I was wondering what other ways there were to do it. (This isn't meant to be a Golf question, but golfers are welcome!)
- robsv
Re: Sorting characters within a string
by kjherron (Pilgrim) on Aug 24, 2001 at 04:13 UTC
I can think of a couple other ways to do it, but they're both worse than yours unless you're having performance problems:
1) Generate every possible string and its sorted version, storing them in a hash with the unsorted string as the key & the sorted string as the value. There's only, what, 45 possible strings? That's doable.
2) Split the string into characters, count the number of each character, then output the characters in order based on the counts. This is O(n) so it'd be a win if your strings were really long, but it's just overkill for these short strings.
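The counting idea in 2) can be sketched as follows. This is only an illustration, not code from the thread: the sub name sort_bases_by_count and the fixed output order in @order are assumptions, and any character outside 'ACGNT' is silently dropped.

```perl
use strict;
use warnings;

# Output order for the DNA alphabet (assumed; plain alphabetical order).
my @order = qw(A C G N T);

# Hypothetical helper: count each letter, then emit the letters in @order.
sub sort_bases_by_count {
    my ($bases) = @_;
    my %count;
    $count{$_}++ for split //, $bases;
    # Repeat each letter as many times as it was seen (zero times if absent).
    return join '', map { $_ x ($count{$_} || 0) } @order;
}

print sort_bases_by_count('GCT'), "\n";   # prints CGT
```

There are no comparisons between letters here, only one pass to count and one pass over the five-letter alphabet to write the result.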
If performance is a problem, a fairly painless thing to do is cache the sorted strings as you calculate them:
if (!exists $sort_cache{$bases}) {
$sort_cache{$bases} = join( '', sort split('', $bases));
}
return $sort_cache{$bases};
This is of course just a lazy variant on #1 above.
I think you're right that it would be more work initially
for the "all possibilities" hash. I'm no
mathematician/statistician but I think there are more like
5! = 120 possibilities (and I certainly wouldn't want to
build that hash by hand). Tilly, you're a mathematician.
What is the correct number of possibilities?
Building the hash programmatically would be an interesting
brain teaser.
Update: This was assuming string lengths of up to 5.
If the code and the comments disagree, then both are probably wrong. -- Norm Schryer
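{
    my @c = qw(A T C G N);
    my @strings = @c;
    foreach (1..5) {
        foreach (@strings) {
            # $_ is the current string; store its sorted form.
            $sorted_str{$_} = join '', sort split //, $_;
        }
        # Extend every string by each letter of the alphabet.
        @strings = map {
            my $string = $_;
            map $string.$_, @c;
        } @strings;
    }
}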
Note that the nested map will be much slower than you
think if you are pre 5.6.1. Personally I would be
inclined to use the Orcish (for "Or Cache") maneuver for
this:
$bases = $sorted{$bases}
    ||= join '', sort split //, $bases;
Building the hash programmatically would be an interesting brain teaser.
Here is the worst way to do it:
my @strings = (grep /[acgnt]{2}/, ('aa' .. 'tt'),
               grep /[acgnt]{3}/, ('aaa' .. 'ttt'),
               grep /[acgnt]{4}/, ('aaaa' .. 'tttt'));
my %sort_cache;
for my $key (@strings) {
    $sort_cache{$key} = join '',sort split('',$key);
}
Hey, don't take this seriously ;-) it does the job but it's so inefficient it's scary. Guillaume
Boy, I really suck at this. One more try assuming strings
of length 2-4:
length of 2: 5 * 4 = 20
length of 3: 5 * 4 * 3 = 60
length of 4: 5 * 4 * 3 * 2 = 120
total of 20 + 60 + 120 = 200 possibilities.
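The tally above counts only strings in which no letter appears twice. As a quick check, and assuming the reader might return the same base more than once (the thread never settles this), here is a minimal sketch that computes both totals. The sub name count_strings is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count strings of length $min..$max over an alphabet
# of $q letters, both without and with repeated letters.
sub count_strings {
    my ($q, $min, $max) = @_;
    my ($distinct, $with_repeats) = (0, 0);
    for my $len ($min .. $max) {
        # q * (q-1) * ... for $len factors: ordered picks with no repeats.
        my $perms = 1;
        $perms *= $q - $_ for 0 .. $len - 1;
        $distinct     += $perms;
        # q ** len: every position may hold any letter.
        $with_repeats += $q ** $len;
    }
    return ($distinct, $with_repeats);
}

my ($distinct, $with_repeats) = count_strings(5, 2, 4);
print "no repeated letters: $distinct\n";      # 200
print "repeats allowed:     $with_repeats\n";  # 775
```

If repeats are possible, a full lookup hash needs 25 + 125 + 625 = 775 keys rather than 200.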
If the code and the comments disagree, then both are probably wrong. -- Norm Schryer
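#!/usr/bin/perl
use strict;
use warnings;
my(%pp);
# Digit 0 maps to a blank (padding only); digits 1-5 map to the bases.
my(@acgnt)=( ' ', 'A', 'C', 'G', 'N', 'T' );
my($i);
# Four digits are enough for strings of 2 to 4 letters.
for($i=11;$i<10000;$i++)
{
    my($s, $o, @s);
    # Skip numbers containing a 6: bump that digit so it rolls over to 1.
    while($i =~ /6/)
    {
        $o=index(scalar reverse($i),'6');
        $i+=5*10**$o;
    }
    last if $i >= 10000;
    $s=sprintf "%04d", $i;
    @s=split('',$s);
    @s = map { $acgnt[$_] } @s;
    $s=join('', @s);
    $s =~ y/ //d;
    # Sort the letters only, so the padding blanks stay out of the value.
    $pp{$s}=join('', sort(split('', $s)));
}
#print out the lookup table (not really part of the initializer)
my($k, $v);
while(($k,$v)=each %pp)
{
    print "$k = $v\n";
}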
This creates a complete list of inputs you could obtain and builds a hash with the outputs you want to display.
It does this fairly quickly and would only have to be done at startup time and then your print statement would basically be print "$pp{$_}\n";
This could be made into an initializer function or the values could be computed and saved out and then read in for execution of the real program.
Re: Sorting characters within a string
by clintp (Curate) on Aug 24, 2001 at 05:06 UTC
If we're going for raw speed: don't use perl. :)
If I were doing this in assembly, and I wanted raw speed I'd:
- Generate all of the possible combinations and their sorted values, like so: AA => AA, AB => AB, BA => AB, AC => AC, CA => AC.
- Generate code (don't write it by hand!) that does something along the lines of pseudocode which works for aa, ab, ba, and bb:
if (substr($base,0,1) eq 'a') {
    if (substr($base,1,1) eq 'a') {
        return 'aa';
    }
    if (substr($base,1,1) eq 'b') {
        return 'ab';
    }
}
if (substr($base,0,1) eq 'b') {
    if (substr($base,1,1) eq 'a') {
        return 'ab';
    }
    if (substr($base,1,1) eq 'b') {
        return 'bb';
    }
}
Which means that for any possible code of length n and an alphabet of length q there are only n*q possible comparisons/jumps to be made in the worst case. (AGCA would be translated to AACG using only 7 comparisons and jumps total, for example.)
I'm fairly confident that this would outperform any solution using a hash or a split/join/sort. At least, in assembler. I'm just a little too harried to write code to prove that it might be faster in Perl.
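For anyone curious what "generate code, don't write it by hand" might look like in Perl, here is a rough sketch under some assumptions: the names emit_tests, %sorter_for and sort_bases_generated are invented for this example, one sub is generated per string length (2, 3 and 4), and the generated text is compiled with string eval. It makes no claim about speed.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G N T);

# Recursively emit nested if-blocks. $prefix holds the letters tested so far;
# once it reaches $len letters, emit a return of its sorted form.
sub emit_tests {
    my ($prefix, $len) = @_;
    my $pos = length $prefix;
    if ($pos == $len) {
        my $sorted = join '', sort split //, $prefix;
        return "return '$sorted';\n";
    }
    my $code = '';
    for my $letter (@alphabet) {
        $code .= "if (substr(\$_[0], $pos, 1) eq '$letter') {\n"
               . emit_tests($prefix . $letter, $len)
               . "}\n";
    }
    return $code;
}

# Compile one generated sub per supported string length.
my %sorter_for;
for my $len (2 .. 4) {
    my $body = emit_tests('', $len);
    $sorter_for{$len} = eval "sub { $body return; }"
        or die "generated code failed to compile: $@";
}

# Dispatch on length; returns nothing for unsupported lengths or letters.
sub sort_bases_generated {
    my ($bases) = @_;
    my $sorter = $sorter_for{ length $bases } or return;
    return $sorter->($bases);
}

print sort_bases_generated('AGCA'), "\n";   # prints AACG
```

Each generated sub tests one character position per nesting level, so a lookup costs at most five comparisons per letter, which matches the n*q bound described above.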