Oh no, sorry! That was a "duh" moment from me! :)
How should the script be modified to use the original IDs?
I did add a few lines to your script, to reflect the possibility of Ns in the input, and those lines seem to work. I also just discovered that there appear to be some OTHER characters apart from A/C/G/T/N. The OTHERS could be one of R/Y/S/W/K/M/B/D/H/V. What is your advice for how to account for these? I hard coded those changes in a modified version of your script, as shown below, does it look alright?
Importantly, what is the part of your script that generates randomness in the sequence? Is it the map(letter(), 1 .. $count)?
I ask because of 2 reasons:
1. The goal is to randomly shuffle DNA, which is the obvious reason.
2. So that I can be sure that any two outputs with the same input are going to be different, due to this randomness
Many thanks!
#!/usr/bin/perl
#tybalt89_DNAfreq_Random_Generator.pl
# https://perlmonks.org/?node_id=1228191
use strict;
use warnings;
my $window = 1e6;
my $A = my $C = my $G = my $T = my $N = my $all = 0;
my $R=0; my $Y=0; my $S=0; my $W=0; my $K=0; my $M=0; my $B=0; my $D=0
+; my $H=0; my $V=0;
my (@sizes, $tmp, $start);
my $in = shift @ARGV;
my $out = shift @ARGV;
# print $in, "\n";
# print $out, "\n";
open IN, '<', $in or die "$! opening $in";
open OUT, '>', $out or die "$! opening $out";
sub letter
{
my $n = int rand $all--;
$n < $A ? ($A--, return 'A') :
$n < $A + $C ? ($C--, return 'C') :
$n < $A + $C + $G ? ($G--, return 'G') :
$n < $A + $C + $G + $T ? ($T--, return 'T') :
$n < $A + $C + $G + $T + $R ? ($R--, return 'R') :
$n < $A + $C + $G + $T + $R + $Y ? ($Y--, return 'Y') :
$n < $A + $C + $G + $T + $R + $Y + $S ? ($S--, return 'S') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W ? ($W--, return 'W') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K ? ($K--, return 'K
+') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M ? ($M--, retu
+rn 'M') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B ? ($B--,
+ return 'B') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D ? (
+$D--, return 'D') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D + $
+H ? ($H--, return 'H') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D + $
+H + $V ? ($V--, return 'V') :
return 'N';
}
sub output
{
for my $count ( @sizes )
{
# print ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
print OUT ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
}
@sizes = ();
}
while( <IN> )
{
if( /^>/ )
{
$start //= s/\D+//gr;
}
elsif( /^[ACGTRYSWKMBDHVN]/ )
{
$A += tr/A//;
$C += tr/C//;
$G += tr/G//;
$T += tr/T//;
$R += tr/R//;
$Y += tr/Y//;
$S += tr/S//;
$W += tr/W//;
$K += tr/K//;
$M += tr/M//;
$B += tr/B//;
$D += tr/D//;
$H += tr/H//;
$V += tr/V//;
$N += tr/N//;
$all += $tmp = tr/ACGTRYSWKMBDHVN//;
push @sizes, $tmp;
$all >= $window and output();
}
}
$all and output();
close IN;
close OUT;