No leading whitespaces anywhere in input file. I just opened it with a text editor and checked
But I just remembered that real life DNA sequences also often have N in addition to A/T/G/C. these have to be accounted for as well, right?
However, in my input sequences, having Ns versus NOT having them did not appear to make a difference in my test runs using the following input example
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>Chr2
TATGACGTTTAGGGACGATCTTAATGACGTTTAGGGTTTTATCGATCAGCGACGTAGGGA
>Chr3
GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTT
>Chr4
AACAAGGTACTCTCATCTCTTTACTGGGTAAATAACATATCAACTTGGACCTCATTCATA
>Chr5
AACATGATTCACACCTTGATGATGTTTTTAGAGAGTTCTCGTGTGAGGCGATTCTTGAGG
It'd be strange if it worked for you, because it is not working for me! Output file is always empty - whether it be with large file size, real-life input sequences, or small example ones like above, Or with or without N characters, in addition to the A, C, G, T
For example above, I even changed window length to 10
I'm not sure what is happening! Perhaps you can share the code that works with this example (script version using in and out file handles)
Sorry for the bother, but thanks
| [reply] [d/l] |
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1228191
use strict;
use warnings;
my $window = 1e6;
my $A = my $C = my $G = my $all = 0;
my (@sizes, $tmp, $start);
my $inputfile = shift // 'd.1228191';
my $outputfile = shift // 'd.out.1228191';
open my $in, '<', $inputfile or die "$! opening $inputfile";
open my $out, '>', $outputfile or die "$! opening $outputfile";
sub letter
{
my $n = int rand $all--;
$n < $A ? ($A--, return 'A') :
$n < $A + $C ? ($C--, return 'C') :
$n < $A + $C + $G ? ($G--, return 'G') :
return 'T';
}
sub output
{
for my $count ( @sizes )
{
print $out ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n"
+;
}
@sizes = ();
}
while( <$in> )
{
if( /^>/ )
{
$start //= s/\D+//gr;
}
elsif( /^[ACGT]/ )
{
$A += tr/A//;
$C += tr/C//;
$G += tr/G//;
$all += $tmp = tr/ACGT//;
push @sizes, $tmp;
$all >= $window and output();
}
}
$all and output();
close $in;
close $out;
| [reply] [d/l] |
Oh no, sorry! That was a "duh" moment from me! :)
How should the script be modified to use the original IDs?
I did add a few lines to your script, to reflect the possibility of Ns in the input, and those lines seem to work. I also just discovered that there appear to be some OTHER characters apart from A/C/G/T/N. The OTHERS could be one of R/Y/S/W/K/M/B/D/H/V. What is your advice for how to account for these? I hard coded those changes in a modified version of your script, as shown below, does it look alright?
Importantly, what is the part of your script that generates randomness in the sequence? Is it the map(letter(), 1 .. $count)?
I ask because of 2 reasons:
1. The goal is to randomly shuffle DNA, which is the obvious reason.
2. So that I can be sure that any two outputs with the same input are going to be different, due to this randomness
Many thanks!
#!/usr/bin/perl
#tybalt89_DNAfreq_Random_Generator.pl
# https://perlmonks.org/?node_id=1228191
use strict;
use warnings;
my $window = 1e6;
my $A = my $C = my $G = my $T = my $N = my $all = 0;
my $R=0; my $Y=0; my $S=0; my $W=0; my $K=0; my $M=0; my $B=0; my $D=0
+; my $H=0; my $V=0;
my (@sizes, $tmp, $start);
my $in = shift @ARGV;
my $out = shift @ARGV;
# print $in, "\n";
# print $out, "\n";
open IN, '<', $in or die "$! opening $in";
open OUT, '>', $out or die "$! opening $out";
sub letter
{
my $n = int rand $all--;
$n < $A ? ($A--, return 'A') :
$n < $A + $C ? ($C--, return 'C') :
$n < $A + $C + $G ? ($G--, return 'G') :
$n < $A + $C + $G + $T ? ($T--, return 'T') :
$n < $A + $C + $G + $T + $R ? ($R--, return 'R') :
$n < $A + $C + $G + $T + $R + $Y ? ($Y--, return 'Y') :
$n < $A + $C + $G + $T + $R + $Y + $S ? ($S--, return 'S') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W ? ($W--, return 'W') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K ? ($K--, return 'K
+') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M ? ($M--, retu
+rn 'M') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B ? ($B--,
+ return 'B') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D ? (
+$D--, return 'D') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D + $
+H ? ($H--, return 'H') :
$n < $A + $C + $G + $T + $R + $Y + $S + $W + $K + $M + $B + $D + $
+H + $V ? ($V--, return 'V') :
return 'N';
}
sub output
{
for my $count ( @sizes )
{
# print ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
print OUT ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
}
@sizes = ();
}
while( <IN> )
{
if( /^>/ )
{
$start //= s/\D+//gr;
}
elsif( /^[ACGTRYSWKMBDHVN]/ )
{
$A += tr/A//;
$C += tr/C//;
$G += tr/G//;
$T += tr/T//;
$R += tr/R//;
$Y += tr/Y//;
$S += tr/S//;
$W += tr/W//;
$K += tr/K//;
$M += tr/M//;
$B += tr/B//;
$D += tr/D//;
$H += tr/H//;
$V += tr/V//;
$N += tr/N//;
$all += $tmp = tr/ACGTRYSWKMBDHVN//;
push @sizes, $tmp;
$all >= $window and output();
}
}
$all and output();
close IN;
close OUT;
| [reply] [d/l] |