Dear BrowserUk,
Your magical approach is roughly 660x faster than mine in CPU time (user + system), and about 60x faster in elapsed time!
My approach:
191.52user 153.47system 5:49.44elapsed 98%CPU
BrowserUk's Perl approach (modified to handle FASTA):
open SEQ, '<', $ARGV[0] or die $!;   # plain
open MASK, '<', $ARGV[1] or die $!;  # hardmask
while ( my $seq = <SEQ> ) {          ## Read a sequence
    my $mask = <MASK>;               ## And the corresponding mask
    if ( $mask =~ /^>/ ) {
        print "$seq";
    }
    else {
        $mask =~ tr[N][ ];           ## Ns => spaces
        print $seq | $mask;          ## bitwise-OR them and print the result
    }
}
close SEQ;
close MASK;
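For anyone wondering why the OR works: a space is 0x20, which is exactly the bit that separates upper case from lower case in ASCII, so OR-ing a letter with a space lowercases it, while OR-ing a letter with itself leaves it alone. Here is a minimal sketch of that single step, assuming both files have identical line layout and upper-case residues. The helper name soft_mask_line and the sample data are made up for illustration and are not part of BrowserUk's code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): soft-mask one line of
# plain sequence using the matching line from the hard-masked file.
sub soft_mask_line {
    my ( $seq, $mask ) = @_;
    ( my $m = $mask ) =~ tr/N/ /;    # Ns => spaces (0x20), on a copy
    return $seq | $m;                # string bitwise-OR, byte by byte
}

# Made-up example data: positions 3 to 6 are hard-masked.
my $seq  = "ACGTACGTAC\n";
my $mask = "ACNNNNGTAC\n";

print soft_mask_line( $seq, $mask );    # prints ACgtacGTAC
```

Note that this relies on both operands being strings. Under "use feature 'bitwise'" (or a "use v5.28" or later declaration) the plain | operator is numeric only, and the string form is spelled |. instead.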
Takes only this much time:
0.45user 0.07system 0:05.59elapsed 9%CPU
This was tested on a 12 MB dataset; the sequence lengths break down
as follows:
#Seq_name #Seq_len
2-micron 6318
MT 85779
I 230208
VI 270148
III 316617
IX 439885
VIII 562643
V 576869
XI 666454
X 745745
XIV 784333
II 813178
XIII 924429
XVI 948062
XII 1078175
VII 1090946
XV 1091289
IV 1531918
Update: These test datasets (full and hardmasked) can be downloaded
here.
---
neversaint and everlastingly indebted.......