This is a continuation of Re: Problem in RAM usage while threading the program.
Update 1: Added parallel chunking demonstration.
Update 2: Inline::C demonstrations can be found here.
Part Three create the Kyoto Cabinet DB
This is where the fun begins. The # after the file name is how to pass options. See the perldoc for more info. Notice the split line. There is no reason to split the entire sequence header line.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_db.pl
use strict;
use warnings;
use KyotoCabinet;
unlink 'db.kch';
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OWRITER | $db->OCREATE)) {
die "open error (db): ", $db->error;
}
open my $fh, '<', 'NR.fasta' or die "open error: $!\n";
my $nrid = '';
my $seq = '';
while ( <$fh> ) {
chomp; $_ =~ s/\r$//g;
if ( /^>/ ) {
$db->set($nrid, $seq) if $seq;
$nrid = ( split /\s/, $_, 2 )[0];
$seq = '';
}
else {
$seq .= $_;
}
}
$db->set($nrid, $seq) if $seq;
close $fh;
$db->close;
Part Four search DB serial code
One might like to measure the difference between running serially and parallel. This makes use of cursor in Kyoto Cabinet. The substr is for stripping the leading comma when displaying the value to STDOUT.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
my $cur = $db->cursor;
open my $fh, '<', 'peptides.txt' or die "open error: $!\n";
while ( my $pep = <$fh> ) {
my $ids = ''; chomp $pep;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
print "$pep\t", substr($ids, 1), "\n" if $ids;
}
close $fh;
$cur->disable;
$db->close;
Part Five search DB parallel code
Here it is. This is the parallel code. It's practically the serial code above wrapped into MCE. Each worker opens a handle to the DB. This is done inside the user_begin block. Likewise, closing is done inside the user_end block. MCE::Relay is helpful for ensuring workers write serially to STDOUT. Thus, no reason to have workers send output to the manager process or store in a shared hash (i.e. MCE::Shared).
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_mce.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 1,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOC
+K)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my $pep = $_; chomp $pep;
my $ids = '';
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
# output serially, one worker at a time
MCE::relay {
print "$pep\t", substr($ids, 1), "\n" if $ids;
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
Part Six search DB parallel code - chunking
Chunking helps reduce time by another 25% ~ 30% for this use case. This is beneficial on machines with many cores.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_chunk.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 7,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOC
+K)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my ( $mce, $chunk_ref, $chunk_id ) = @_;
my %ret; chomp @$chunk_ref;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
for ( @$chunk_ref ) {
$ret{$_} .= ",$key" if index($val, $_) >= 0;
}
}
# output serially, one worker at a time
MCE::relay {
for ( @$chunk_ref ) {
print "$_\t", substr($ret{$_}, 1), "\n" if $ret{$_};
}
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
Outfile.txt
GAAGGACTGGGACCA >NR_000001,>NR_006611
AGGCTGCGGCAGGAC >NR_062102
GTGAGCCGGGCAGAG >NR_089584
AGGGGGGGTTGCTGA >NR_036454,>NR_068535,>NR_097889
CTGACATGCGGCGCA >NR_087289
GTGCATCGATGGCCG >NR_005535
GGGGTCAAGCGAACC >NR_076289,>NR_087856
TGAGACGGCGAACCT >NR_064242
AGCGACAAAGGAAAC >NR_045865
AGGTGCAACCATGGA >NR_046602,>NR_056869
GAGTAAACCGCGCGA >NR_093455
AACGACTGAACAGCG >NR_070693
ACGCGTAATTCGATA >NR_080086
GATGAGCGGAGCACT >NR_070118
CGTAGCGAAACCGAG >NR_092384,>NR_098291
GGGGGGGAGGTCCGA >NR_021671,>NR_036907,>NR_080961
AGGGAGGGGGGTTGT >NR_026207
ATGGGGCAGACGCGA >NR_072314
Regards, Mario
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.