note
marioroy
<p>This is a continuation of [id://11110418].</p>
<p><b>Update 1:</b> Added parallel chunking demonstration.<br>
<b>Update 2:</b> Inline::C demonstrations can be found [id://11110528|here].</p>
<p><b>Part Three</b> create the Kyoto Cabinet DB</p>
<p>This is where the fun begins. The # after the file name is how to pass options. See the [https://fallabs.com/kyotocabinet/perldoc/|perldoc] for more info. Notice the split line. There is no reason to split the entire sequence header line.</p>
<code>
# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_db.pl
use strict;
use warnings;
use KyotoCabinet;
unlink 'db.kch';
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OWRITER | $db->OCREATE)) {
die "open error (db): ", $db->error;
}
open my $fh, '<', 'NR.fasta' or die "open error: $!\n";
my $nrid = '';
my $seq = '';
while ( <$fh> ) {
chomp; $_ =~ s/\r$//g;
if ( /^>/ ) {
$db->set($nrid, $seq) if $seq;
$nrid = ( split /\s/, $_, 2 )[0];
$seq = '';
}
else {
$seq .= $_;
}
}
$db->set($nrid, $seq) if $seq;
close $fh;
$db->close;
</code>
<p><b>Part Four</b> search DB serial code</p>
<p>One might like to measure the difference between running serially and parallel. This makes use of cursor in Kyoto Cabinet. The substr is for stripping the leading comma when displaying the value to STDOUT.</p>
<code>
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
my $cur = $db->cursor;
open my $fh, '<', 'peptides.txt' or die "open error: $!\n";
while ( my $pep = <$fh> ) {
my $ids = ''; chomp $pep;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
print "$pep\t", substr($ids, 1), "\n" if $ids;
}
close $fh;
$cur->disable;
$db->close;
</code>
<p><b>Part Five</b> search DB parallel code</p>
<p>Here it is. This is the parallel code. It's practically the serial code above wrapped into MCE. Each worker opens a handle to the DB. This is done inside the user_begin block. Likewise, closing is done inside the user_end block. MCE::Relay is helpful for ensuring workers write serially to STDOUT. Thus, no reason to have workers send output to the manager process or store in a shared hash (i.e. MCE::Shared).</p>
<code>
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_mce.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 1,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my $pep = $_; chomp $pep;
my $ids = '';
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
# output serially, one worker at a time
MCE::relay {
print "$pep\t", substr($ids, 1), "\n" if $ids;
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
</code>
<p><b>Part Six</b> search DB parallel code - chunking</p>
<p>Chunking helps reduce time by another 25% ~ 30% for this use case. This is beneficial on machines with many cores.</p>
<code>
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_chunk.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 7,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my ( $mce, $chunk_ref, $chunk_id ) = @_;
my %ret; chomp @$chunk_ref;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
for ( @$chunk_ref ) {
$ret{$_} .= ",$key" if index($val, $_) >= 0;
}
}
# output serially, one worker at a time
MCE::relay {
for ( @$chunk_ref ) {
print "$_\t", substr($ret{$_}, 1), "\n" if $ret{$_};
}
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
</code>
<p><b>Outfile.txt</b></p>
<code>
GAAGGACTGGGACCA >NR_000001,>NR_006611
AGGCTGCGGCAGGAC >NR_062102
GTGAGCCGGGCAGAG >NR_089584
AGGGGGGGTTGCTGA >NR_036454,>NR_068535,>NR_097889
CTGACATGCGGCGCA >NR_087289
GTGCATCGATGGCCG >NR_005535
GGGGTCAAGCGAACC >NR_076289,>NR_087856
TGAGACGGCGAACCT >NR_064242
AGCGACAAAGGAAAC >NR_045865
AGGTGCAACCATGGA >NR_046602,>NR_056869
GAGTAAACCGCGCGA >NR_093455
AACGACTGAACAGCG >NR_070693
ACGCGTAATTCGATA >NR_080086
GATGAGCGGAGCACT >NR_070118
CGTAGCGAAACCGAG >NR_092384,>NR_098291
GGGGGGGAGGTCCGA >NR_021671,>NR_036907,>NR_080961
AGGGAGGGGGGTTGT >NR_026207
ATGGGGCAGACGCGA >NR_072314
</code>
<p>Regards, Mario</p>
11110379
11110418