Re^2: storing hash in temporary files to save memory usage

Update 1: Added B+ tree results for DB_File, BerkeleyDB, and TokyoCabinet.
Update 2: Added results for in-memory consumption and hash databases.
Update 3: See Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library).
Update 4: Resolved issue with Kyoto Cabinet (tree) failing random fetch.

Regarding DBM files, I'm not aware of anything faster than Kyoto Cabinet, the successor of Tokyo Cabinet. Sorting isn't necessary when storing into a B+ tree database; the tree keeps the keys ordered as they are inserted. A file with the .kct extension stores its records in a B+ tree (.kch selects a hash database). Once key-value pairs are stored, sequential access is much faster than random access.
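The ordering behaviour can be seen with a minimal sketch. It uses DB_File's $DB_BTREE (one of the libraries benchmarked here) rather than Kyoto Cabinet, purely as an assumption of convenience: passing undef as the filename keeps the tree in memory, and the seq method walks it sequentially. Kyoto Cabinet has its own cursor API, which is not shown. The four keys are arbitrary illustration data.

```perl
use strict;
use warnings;

use DB_File;

# undef filename => in-memory database, nothing is written to /tmp
my $ob = tie my %tree, 'DB_File', undef, O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "open error: $!";

# Store keys in arbitrary order; no pre-sorting is done.
$tree{$_} = "$_ some string..." for qw( 33 2 10 1 );

# Walk the tree sequentially, starting from the first record.
my @seen;
my ($k, $v) = ('', '');
for ( my $st = $ob->seq($k, $v, R_FIRST);
         $st == 0;
         $st = $ob->seq($k, $v, R_NEXT) ) {
    push @seen, $k;
}

# Keys come back in the tree's (lexical) order: 1 10 2 33
print "@seen\n";

undef $ob;
untie %tree;
```

Note that the default comparison is lexical, so "10" sorts before "2". That does not matter for the benchmark above, but it does if the application expects numeric order.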

Testing was done on a late-2013 MacBook Pro (i7 Haswell @ 2.6 GHz) using Perl 5.26.0. Turbo Boost may run as high as 3.8 GHz on one core. Unfortunately, I do not have anything slower to run on. The takeaway is that Kyoto Cabinet is the fastest of the bunch for storing and, for B+ tree files, also the smallest.

  use strict;
  use warnings;

  use BerkeleyDB;
  use DB_File;
  use TokyoCabinet;
  use KyotoCabinet;

  unlink qw( /tmp/file.db /tmp/file.tch /tmp/file.kch );
  unlink qw(              /tmp/file.tcb /tmp/file.kct );

# --

# my $ob = tie my %hash, 'BerkeleyDB::Hash',
#       -Filename => '/tmp/file.db', -Flags => DB_CREATE
#    or die "open error: $!";
#
# my $ob = tie my %hash, 'BerkeleyDB::Btree',
#       -Filename => '/tmp/file.db', -Flags => DB_CREATE
#    or die "open error: $!";
#

# my $ob = tie my %hash, 'DB_File',
#       '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_HASH
#    or die "open error: $!";
#
# my $ob = tie my %hash, 'DB_File',
#       '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
#    or die "open error: $!";
#

# my $ob = tie my %hash, 'TokyoCabinet::HDB', '/tmp/file.tch',
#       TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT
#    or die "open error: $!";
#
# my $ob = tie my %hash, 'TokyoCabinet::BDB', '/tmp/file.tcb',
#       TokyoCabinet::BDB::OWRITER | TokyoCabinet::BDB::OCREAT
#    or die "open error: $!";
#

# my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kch',
#       KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
#    or die "open error: $!";
#
  my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kct',
        KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
     or die "open error: $!";

# --

# Tie interface : 23.875 seconds, 793 MiB  - BerkeleyDB::Btree
#                 19.812 seconds, 793 MiB  - DB_File $DB_BTREE
#                 19.208 seconds, 353 MiB  - TokyoCabinet *.tcb
#                 17.232 seconds, 306 MiB  - KyotoCabinet *.kct
#
# for ( 1 .. 10e6 ) {
#     $hash{$_} = "$_ some string...";
# }
#

# OO interface  : 82.573 seconds, 639 MiB  - BerkeleyDB::Hash
#                 73.383 seconds, 639 MiB  - DB_File $DB_HASH
#                 87.695 seconds, 458 MiB  - TokyoCabinet *.tch
#                 38.312 seconds, 464 MiB  - KyotoCabinet *.kch
#
#                 19.899 seconds, 793 MiB  - BerkeleyDB::Btree
#                 14.340 seconds, 793 MiB  - DB_File $DB_BTREE
#                 14.763 seconds, 353 MiB  - TokyoCabinet *.tcb
#                 10.970 seconds, 306 MiB  - KyotoCabinet *.kct
#
  for ( 1 .. 10e6 ) {
      $ob->STORE($_ => "$_ some string...");
  }

For Mac users, Kyoto Cabinet requires patching three files, found here. The MacPorts file is found here. Tokyo Cabinet builds fine without manual intervention (not shown below). The Perl driver is built last; its documentation can be found under the doc dir.

$ tar xf $HOME/Downloads/kyotocabinet-1.2.76.tar.gz
$ cd kyotocabinet-1.2.76

$ patch -p0 < $HOME/Downloads/patch-kccommon.h.diff
$ patch -p0 < $HOME/Downloads/patch-configure.diff
$ patch -p0 < $HOME/Downloads/patch-kcthread.cc

$ ./configure --disable-lzo --disable-lzma

$ make -j2
$ sudo make install

$ tar xf $HOME/Downloads/kyotocabinet-perl-1.20.tar.gz
$ cd kyotocabinet-perl-1.20

$ perl Makefile.PL
$ sudo make install

$ cd doc
$ open index.html

One may also run entirely from memory. With Kyoto Cabinet, simply replace the filename with '*' for a cache hash database or '%' for a cache tree database (Tokyo Cabinet uses '+' for its in-memory tree). The memory footprint is less than half that of Perl's native hash: in the test below, the Kyoto Cabinet in-memory tree is about 4.2x smaller than a plain hash (453 MiB versus 1911 MiB). Storing key-value pairs doesn't take longer either.

my $ob1 = tie my %h1, 'TokyoCabinet::ADB', '*';  # in-memory hash
my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';  # in-memory tree

my $ob3 = tie my %h3, 'KyotoCabinet::DB',  '*';  # in-memory hash
my $ob4 = tie my %h4, 'KyotoCabinet::DB',  '%';  # in-memory tree

use strict;
use warnings;

use Time::HiRes 'time';
use TokyoCabinet;
use KyotoCabinet;

my %hash;
my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';  # in-memory tree
my $ob4 = tie my %h4, 'KyotoCabinet::DB',  '%';  # in-memory tree
my $start = time;

# Plain hash                   1911 MiB, 10.182 seconds
# for ( 1 .. 10e6 ) {
#     $hash{$_} = "$_ some string...";
# }

# Tokyo Cabinet                 627 MiB, 10.165 seconds
# for ( 1 .. 10e6 ) {
#     $ob2->STORE($_ => "$_ some string...");
# }

# Kyoto Cabinet                 453 MiB, 10.062 seconds
  for ( 1 .. 10e6 ) {
      $ob4->STORE($_ => "$_ some string...");
  }

printf {*STDERR} "capture memory consumption in top: %0.03f\n",
    time - $start;

1 for ( 1 .. 2e8 );
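
The footprint ratios follow directly from the figures in the comments above. A short sketch of the arithmetic, using only those three measurements (the hash and variable names are illustrative):

```perl
use strict;
use warnings;

# Memory consumption (MiB) as reported in the script above.
my %mib = (
    'Plain hash'    => 1911,
    'Tokyo Cabinet' => 627,
    'Kyoto Cabinet' => 453,
);

# How many times smaller each footprint is than the plain hash.
my %ratio = map { $_ => $mib{'Plain hash'} / $mib{$_} } keys %mib;

# Largest footprint first.
for my $name ( sort { $mib{$b} <=> $mib{$a} } keys %mib ) {
    printf "%-14s %5d MiB  %.1fx\n", $name, $mib{$name}, $ratio{$name};
}
```

Rounded to one decimal, that gives 3.0x for the Tokyo Cabinet tree and 4.2x for the Kyoto Cabinet tree.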

Initially, randomly accessing an in-memory B+ tree database took so long with Kyoto Cabinet that I stopped the script after 40 seconds and compared the in-memory hash database instead. Appending the pccap=256m option resolved the issue; it raises the page cache capacity from its default to 256 MiB.

use strict;
use warnings;

use List::Util 'shuffle';
use Time::HiRes 'time';
use TokyoCabinet;
use KyotoCabinet;

srand 0;

my %hash;
my $ob2  = tie my %h2, 'TokyoCabinet::ADB', '+';             # Tree
my $ob4  = tie my %h4, 'KyotoCabinet::DB',  '%#pccap=256m';  # Tree
my $size = 5e6;
my $start;

my @keys = shuffle 1 .. $size;

# plain hash                   4.342 seconds
# for ( 1 .. $size ) {
#     $hash{$_} = "$_ some string...";
# }
# $start = time;
# for ( @keys ) {
#     my $v = $hash{$_};
# }

# TokyoCabinet                11.572 seconds '+' tree
# TokyoCabinet                 8.936 seconds '*' hash
# for ( 1 .. $size ) {
#     $ob2->STORE($_ => "$_ some string...");
# }
# $start = time;
# for ( @keys ) {
#     my $v = $ob2->FETCH($_);
# }

# KyotoCabinet                11.991 seconds '%' tree
# KyotoCabinet                 6.087 seconds '*' hash
  for ( 1 .. $size ) {
      $ob4->STORE($_ => "$_ some string...");
  }
  $start = time;
  for ( @keys ) {
      my $v = $ob4->FETCH($_);
  }

printf "duration: %0.03f seconds\n", time - $start;

See this page for specific tuning parameters, particularly #pccap=256m for tree databases and #capsiz for in-memory hash databases. Likewise, #bnum tunes the number of buckets (it should be set to about twice the number of expected keys). Append options to the filename argument.

"/tmp/file.kch#bnum=5000000" # hash
"/tmp/file.kct#pccap=256m" # tree

"*#bnum=5000000#capsiz=1024m" # in-memory hash
"%#pccap=256m" # in-memory tree
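
If the options depend on the data set, the filename string can be assembled at run time. The following is only a sketch: kc_path is a hypothetical helper (not part of KyotoCabinet) that joins a base name and key/value options with '#', and the key count is an assumed figure matching the random-access test.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KyotoCabinet: builds the
# "name#key=value#key=value" form from an ordered key/value list.
sub kc_path {
    my ($base, @opts) = @_;
    my $path = $base;
    while ( my ($key, $val) = splice(@opts, 0, 2) ) {
        $path .= "#$key=$val";
    }
    return $path;
}

my $expected_keys = 5_000_000;    # assumption: size of the data set

# About twice the expected key count for the bucket number.
print kc_path('/tmp/file.kch', bnum => 2 * $expected_keys), "\n";
print kc_path('%', pccap => '256m'), "\n";
```

The resulting string is passed as the filename argument to tie, in place of the literal paths used earlier.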

What I've learned from this experience is that one must try both hash and B+ tree databases; depending on the application, one may be favored over the other.

Regards, Mario
