Re^2: storing hash in temporary files to save memory usage

by marioroy (Prior)
on Sep 02, 2017 at 11:22 UTC


in reply to Re: storing hash in temporary files to save memory usage
in thread storing hash in temporary files to save memory usage

Update 1: Added B+ tree results for DB_File, BerkeleyDB, and TokyoCabinet.
Update 2: Added results for in-memory consumption and hash databases.
Update 3: See Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library).
Update 4: Resolved issue with Kyoto Cabinet (tree) failing random fetch.

Regarding DBM files, I'm not aware of anything faster than Kyoto Cabinet, the successor to Tokyo Cabinet. Sorting isn't necessary when storing into a B+ tree database, since the tree keeps its records in key order. A file with the .kct extension is organized as a B+ tree database. Once key-value pairs are stored, sequential access is much faster than random access.
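To illustrate why pre-sorting is unnecessary, here is a minimal sketch using DB_File's B+ tree type as a stand-in, since it ships with many Perl installations. It assumes the default lexical key comparison; I would expect a Kyoto Cabinet tree database to behave similarly, but that is an assumption, not something measured in this post.

```perl
use strict;
use warnings;
use DB_File;

# In-memory B+ tree (filename is undef); the default comparison is lexical.
tie my %tree, 'DB_File', undef, O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "open error: $!";

# Insert in arbitrary order; the tree keeps the keys ordered for us.
$tree{$_} = "$_ some string..." for qw( 30 4 100 25 7 );

# Iteration walks the tree sequentially, in key order.
my @ordered = keys %tree;
print "@ordered\n";
```

Note that the order is lexical, not numeric, so "100" comes before "25". Keys that must iterate numerically would need zero-padding or a custom comparison function.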

Testing was done on a late-2013 MacBook Pro (i7 Haswell @ 2.6 GHz) using Perl 5.26.0. CPU Turbo Boost may run as high as 3.8 GHz on one core. Unfortunately, I do not have anything slower to run on. The takeaway is that Kyoto Cabinet is the fastest of the bunch and, for B+ tree databases, also the smallest. Among the hash databases, Tokyo Cabinet is marginally smaller (458 MiB vs. 464 MiB).

    use strict;
    use warnings;

    use BerkeleyDB;
    use DB_File;
    use TokyoCabinet;
    use KyotoCabinet;

    unlink qw( /tmp/file.db /tmp/file.tch /tmp/file.kch );
    unlink qw( /tmp/file.tcb /tmp/file.kct );

    # --

    # my $ob = tie my %hash, 'BerkeleyDB::Hash',
    #     -Filename => '/tmp/file.db', -Flags => DB_CREATE
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'BerkeleyDB::Btree',
    #     -Filename => '/tmp/file.db', -Flags => DB_CREATE
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'DB_File',
    #     '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'DB_File',
    #     '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'TokyoCabinet::HDB', '/tmp/file.tch',
    #     TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'TokyoCabinet::BDB', '/tmp/file.tcb',
    #     TokyoCabinet::BDB::OWRITER | TokyoCabinet::BDB::OCREAT
    #     or die "open error: $!";
    #
    # my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kch',
    #     KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
    #     or die "open error: $!";
    #
    my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kct',
        KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
        or die "open error: $!";

    # --

    # Tie interface : 23.875 seconds, 793 MiB - BerkeleyDB::Btree
    #                 19.812 seconds, 793 MiB - DB_File $DB_BTREE
    #                 19.208 seconds, 353 MiB - TokyoCabinet *.tcb
    #                 17.232 seconds, 306 MiB - KyotoCabinet *.kct
    #
    # for ( 1 .. 10e6 ) {
    #     $hash{$_} = "$_ some string...";
    # }
    #
    # OO interface  : 82.573 seconds, 639 MiB - BerkeleyDB::Hash
    #                 73.383 seconds, 639 MiB - DB_File $DB_HASH
    #                 87.695 seconds, 458 MiB - TokyoCabinet *.tch
    #                 38.312 seconds, 464 MiB - KyotoCabinet *.kch
    #
    #                 19.899 seconds, 793 MiB - BerkeleyDB::Btree
    #                 14.340 seconds, 793 MiB - DB_File $DB_BTREE
    #                 14.763 seconds, 353 MiB - TokyoCabinet *.tcb
    #                 10.970 seconds, 306 MiB - KyotoCabinet *.kct

    for ( 1 .. 10e6 ) {
        $ob->STORE($_ => "$_ some string...");
    }

For Mac users, Kyoto Cabinet requires patching 3 files, found here. The macports file is found here. Tokyo Cabinet builds fine without manual intervention (not shown below). Finally, install the Perl driver; its documentation can be found under the doc dir.

    $ tar xf $HOME/Downloads/kyotocabinet-1.2.76.tar.gz
    $ cd kyotocabinet-1.2.76
    $ patch -p0 < $HOME/Downloads/patch-kccommon.h.diff
    $ patch -p0 < $HOME/Downloads/patch-configure.diff
    $ patch -p0 < $HOME/Downloads/patch-kcthread.cc
    $ ./configure --disable-lzo --disable-lzma
    $ make -j2
    $ sudo make install

    $ tar xf $HOME/Downloads/kyotocabinet-perl-1.20.tar.gz
    $ cd kyotocabinet-perl-1.20
    $ perl Makefile.PL
    $ sudo make install
    $ cd doc
    $ open index.html

One may run entirely from memory. With Kyoto Cabinet, simply replace the filename with '*' for a cache hash database or '%' for a cache tree database; Tokyo Cabinet's abstract database (ADB) uses '*' and '+' respectively. For the in-memory tree databases, the memory footprint is well under half that of Perl's native hash: about 3x smaller for Tokyo Cabinet (627 MiB vs. 1911 MiB) and 4.2x smaller for Kyoto Cabinet (453 MiB). Storing key-value pairs doesn't take longer either.

    my $ob1 = tie my %h1, 'TokyoCabinet::ADB', '*';   # in-memory hash
    my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';   # in-memory tree
    my $ob3 = tie my %h3, 'KyotoCabinet::DB', '*';    # in-memory hash
    my $ob4 = tie my %h4, 'KyotoCabinet::DB', '%';    # in-memory tree

    use strict;
    use warnings;

    use Time::HiRes 'time';
    use TokyoCabinet;
    use KyotoCabinet;

    my %hash;
    my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';   # in-memory tree
    my $ob4 = tie my %h4, 'KyotoCabinet::DB', '%';    # in-memory tree

    my $start = time;

    # Plain hash 1911 MiB, 10.182 seconds
    # for ( 1 .. 10e6 ) {
    #     $hash{$_} = "$_ some string...";
    # }

    # Tokyo Cabinet 627 MiB, 10.165 seconds
    # for ( 1 .. 10e6 ) {
    #     $ob2->STORE($_ => "$_ some string...");
    # }

    # Kyoto Cabinet 453 MiB, 10.062 seconds
    for ( 1 .. 10e6 ) {
        $ob4->STORE($_ => "$_ some string...");
    }

    printf {*STDERR} "capture memory consumption in top: %0.03f\n", time - $start;

    1 for ( 1 .. 2e8 );
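The script above spins in a busy loop so that memory use can be read off in top. As an alternative, here is a rough sketch that asks ps for the resident set size of the current process. The helper name rss_mib is hypothetical, and it assumes a ps that accepts `-o rss= -p PID` and reports KiB, as the stock ps on macOS and Linux does. It uses a plain hash at a much smaller scale than the test above, so the numbers are not comparable to the ones reported.

```perl
use strict;
use warnings;

# Hypothetical helper: resident set size of this process in MiB,
# or undef if `ps` is unavailable or prints something unexpected.
sub rss_mib {
    my $out = `ps -o rss= -p $$ 2>/dev/null`;
    return unless defined $out && $out =~ /(\d+)/;
    return $1 / 1024;    # ps reports RSS in KiB
}

my $before = rss_mib();

# Small-scale stand-in for the store loop.
my %hash;
$hash{$_} = "$_ some string..." for 1 .. 100_000;

my $after = rss_mib();

if ( defined $before && defined $after ) {
    printf {*STDERR} "RSS grew by about %.1f MiB\n", $after - $before;
}
```

RSS is only an approximation of what a data structure costs, since the allocator may hold on to freed pages, but it is good enough for comparing footprints that differ by hundreds of MiB.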

At first, random access to an in-memory B+ tree database took so long with Kyoto Cabinet that I stopped the script after 40 seconds and compared the in-memory hash databases instead. Appending the pccap=256m option resolved the issue: it raises the page cache capacity from its default to 256 MiB.

    use strict;
    use warnings;

    use List::Util 'shuffle';
    use Time::HiRes 'time';
    use TokyoCabinet;
    use KyotoCabinet;

    srand 0;

    my %hash;
    my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';             # Tree
    my $ob4 = tie my %h4, 'KyotoCabinet::DB', '%#pccap=256m';   # Tree

    my $size = 5e6;
    my $start;
    my @keys = shuffle 1 .. $size;

    # plain hash 4.342 seconds
    # for ( 1 .. $size ) {
    #     $hash{$_} = "$_ some string...";
    # }
    # $start = time;
    # for ( @keys ) {
    #     my $v = $hash{$_};
    # }

    # TokyoCabinet 11.572 seconds '+' tree
    # TokyoCabinet  8.936 seconds '*' hash
    # for ( 1 .. $size ) {
    #     $ob2->STORE($_ => "$_ some string...");
    # }
    # $start = time;
    # for ( @keys ) {
    #     my $v = $ob2->FETCH($_);
    # }

    # KyotoCabinet 11.991 seconds '%' tree
    # KyotoCabinet  6.087 seconds '*' hash
    for ( 1 .. $size ) {
        $ob4->STORE($_ => "$_ some string...");
    }
    $start = time;
    for ( @keys ) {
        my $v = $ob4->FETCH($_);
    }

    printf "duration: %0.03f seconds\n", time - $start;

See this page for specific tuning parameters, particularly #pccap (e.g. #pccap=256m) for tree databases and #capsiz for in-memory hash databases. Likewise, the #bnum option tunes the number of buckets and should be set to about twice the number of expected keys. Append options to the filename argument.

    "/tmp/file.kch#bnum=5000000"     # hash
    "/tmp/file.kct#pccap=256m"       # tree
    "*#bnum=5000000#capsiz=1024m"    # in-memory hash
    "%#pccap=256m"                   # in-memory tree

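Since the options are just "#name=value" suffixes on the filename, they are easy to build programmatically. The sketch below uses a hypothetical helper, kc_path, that is not part of the Kyoto Cabinet API. It only assembles the string, sizing bnum at twice the expected key count as suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper: append "#name=value" tuning options to a base path.
# Options are passed as a flat name => value list so their order is kept.
sub kc_path {
    my ( $base, @opts ) = @_;
    my $path = $base;
    while ( my ( $name, $value ) = splice @opts, 0, 2 ) {
        $path .= "#$name=$value";
    }
    return $path;
}

my $expected_keys = 5_000_000;

# On-disk hash database with about two buckets per expected key.
my $hash_path = kc_path( '/tmp/file.kch', bnum => 2 * $expected_keys );

# In-memory tree database with a 256 MiB page cache.
my $tree_path = kc_path( '%', pccap => '256m' );

print "$hash_path\n$tree_path\n";
```

The resulting strings can be passed as the filename argument to tie, exactly like the hand-written ones.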
What I've learned from this experience is that one must try both hash and B+ tree databases. Depending on the application, one may be favored over the other.
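A small harness makes that comparison easy to repeat. The sketch below is an assumption-laden stand-in rather than the script behind the numbers in this post: it uses DB_File for both flavors because it is commonly bundled with Perl, the helper name bench is hypothetical, and the key count is kept small so it finishes quickly. Swapping in the Tokyo Cabinet or Kyoto Cabinet tie calls shown earlier should follow the same pattern.

```perl
use strict;
use warnings;
use DB_File;
use File::Temp 'tempdir';
use List::Util 'shuffle';
use Time::HiRes 'time';

# Hypothetical helper: time sequential stores and random fetches against
# one DB_File flavor, then report the size of the file on disk.
sub bench {
    my ( $dbtype, $n ) = @_;
    my $dir  = tempdir( CLEANUP => 1 );
    my $file = "$dir/file.db";

    my $ob = tie my %h, 'DB_File', $file, O_RDWR|O_CREAT, 0644, $dbtype
        or die "open error: $!";

    my $t0 = time;
    $ob->STORE( $_ => "$_ some string..." ) for 1 .. $n;
    my $store = time - $t0;

    my @keys = shuffle 1 .. $n;
    my $hits = 0;
    $t0 = time;
    for (@keys) { $hits++ if defined $ob->FETCH($_) }
    my $fetch = time - $t0;

    undef $ob;
    untie %h;    # flush to disk before measuring the file size

    return { store => $store, fetch => $fetch, hits => $hits, bytes => -s $file };
}

for my $case ( [ hash => $DB_HASH ], [ btree => $DB_BTREE ] ) {
    my ( $name, $dbtype ) = @$case;
    my $r = bench( $dbtype, 20_000 );
    printf "%-5s store %.3fs, random fetch %.3fs, %.1f MiB\n",
        $name, $r->{store}, $r->{fetch}, $r->{bytes} / 2**20;
}
```

The timings depend heavily on key count, value size, and cache settings, so it is worth running with data shaped like the real application before picking one.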

Regards, Mario

Re^3: storing hash in temporary files to save memory usage
by Laurent_R (Canon) on Sep 02, 2017 at 12:11 UTC
    Thanks a lot, Mario, for this very interesting information. I'll give it a try.
