http://qs321.pair.com?node_id=11109502


in reply to Speed up file write taking weeks

Update 1: Added install instructions
Update 2: Updated stage_a.pl, stage_b.pl

Hi Sanjay,

For something like this, one might want to work in stages. But first, let me introduce you to a fast key-value database named Kyoto Cabinet. See more info here, on PerlMonks.

Installation

Kyoto Cabinet site, source, and Perl module.
https://fallabs.com/kyotocabinet/
https://fallabs.com/kyotocabinet/pkg/
https://fallabs.com/kyotocabinet/perlpkg/
https://fallabs.com/kyotocabinet/perldoc/

MacPorts patch files.
http://ftp.mirrorservice.org/sites/rsync.macports.org/release/ports/databases/kyotocabinet/files/

More info can be found here (i.e. tuning parameters after the file name).
https://github.com/alticelabs/kyoto
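As a hedged illustration of those tuning parameters (the parameter names come from the Kyoto Cabinet documentation; the values below are examples only, not recommendations), options are appended to the database file name after a `#`:

```perl
# Illustrative database-name strings with tuning parameters.
# .kct selects the B+ tree database; .kch selects the hash database.
my $name_tree = 'input_a.kct#pccap=256m';               # pccap: page-cache capacity
my $name_hash = 'input_a.kch#bnum=20000000#msiz=256m';  # bnum: bucket count, msiz: mmap size
```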

$ tar xf $HOME/Downloads/kyotocabinet-1.2.77.tar.gz
$ cd kyotocabinet-1.2.77
$ patch -p0 < $HOME/Downloads/patch-configure.diff
$ patch -p0 < $HOME/Downloads/patch-kcthread.cc
$ patch -p0 < $HOME/Downloads/patch-kccommon.h.diff
$ ./configure --disable-lzo --disable-lzma
$ make -j2
$ sudo make install
$ cd ..
$ tar xf $HOME/Downloads/kyotocabinet-perl-1.20.tar.gz
$ cd kyotocabinet-perl-1.20
$ perl Makefile.PL
$ make
$ make test
$ sudo make install
$ cd doc
$ # open index.html in a web browser
$ cd ../..

Demonstration

I created approximately 10 and 17 million records for input_a.txt and input_b.txt, respectively, using the following scripts. The input files remain sorted.

use strict;
use warnings;

open my $fh_a, '>', 'input_a.txt' or die "open error: $!";

foreach my $i ( 1 .. 333_333 * 10 ) {
    my $prefix = sprintf('%09d', $i);
    print $fh_a "${prefix}D x\n";
    print $fh_a "${prefix}D y\n";
    print $fh_a "${prefix}E t\n";
}

close $fh_a;
use strict;
use warnings;

open my $fh_b, '>', 'input_b.txt' or die "open error: $!";

foreach my $i ( 1 .. 333_333 * 10 ) {
    my $prefix = sprintf('%09d', $i);
    print $fh_b "${prefix}D m\n";
    print $fh_b "${prefix}D n\n";
    print $fh_b "${prefix}D o\n";
    print $fh_b "${prefix}E m\n";
    print $fh_b "${prefix}E s\n";
}

close $fh_b;

Afterwards, run stage_a.pl to create the Kyoto Cabinet database. This is quite fast, processing the 10 million input lines in little time.

use strict;
use warnings;

use KyotoCabinet;

# construct the database object
my $db = KyotoCabinet::DB->new();

# open the database
if (! $db->open('input_a.kct#pccap=256m', $db->OWRITER | $db->OCREATE)) {
    die "open error (db): ", $db->error;
}

# open input_file a
open my $fh_a, '<', 'input_a.txt' or die "open error (txt): $!";

my $key_a = '';
my $val_a = '';

sub db_store {
    # store key-value pair
    if (! $db->set($key_a, $val_a)) {
        warn "db set error: ", $db->error, "\n";
    }
}

while ( my $line = <$fh_a> ) {
    chomp $line;
    my ($key, $val) = split(' ', $line, 2);
    if ($key ne $key_a) {
        db_store() if $key_a;
        $key_a = $key;
        $val_a = '';
    }
    $val_a .= ($val_a) ? " $val" : $val;
}

db_store() if $key_a;

close $fh_a;
$db->close;

Now the fun part: stage_b.pl (note: direct the output to a file). This too is fast and completes in less than a minute on my laptop.

use strict;
use warnings;

use KyotoCabinet;

# construct the database object
my $db = KyotoCabinet::DB->new();

# open the database
if (! $db->open('input_a.kct#pccap=256m', $db->OREADER)) {
    die "open error (db): ", $db->error;
}

# open input_file b
open my $fh_b, '<', 'input_b.txt' or die "open error (txt): $!";

my $key_b = '';
my @val_b;

sub output_key {
    # emit the cross product of values from a and b for this key
    my $vals_a = $db->get($key_b);
    return unless defined $vals_a;   # key absent from input_a
    foreach my $val_a ( split(' ', $vals_a) ) {
        foreach my $val_b ( @val_b ) {
            print "$key_b $val_a $val_b\n";
        }
    }
}

while ( my $line = <$fh_b> ) {
    chomp $line;
    my ($key, $val) = split(' ', $line, 2);
    if ($key ne $key_b) {
        output_key() if $key_b;
        $key_b = $key;
        @val_b = ();
    }
    push @val_b, $val;
}

output_key() if $key_b;

close $fh_b;
$db->close;
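The two stages boil down to a grouped build plus a merge over sorted keys. Here is a minimal self-contained sketch of the same logic using a plain Perl hash instead of Kyoto Cabinet (in-memory only, so it illustrates the algorithm on toy data but would not scale to the files above; the three-line inputs are made up for the demonstration):

```perl
use strict;
use warnings;

# Stage A equivalent: group values from side "a" by key, space-separated.
my %a;
for my $line ("001D x", "001D y", "001E t") {
    my ($key, $val) = split ' ', $line, 2;
    $a{$key} = defined $a{$key} ? "$a{$key} $val" : $val;
}

# Stage B equivalent: walk side "b" in key order, emit the cross product.
my @out;
my ($key_b, @val_b) = ('');
my $flush = sub {
    return unless $key_b && defined $a{$key_b};
    for my $va ( split ' ', $a{$key_b} ) {
        push @out, "$key_b $va $_" for @val_b;
    }
};

for my $line ("001D m", "001D n", "001E s") {
    my ($key, $val) = split ' ', $line, 2;
    if ($key ne $key_b) {
        $flush->();
        ($key_b, @val_b) = ($key);
    }
    push @val_b, $val;
}
$flush->();

print "$_\n" for @out;   # 001D x m, 001D x n, 001D y m, 001D y n, 001E t s
```

The same single-pass pattern works with any on-disk key-value store swapped in for %a once the data no longer fits in memory.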

Files and Output

Here is the ls -lh and wc -l output. There are some big files.

$ ls -lh
total 1705800
-rw-r--r--  1 mario  staff    96M Dec  1 01:17 input_a.kct
-rw-r--r--  1 mario  staff   281B Dec  1 01:15 input_a.pl
-rw-r--r--  1 mario  staff   124M Dec  1 01:15 input_a.txt
-rw-r--r--  1 mario  staff   349B Dec  1 01:15 input_b.pl
-rw-r--r--  1 mario  staff   207M Dec  1 01:16 input_b.txt
-rw-r--r--  1 mario  staff   381M Dec  1 01:18 output.txt
-rw-r--r--  1 mario  staff   797B Dec  1 01:12 stage_a.pl
-rw-r--r--  1 mario  staff   824B Dec  1 01:10 stage_b.pl

$ wc -l *.txt
  9999990 input_a.txt
 16666650 input_b.txt
 26666640 output.txt
 53333280 total
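The line counts agree with the generators: each of the 3,333,330 prefixes contributes 3 lines to input_a.txt, 5 lines to input_b.txt, and 2 x 3 + 1 x 2 = 8 cross-product lines to output.txt. A quick arithmetic check derived from the scripts above:

```perl
use strict;
use warnings;

my $prefixes  = 333_333 * 10;                 # 3_333_330 distinct prefixes
my $lines_a   = $prefixes * 3;                # x, y under D plus t under E
my $lines_b   = $prefixes * 5;                # m, n, o under D plus m, s under E
my $lines_out = $prefixes * (2 * 3 + 1 * 2);  # D: 2 x 3 pairs, E: 1 x 2 pairs
printf "%d %d %d\n", $lines_a, $lines_b, $lines_out;
# 9999990 16666650 26666640
```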

Finally, the head of output.txt.

000000001D x m
000000001D x n
000000001D x o
000000001D y m
000000001D y n
000000001D y o
000000001E t m
000000001E t s
000000002D x m
000000002D x n
000000002D x o
000000002D y m
000000002D y n
000000002D y o
000000002E t m
000000002E t s
000000003D x m
000000003D x n
000000003D x o
000000003D y m
000000003D y n
000000003D y o
000000003E t m
000000003E t s
...

Well, I gave this a try and had to look up Kyoto Cabinet. I'm on a Mac, so I applied the three MacPorts patches. One may choose another key-value database for stage_a. I still do not understand how 1.7 trillion records is computed in your case.
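For instance, stage_a could target any tied DBM. A minimal sketch using the core SDBM_File module (toy data made up for illustration; note that SDBM limits each key+value pair to roughly 1 KB, so it only suits short value lists, unlike Kyoto Cabinet):

```perl
use strict;
use warnings;

use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Tie a hash to an on-disk SDBM database in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
tie my %db, 'SDBM_File', "$dir/input_a", O_RDWR | O_CREAT, 0644
    or die "tie error: $!";

# Same accumulation as stage_a.pl: append values per key.
for my $line ("001D x", "001D y", "001E t") {
    my ($key, $val) = split ' ', $line, 2;
    $db{$key} = defined $db{$key} ? "$db{$key} $val" : $val;
}

print "$_ => $db{$_}\n" for sort keys %db;   # 001D => x y, 001E => t

my %copy = map { $_ => $db{$_} } keys %db;   # snapshot before untying
untie %db;
```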

Regards, Mario