Speed up file write taking weeks

by Sanjay (Sexton)
on Jun 29, 2019 at 14:21 UTC ( [id://11102118] )

Sanjay has asked for the wisdom of the Perl Monks concerning the following question:

We are matching two flat files and making a join, e.g.

Input_File_A has two fields sorted on PK_A_Field:

    PK_A_Field   Data_A_Field
    D            x
    D            y
    E            t

Input_File_B has two fields sorted on PK_B_Field:

    PK_B_Field   Data_B_Field
    D            m
    D            n
    D            o
    E            m
    E            s

The output file has three fields:

    PK_Field   Data_A_Field   Data_B_Field
    D          x              m
    D          x              n
    D          x              o
    D          y              m
    D          y              n
    D          y              o
    E          t              m
    E          t              s

i.e. some kind of Cartesian join. The input files contain 65 million and 72 million records; the output file has 1.7 trillion records. The program is taking weeks, albeit on a low-end server. We profiled with Devel::NYTProf - the biggest chunk of time is spent writing the output records.

Any way to reduce this time?

Would it help to shift the output records to another process, or to a set of round-robin processes, through a queue - something like IBM MQ or RabbitMQ?

We tried some measures such as use integer, IO::File, and paring down the main processing loop, and we also tried splitting the input files and running on multiple machines in parallel. Nothing much gained as yet - after all, it is a simple single-pass program (we buffer the records of one file that share the current key in an array and write out the matching records of the other file).
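
For reference, the core of the loop is roughly as follows - a simplified sketch only, not our actual code: file names, the key being the first whitespace-separated column, and the lexical sort order of the inputs are assumptions made to keep the example short.

    use strict;
    use warnings;

    # both inputs are assumed to be sorted in the same (lexical) order on the key
    open my $fh_a,   '<', 'input_a.txt' or die "input_a: $!";
    open my $fh_b,   '<', 'input_b.txt' or die "input_b: $!";
    open my $fh_out, '>', 'output.txt'  or die "output: $!";

    my $pending_a = <$fh_a>;    # one-line look-ahead into file A

    # read all A records sharing the next key; returns (key, list of values)
    sub next_group_a {
        return unless defined $pending_a;
        chomp $pending_a;
        my ($key, $val) = split ' ', $pending_a, 2;
        my @vals = ($val);
        while ( defined( $pending_a = <$fh_a> ) ) {
            chomp $pending_a;
            my ($k, $v) = split ' ', $pending_a, 2;
            last if $k ne $key;
            push @vals, $v;
        }
        return ($key, @vals);
    }

    my ($key_a, @vals_a) = next_group_a();

    while ( my $line_b = <$fh_b> ) {
        chomp $line_b;
        my ($key_b, $val_b) = split ' ', $line_b, 2;

        # advance file A until its key catches up with file B
        ($key_a, @vals_a) = next_group_a()
            while defined $key_a && $key_a lt $key_b;

        next unless defined $key_a && $key_a eq $key_b;

        # emit the cross product for this key, one print per B record
        print {$fh_out} map { "$key_b $_ $val_b\n" } @vals_a;
    }

    close $_ for $fh_a, $fh_b, $fh_out;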

By the way, to manage the huge disk space needed, we close the output file every 100 million records, compress it, and delete the uncompressed file written until then; we then open the next output file with an incremented name. Onward processing happens on another machine.

2019-07-03 Athanasius added code tags

Re: Speed up file write taking weeks
by dave_the_m (Monsignor) on Jun 29, 2019 at 14:56 UTC
    If we assume that writing 1.7 trillion records takes two weeks, then that's 1.4 million records per second, which is a reasonable number, assuming that the fields are of a decent size. So while you might be able to speed things up a bit, I'd be surprised if you could reduce it from weeks to days. You'd have to parallelise it, with multiple disks and CPUs.
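
    For instance, something along these lines - only a sketch, assuming the inputs have already been split into key ranges on separate disks, and with join_slice.pl standing in for the existing single-pass join script:

    use strict;
    use warnings;

    # hypothetical slices: each child joins one key range and writes to its own disk
    my @jobs = (
        { a => '/disk1/input_a.part1', b => '/disk1/input_b.part1', out => '/disk1/out.part1' },
        { a => '/disk2/input_a.part2', b => '/disk2/input_b.part2', out => '/disk2/out.part2' },
        { a => '/disk3/input_a.part3', b => '/disk3/input_b.part3', out => '/disk3/out.part3' },
    );

    my @pids;
    for my $job (@jobs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # child: run the existing join on one slice (join_slice.pl is a placeholder)
            exec 'perl', 'join_slice.pl', $job->{a}, $job->{b}, $job->{out};
            die "exec failed: $!";
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;    # wait for all slices to finish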

    Dave.

Re: Speed up file write taking weeks
by marto (Cardinal) on Jun 29, 2019 at 16:02 UTC

    "Program is taking weeks, albeit on a low end server. We used Devel::NYTProf - the biggest chunk of time is writing the output record."

    "Any way to reduce this time?"

    Without seeing code, the only thing I'd add (or rather second) would be: since you care about how fast this happens, don't run it on a "low end" server. Shared servers (VPS etc.) often don't give you the performance advertised. PCIe 4 SSDs are just around the corner (a couple of weeks, IIRC), offering up to 5 GB/s.

Re: Speed up file write taking weeks
by holli (Abbot) on Jun 29, 2019 at 14:48 UTC
    By the way, to manage the huge disk resources needed, we close the file every 100 million records, compress it & delete the "until-then" file created.
    I think that is taking quite some of the time. Did you consider writing directly into a zip archive? Some of the available modules have low-level methods that let you add data in chunks - IO::Compress::Zip, for instance - and they don't have to work on files on disk.
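
    Roughly like this, for example - an untested sketch, with the archive name and member size made up. The join program could pipe its output into it, and a new archive member is started every 100 million records instead of closing, compressing and deleting separate output files:

    use strict;
    use warnings;
    use IO::Compress::Zip qw($ZipError);

    # read records on STDIN (e.g. piped from the join program) and write
    # them straight into a single zip archive
    my $z = IO::Compress::Zip->new( 'output.zip',
        Name  => 'part0000.txt',
        Zip64 => 1,                  # members and archive may exceed 4 GB
    ) or die "zip open failed: $ZipError";

    my $per_member = 100_000_000;    # records per archive member
    my ( $count, $part ) = ( 0, 0 );

    while ( my $record = <STDIN> ) {
        $z->print($record);
        if ( ++$count % $per_member == 0 ) {
            # start the next compressed member inside the same archive
            $z->newStream( Name => sprintf( 'part%04d.txt', ++$part ) );
        }
    }

    $z->close or die "zip close failed: $ZipError";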

    Also, there is of course the easy option of throwing money at the problem, here in the form of a faster HDD.


    holli

    You can lead your users to water, but alas, you cannot drown them.
Re: Speed up file write taking weeks
by LanX (Saint) on Jun 29, 2019 at 23:46 UTC
    > The output file has 1.7 trillion records

    Sorry but the whole concept is questionable.

    You are multiplying two files to effectively waste disc space with highly redundant information.

    I'd rather consider putting those two sources in a DB (or something similar) and joining them on demand.

    If the target consumer really needs files, try faking them with a filesystem facade that interfaces with the DB.

    I wouldn't be surprised if this bridge was faster than accessing a huge static file on a hard disk.
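
    A rough sketch of the loading and the on-demand join with DBI and DBD::SQLite - table names, file names and the key/value layout are assumptions taken from the examples above:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=join.db', '', '',
        { RaiseError => 1, AutoCommit => 1 } );

    # load each flat file into its own keyed table
    for my $src ( [ a => 'input_a.txt' ], [ b => 'input_b.txt' ] ) {
        my ( $table, $file ) = @$src;
        $dbh->do("CREATE TABLE IF NOT EXISTS $table (pk TEXT, val TEXT)");
        my $ins = $dbh->prepare("INSERT INTO $table (pk, val) VALUES (?, ?)");

        open my $fh, '<', $file or die "$file: $!";
        $dbh->begin_work;    # one big transaction keeps the bulk insert fast
        while ( my $line = <$fh> ) {
            chomp $line;
            $ins->execute( split ' ', $line, 2 );
        }
        $dbh->commit;
        close $fh;

        $dbh->do("CREATE INDEX IF NOT EXISTS idx_$table ON $table (pk)");
    }

    # join on demand instead of materialising 1.7 trillion rows on disk
    my $sth = $dbh->prepare('SELECT a.pk, a.val, b.val FROM a JOIN b ON b.pk = a.pk');
    $sth->execute;
    while ( my ( $pk, $va, $vb ) = $sth->fetchrow_array ) {
        print "$pk $va $vb\n";    # or hand the rows straight to the consumer
    }
    $dbh->disconnect;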

    Cheers Rolf
    (addicted to the Perl Programming Language :)

      Thank you for your response. We tried Oracle and then SQLite; we either ran out of disk space or the system hung. Reluctantly we adopted this method. At least it's inching somewhere.

      BTW, the problem presented is a simplified one & part of a larger solution. We guess that the 1.7 trillion records will generate around 100 million unique records (process still going on).

        Generating trillions of records from initial data containing millions of records, then deduplicating that back to millions of records sounds like a really bad idea. If you gave us a better idea of what data you start with (how many rows, columns, how many unique keys, average size of a record etc), and what the final result is that you want, we might be able to come up with some suggestions.

        Dave.

        You wrote: "The input files contain 65 million and 72 million records. The output file has 1.7 trillion records." And then, "We guess that the 1.7 trillion records will generate around 100 million unique records".

        If the final result is the generation of these 100 million "unique records" (whatever that means), what is your plan for producing them from this humongous flat file of 1.7 trillion records? A reduction by a factor of roughly 17,000 is a lot!

        It is plausible to have an SQL DB with 65 + 72 million records. If those two tables combine to produce a smaller table (fewer rows than the sum of the inputs) of 100 million, I suspect there is a much more efficient algorithm for that. However, I just don't know enough about what you are doing! My gosh, what will you do with this 1.7-trillion-record file after you generate it? How will you arrive at the 100 million unique records?

        "...out of...space..."

        This has nothing to do with the database you choose. As others have mentioned already: upgrade your hardware. You can't manage a big database on poor hardware. You may also consider partitioning your tables. Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: Speed up file write taking weeks
by marioroy (Prior) on Dec 01, 2019 at 06:55 UTC

    Update 1: Added install instructions
    Update 2: Updated stage_a.pl, stage_b.pl

    Hi Sanjay,

    For something like this, one might want to do it in stages. But first, I'd like to introduce you to Kyoto Cabinet, a quite fast key-value database. See more info here, on PerlMonks.

    Installation

    Kyoto Cabinet site, source, and Perl module.
    https://fallabs.com/kyotocabinet/
    https://fallabs.com/kyotocabinet/pkg/
    https://fallabs.com/kyotocabinet/perlpkg/
    https://fallabs.com/kyotocabinet/perldoc/

    MacPorts patch files.
    http://ftp.mirrorservice.org/sites/rsync.macports.org/release/ports/databases/kyotocabinet/files/

    More info can be found here (i.e. tuning parameters after the file name).
    https://github.com/alticelabs/kyoto

    $ tar xf $HOME/Downloads/kyotocabinet-1.2.77.tar.gz
    $ cd kyotocabinet-1.2.77
    $ patch -p0 < $HOME/Downloads/patch-configure.diff
    $ patch -p0 < $HOME/Downloads/patch-kcthread.cc
    $ patch -p0 < $HOME/Downloads/patch-kccommon.h.diff
    $ ./configure --disable-lzo --disable-lzma
    $ make -j2
    $ sudo make install
    $ cd ..
    $ tar xf $HOME/Downloads/kyotocabinet-perl-1.20.tar.gz
    $ cd kyotocabinet-perl-1.20
    $ perl Makefile.PL
    $ make
    $ make test
    $ sudo make install
    $ cd doc
    $ open index.html in a web browser
    $ cd ../..

    Demonstration

    I created approximately 10 and 17 million records for input_a.txt and input_b.txt, respectively, using the following scripts. The input files remain sorted.

    use strict;
    use warnings;

    open my $fh_a, '>', 'input_a.txt' or die "open error: $!";

    foreach my $i ( 1 .. 333_333 * 10 ) {
        my $prefix = sprintf('%09s', $i);
        print $fh_a "${prefix}D x\n";
        print $fh_a "${prefix}D y\n";
        print $fh_a "${prefix}E t\n";
    }

    close $fh_a;
    use strict;
    use warnings;

    open my $fh_b, '>', 'input_b.txt' or die "open error: $!";

    foreach my $i ( 1 .. 333_333 * 10 ) {
        my $prefix = sprintf('%09s', $i);
        print $fh_b "${prefix}D m\n";
        print $fh_b "${prefix}D n\n";
        print $fh_b "${prefix}D o\n";
        print $fh_b "${prefix}E m\n";
        print $fh_b "${prefix}E s\n";
    }

    close $fh_b;

    Afterwards, run stage_a.pl to create the Kyoto Cabinet database. This is quite fast and stores 10 million records in little time.

    use strict;
    use warnings;

    use KyotoCabinet;

    # construct the database object
    my $db = KyotoCabinet::DB->new();

    # open the database
    if (! $db->open('input_a.kct#pccap=256m', $db->OWRITER | $db->OCREATE)) {
        die "open error (db): ", $db->error;
    }

    # open input_file a
    open my $fh_a, '<', 'input_a.txt' or die "open error (txt): $!";

    my $key_a = '';
    my $val_a = '';

    sub db_store {
        # store key-value pair
        if (! $db->set($key_a, $val_a)) {
            warn "db set error: ", $db->error, "\n";
        }
    }

    while ( my $line = <$fh_a> ) {
        chomp $line;
        my ($key, $val) = split(' ', $line, 2);
        if ($key ne $key_a) {
            db_store() if $key_a;
            $key_a = $key;
            $val_a = '';
        }
        $val_a .= ($val_a) ? " $val" : $val;
    }

    db_store() if $key_a;

    close $fh_a;
    $db->close;

    Now the fun part: stage_b.pl (note: redirect its output to a file). This too is fast and completes in less than a minute on my laptop.

    use strict;
    use warnings;

    use KyotoCabinet;

    # construct the database object
    my $db = KyotoCabinet::DB->new();

    # open the database
    if (! $db->open('input_a.kct#pccap=256m', $db->OREADER)) {
        die "open error (db): ", $db->error;
    }

    # open input_file b
    open my $fh_b, '<', 'input_b.txt' or die "open error (txt): $!";

    my $key_b = '';
    my @val_b = '';

    sub output_key {
        # process key
        foreach my $val_a ( split(' ', $db->get($key_b)) ) {
            foreach my $val_b ( @val_b ) {
                print "$key_b $val_a $val_b\n";
            }
        }
    }

    while ( my $line = <$fh_b> ) {
        chomp $line;
        my ($key, $val) = split(' ', $line, 2);
        if ($key ne $key_b) {
            output_key() if $key_b;
            $key_b = $key;
            @val_b = ();
        }
        push @val_b, $val;
    }

    output_key() if $key_b;

    close $fh_b;
    $db->close;

    Files and Output

    Here is the ls -lh and wc -l output. There are some big files.

    $ ls -lh
    total 1705800
    -rw-r--r--  1 mario  staff    96M Dec  1 01:17 input_a.kct
    -rw-r--r--  1 mario  staff   281B Dec  1 01:15 input_a.pl
    -rw-r--r--  1 mario  staff   124M Dec  1 01:15 input_a.txt
    -rw-r--r--  1 mario  staff   349B Dec  1 01:15 input_b.pl
    -rw-r--r--  1 mario  staff   207M Dec  1 01:16 input_b.txt
    -rw-r--r--  1 mario  staff   381M Dec  1 01:18 output.txt
    -rw-r--r--  1 mario  staff   797B Dec  1 01:12 stage_a.pl
    -rw-r--r--  1 mario  staff   824B Dec  1 01:10 stage_b.pl

    $ wc -l *.txt
      9999990 input_a.txt
     16666650 input_b.txt
     26666640 output.txt
     53333280 total

    Finally, the head of output.txt.

    000000001D x m
    000000001D x n
    000000001D x o
    000000001D y m
    000000001D y n
    000000001D y o
    000000001E t m
    000000001E t s
    000000002D x m
    000000002D x n
    000000002D x o
    000000002D y m
    000000002D y n
    000000002D y o
    000000002E t m
    000000002E t s
    000000003D x m
    000000003D x n
    000000003D x o
    000000003D y m
    000000003D y n
    000000003D y o
    000000003E t m
    000000003E t s
    ...

    One may choose another key-value database for stage_a. I do not understand how the 1.7 trillion output records come about in your case. Anyway, I gave this a try and had to look up Kyoto Cabinet; I'm on a Mac and applied the three patches.

    Regards, Mario
