Re^2: Speed up file write taking weeks

by Sanjay (Sexton)
on Jul 01, 2019 at 06:28 UTC


in reply to Re: Speed up file write taking weeks
in thread Speed up file write taking weeks

Thank you for your response. We tried Oracle and then SQLite, but either ran out of disk space or had the system hang. We reluctantly adopted this method. At least it's inching somewhere.

BTW, the problem presented is a simplified one and part of a larger solution. We estimate that the 1.7 trillion records will reduce to around 100 million unique records (the process is still running).

Replies are listed 'Best First'.
Re^3: Speed up file write taking weeks
by dave_the_m (Monsignor) on Jul 01, 2019 at 10:55 UTC
    Generating trillions of records from initial data containing millions of records, then deduplicating that back to millions of records sounds like a really bad idea. If you gave us a better idea of what data you start with (how many rows, columns, how many unique keys, average size of a record etc), and what the final result is that you want, we might be able to come up with some suggestions.

    Dave.

      Let us move from here to the original problem.

      Two tables are used as "input" to an SQL statement:

      1. T_1_DAT ( t_1_id* , t_1_data_1 ) - 65 million rows - * = primary key

      2. T_2_KEY ( t_2_id@ , t_2_data_2 ) - 72 million rows - @ = part of the primary key, with an index on this field alone

      SQL

      insert into T_3_XYZ ( t_3_id , t_3_data_1 , t_3_data_2 ) select distinct t_1_id , t_1_data_1 , t_2_data_2 from T_1_DAT , T_2_KEY where t_1_id = t_2_id

      This works for databases with a multi-column distinct facility. 1.7 trillion rows are generated if the distinct is removed; we are guessing around 100 million with it.

      Now that we are doing it with the flat-file method, we are "joining" the two files, generating the 1.7 trillion rows of output in instalments. After this we will sort it and "coalesce" it - in stages, I guess!

        Well, I can't comment on improving the SQL since that's not my area of expertise, but starting from flat files and using perl, I suggest:

        1. Sort both input files by key (either when exporting them from the DB or as a post-export step).
        2. Get perl to read the first line from each file and extract its key.
        3. If the keys don't match, read another line from the file whose key sorts earlier, until you have identical keys from both files.
        4. Keep reading lines from both files while they still have that same key. At this point you will have extracted two lists of data values; from your original example, for the key 'D' they would be (x, y) and (m, n, o).
        5. De-duplicate each list individually (assuming such duplicates are possible) using a pair of hashes.
        6. Calculate the cartesian product of those two lists, using a third hash to de-duplicate, and output to a third file a series of lines containing the same key and the various data pairs.

        Then go back to step 2, reading new lines from each file.
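
        A minimal Perl sketch of these steps (the file names and the tab-separated "key<TAB>data" layout are assumptions for illustration; both inputs are presumed already sorted lexically by key, e.g. with the system sort utility):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical sorted, tab-separated exports: one "id<TAB>data" pair per line.
        my ($file1, $file2, $outfile) = ('T_1_DAT.sorted', 'T_2_KEY.sorted', 'T_3_XYZ.out');

        open my $fh1, '<', $file1   or die "$file1: $!";
        open my $fh2, '<', $file2   or die "$file2: $!";
        open my $out, '>', $outfile or die "$outfile: $!";

        # Read one record, returning (key, data) or the empty list at end of file.
        sub next_rec {
            my ($fh) = @_;
            my $line = <$fh>;
            return () unless defined $line;
            chomp $line;
            return split /\t/, $line, 2;
        }

        my ($k1, $v1) = next_rec($fh1);
        my ($k2, $v2) = next_rec($fh2);

        while (defined $k1 && defined $k2) {
            if    ($k1 lt $k2) { ($k1, $v1) = next_rec($fh1) }   # advance the file whose key sorts earlier
            elsif ($k2 lt $k1) { ($k2, $v2) = next_rec($fh2) }
            else {
                my $key = $k1;
                my (%d1, %d2);                       # de-duplicate each side's values with a hash

                while (defined $k1 && $k1 eq $key) { # collect every data_1 value for this key
                    $d1{$v1} = 1;
                    ($k1, $v1) = next_rec($fh1);
                }
                while (defined $k2 && $k2 eq $key) { # collect every data_2 value for this key
                    $d2{$v2} = 1;
                    ($k2, $v2) = next_rec($fh2);
                }

                # Cartesian product of the two lists; each list is already
                # unique here, so the pairs come out unique as well.
                for my $data1 (sort keys %d1) {
                    for my $data2 (sort keys %d2) {
                        print {$out} join("\t", $key, $data1, $data2), "\n";
                    }
                }
            }
        }

        Because both files are processed as streams, memory use is bounded by the number of distinct values per key, not by the total row count.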

        Dave.

        Does the performance improve if you change the query to use proper JOINs instead?

        insert into T_3_XYZ ( t_3_id , t_3_data_1 , t_3_data_2 ) select distinct t_1_id , t_1_data_1 , t_2_data_2 from T_1_DAT inner join T_2_KEY on t_1_id = t_2_id

        Ideally the query optimizer would recognize that the original comma join with its WHERE clause is equivalent to the INNER JOIN, but maybe it doesn't?

Re^3: Speed up file write taking weeks
by Marshall (Canon) on Jul 02, 2019 at 00:43 UTC
    You wrote: "The input files contain 65 million and 72 million records. The output file has 1.7 trillion records." And then, "We guess that the 1.7 trillion records will generate around 100 million unique records".

    If the final result is the generation of these 100 million "unique records" (whatever that means), what is your plan for producing them from this humongous flat file of 1.7 trillion records? Going from trillions of records back down to millions is an enormous reduction!

    It is plausible to have an SQL DB with 65 + 72 million records. If those 2 tables combine to produce a smaller table (fewer rows than the sum of the inputs) of 100 million, I suspect there is a much more efficient algorithm to do that. However, I just don't know enough about what you are doing! My gosh, what will you do with this 1.7 trillion record file after you generate it? How will you arrive at the 100 million unique records?

      Summarise based on another field

        It is plausible to take 2 tables, one with 65 million records and one with 72 million records, and generate a resulting table of 100 million records without having to build a multi-trillion-line intermediate table.
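
        For example, if the "summarise" step were a count per (t_1_data_1, t_2_data_2) pair, the summary could be accumulated while the two sorted files are merged, so the multi-trillion-row intermediate never has to exist. The sketch below is purely illustrative and assumes, as in the merge-join sketch above, lexically sorted tab-separated "id<TAB>data" exports, with T_1_DAT holding exactly one line per id:

        #!/usr/bin/perl
        use strict;
        use warnings;

        open my $fh_dat, '<', 'T_1_DAT.sorted' or die "T_1_DAT.sorted: $!";
        open my $fh_key, '<', 'T_2_KEY.sorted' or die "T_2_KEY.sorted: $!";

        my %count;    # "data_1\tdata_2" => number of joined rows

        my ($k1, $d1) = read_rec($fh_dat);
        my ($k2, $d2) = read_rec($fh_key);

        while (defined $k1 && defined $k2) {
            if    ($k1 lt $k2) { ($k1, $d1) = read_rec($fh_dat) }
            elsif ($k2 lt $k1) { ($k2, $d2) = read_rec($fh_key) }
            else {
                $count{"$d1\t$d2"}++;               # summarise the pair instead of writing it out
                ($k2, $d2) = read_rec($fh_key);     # T_2_KEY may repeat this key
            }
        }

        print "$_\t$count{$_}\n" for sort keys %count;

        # Read one record, returning (key, data) or the empty list at end of file.
        sub read_rec {
            my ($fh) = @_;
            my $line = <$fh>;
            return () unless defined $line;
            chomp $line;
            return split /\t/, $line, 2;
        }

        Only the summary hash is held in memory; if even that proved too large, the counts could be flushed and merged in ranges of the summary field, but that detail is beyond this sketch.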
Re^3: Speed up file write taking weeks
by karlgoethebier (Abbot) on Jul 02, 2019 at 19:22 UTC
    "...out of...space..."

    This has nothing to do with the database you choose. As others have already mentioned: upgrade your hardware. You can't manage a big database with poor hardware. You may also consider partitioning your tables. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
