Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I have written a script which compares multiple files and gives the number of occurrences of each paragraph in each file.
The script works fine with smaller files, but when applied to large files it gets stuck with no output.
I need some help modifying the script so that it can run on any file, however large.
My script:
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
$/ = "";    # paragraph mode
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t/, $_;
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];
    $seen{$key1} //= [ $key ];
    push @{ $seen{$key1} }, $value;
}

foreach my $key1 ( sort keys %seen ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}
Please help me as soon as possible.
Re: modification of the script to consume less memory with higher speed
by Eily (Monsignor) on Jul 29, 2016 at 09:57 UTC
If you're not going to use $tot when the value array is smaller than the number of input files, you do not need to compute it. You can move the inner for loop inside the if ( @{ $seen{$key1} } >= $file_count) block.
Anyway, when you get the file count through @ARGV, it is always 0, as the arguments have been shifted out of @ARGV by while(<>).
You can speed split up a little by limiting the number of times it splits, either with the LIMIT argument or by assigning the output to a list of defined length. Ex: (undef, my $key1) = split "\n", $key; and $tot += ( split ':', $val, 2 )[0]. For the latter, I wouldn't be surprised if the subscript already limited the number of times splitting occurs, but it is not explicitly stated as such in the doc. I'm not sure this will be a significant increase in speed, but your script is simple enough that there's not much more that can be done.
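Putting those suggestions together, here is a rough, untested sketch (the sub tally_records and the record format are my own inference from this thread): the file count is taken before <> can empty @ARGV, every split carries a LIMIT, and the tally is only computed for keys that pass the file-count test.

```perl
use strict;
use warnings;

# Sketch only: tally_records() takes the file count and a list of already
# chomped paragraph records of the form "line1\nread\n+\nquality\tN :file".
sub tally_records {
    my ( $file_count, @paragraphs ) = @_;
    my %seen;
    for my $para (@paragraphs) {
        my ( $key, $value ) = split /\t/, $para, 2;    # LIMIT 2: stop at first tab
        ( undef, my $key1 ) = split /\n/, $key, 3;     # only line 2 is needed
        $seen{$key1} //= [$key];
        push @{ $seen{$key1} }, $value;
    }
    my @out;
    for my $key1 ( sort keys %seen ) {
        next if @{ $seen{$key1} } < $file_count;       # skip before summing
        my $tot = 0;
        # element 0 holds the stored record, so sum the values only
        $tot += ( split /:/, $_, 2 )[0]
            for @{ $seen{$key1} }[ 1 .. $#{ $seen{$key1} } ];
        push @out, join( "\t", @{ $seen{$key1} } ) . "\tcount:$tot";
    }
    return @out;
}

# In a real script: my $file_count = @ARGV;  # BEFORE the while (<>) loop
```

The sub form is just to keep the sketch self-contained; in the OP's script the same changes would be made inline.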
I'm a little surprised to see that you write $key as the first value in your value array, but without sample data, what you are trying to parse with your script is not very clear.
Maybe you can check the list of files with something like:
die "File $_ does not exist" for grep { not -e } @ARGV;    # make sure @ARGV only contains filenames
warn '@ARGV is empty, the program will read from STDIN' unless @ARGV;
The second line will warn you if you ever forget to pass the file list in the parameters, since the script would otherwise seem to freeze when it is actually waiting on STDIN.
Re: modification of the script to consume less memory with higher speed
by Laurent_R (Canon) on Jul 29, 2016 at 12:26 UTC
If your files are really large, they may exceed your available memory when you try to store the data in the %seen hash. In that case, the script might either crash or become painfully slow.
Please provide an estimate of your files' sizes. Right now you appear to store each file twice in memory; you could at least reduce this to only once, and that might be sufficient to get rid of the problem.
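One such reduction, sketched below with made-up two-record sample data: the read sequence is currently held twice, once as the %seen hash key and once inside the stored record, so it can be stripped from the record and spliced back in only at print time.

```perl
use strict;
use warnings;

# Sketch: store each record without its second line (the read), since the
# read already serves as the hash key. Sample records are invented.
my %seen;
for my $rec ( "\@NS500278\nAGATCN\n+\nIIII\t2 :data1.txt",
              "\@NS500278\nAGATCN\n+\nJJJJ\t3 :data2.txt" ) {
    my ( $key, $value ) = split /\t/, $rec, 2;
    my ( $header, $key1, $rest ) = split /\n/, $key, 3;
    $seen{$key1} //= [ "$header\n$rest" ];    # record minus the duplicate read
    push @{ $seen{$key1} }, $value;
}

for my $key1 ( sort keys %seen ) {
    my ( $trimmed, @values ) = @{ $seen{$key1} };
    my ( $header, $rest ) = split /\n/, $trimmed, 2;
    # splice the read back in between the header and the "+"/quality lines
    print join( "\t", "$header\n$key1\n$rest", @values ), "\n";
}
```

This saves roughly one line in four per stored record; whether that is enough depends on the answer to the size question above.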
Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 29, 2016 at 13:04 UTC
Start by describing the problem. How many files, how many paragraphs? What is the significance of '\t' in your paragraphs? Must the paragraphs match character-for-character or is only the second line important? Actually, scratch that.
Give us a sample of your "paragraph". Describe what you want to do, not how.
I have multiple fastq files in the following format, giving the reads and the number of times each read occurs in a file, separated by a tab:
data1.txt
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt
data2.txt
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
I want to sum the occurrences of a read across all the files if the second line of each record matches, i.e. the output for the above two files should look like this:
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
My code works with files up to 10GB, but if the files exceed this size it hangs. I want my script to run on files of any size. Any help will be appreciated.
My code works with files up to 10GB, but if the files exceed this size it hangs.
Is that 10GB, all the files together; or just one of the files?
How much memory does your machine have?
You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?
Now, the remaining question is, do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?
If any output order will do, then the simplest way to process your job is to divide it up into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temp file individually, appending the output to the final result. Questions?
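The bucketing pass could look something like this minimal sketch (the prefix length, file layout, and helper name bucket_fh are all made up for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Route each read to a temp file named after its first four bases; each
# bucket is then small enough to tally in memory with the OP's hash code.
my $dir = tempdir( CLEANUP => 1 );
my %fh;    # prefix => open append filehandle

sub bucket_fh {
    my ($key1) = @_;                    # $key1 = the read (second line)
    my $prefix = substr $key1, 0, 4;    # "AGAT", "CATT", ...
    $fh{$prefix} //= do {
        open my $out, '>>', "$dir/$prefix.tmp" or die "open: $!";
        $out;
    };
    return $fh{$prefix};
}

# first pass: append each record to its bucket (reads are invented here)
for my $key1 (qw( AGATCN CATTGN AGATCN TACAGN )) {
    print { bucket_fh($key1) } "$key1\n";
}
close $_ for values %fh;

# second pass would then process $dir/*.tmp one bucket at a time
my @buckets = sort map { s{.*/}{}r } glob "$dir/*.tmp";
print "@buckets\n";
```

In a real run, the whole tab-joined record would be written to the bucket rather than just the read, so the existing tallying code can be reused per bucket unchanged.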
Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Aug 02, 2016 at 06:11 UTC
Greetings, Anonymous Monk.
Sorting may be the reason why the script is failing, but I am not sure at which point the script hangs on your machine. Have you tried not sorting the hash?
#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my $file_count = @ARGV;
my %seen;
$/ = "";    # paragraph mode

if ( @ARGV == 0 ) {
    print "usage: perl $0 file1.txt file2.txt ...\n";
    exit 1;
}

while ( my $key = <> ) {
    chomp $key;

    # obtain "\t" position
    my $tabPos = rindex $key, "\t";

    # extract value
    my $value = substr $key, $tabPos + 1;

    # obtain key1
    my @lines = split /\n/, $key, 3;
    my $key1  = $lines[1];

    $seen{$key1} //= do {
        # trim "\t" and value from key
        substr $key, $tabPos, length($value) + 1, '';
        [ $key ];
    };
    push @{ $seen{$key1} }, $value;
}

my $tot;

# sorting requires 2x the memory and may exhaust available
# memory, hang, or crash
# foreach my $key1 ( sort keys %seen ) {
#     if ( @{ $seen{$key1} } >= $file_count ) {
#         $tot = 0;
#         for my $val ( @{ $seen{$key1} } ) {
#             # $tot += ( split /:/, $val )[0];
#             $tot += $val;    # Perl ignores the string after the number
#         }
#         print join "\t", @{ $seen{$key1} };
#         print "\tcount:" . $tot . "\n\n";
#     }
# }

# try this instead for lower memory consumption
while ( my ( $key1, $aref ) = each %seen ) {
    if ( @{ $aref } >= $file_count ) {
        $tot = 0;
        for my $val ( @{ $aref } ) {
            # $tot += ( split /:/, $val )[0];
            $tot += $val;    # Perl ignores the string after the number
        }
        print join "\t", @{ $aref };
        print "\tcount:" . $tot . "\n\n";
    }
}
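A quick self-contained demonstration of the numeric-coercion trick used above ($tot += $val on strings like "3 :data1.txt"); the sample values are invented and this is an illustration only, not part of the original post:

```perl
use strict;
use warnings;
no warnings 'numeric';    # silence "isn't numeric" from the coercion below

# Perl's string-to-number conversion reads leading digits and stops at the
# first non-numeric character, so "3 :data1.txt" numifies to 3.
my @values = ( '1 :data1.txt', '3 :data1.txt', '2 :data2.txt' );
my $tot = 0;
$tot += $_ for @values;
print "count:$tot\n";    # count:6
```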
Regards.
my $tot;

foreach my $key1 ( keys %seen ) {
    if ( @{ $seen{$key1} } >= $file_count ) {
        $tot = 0;
        for my $val ( @{ $seen{$key1} } ) {
            # $tot += ( split /:/, $val )[0];
            $tot += $val;    # Perl ignores the string after the number
        }
        print join "\t", @{ $seen{$key1} };
        print "\tcount:" . $tot . "\n\n";
    }
}
Regards.