PerlMonks  

modification of the script to consume less memory with higher speed

by Anonymous Monk
on Jul 29, 2016 at 05:34 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script which compares multiple files and gives the number of occurrences of each paragraph in each file. The script works fine with smaller files, but when applied to large files the program gets stuck with no output. I need some help in modifying the script so that it can run on all files, even very large ones. My script:
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
$/ = "";

while (<>) {
    chomp;
    my ($key, $value) = split ('\t', $_);
    my @lines = split /\n/, $key;
    my $key1 = $lines[1];
    $seen{$key1} //= [ $key ];
    push (@{$seen{$key1}}, $value);
}

foreach my $key1 ( sort keys %seen ) {
    my $tot = 0;
    my $file_count = @ARGV;
    for my $val ( @{$seen{$key1}} ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count) {
        print join( "\t", @{$seen{$key1}});
        print "\tcount:". $tot."\n\n";
    }
}
please help me as soon as possible

Re: modification of the script to consume less memory with higher speed
by Eily (Monsignor) on Jul 29, 2016 at 09:57 UTC

    If you're not going to use $tot when the value array is smaller than the number of input files, you do not need to compute it. You can move the inner for loop inside the if ( @{ $seen{$key1} } >= $file_count) block.
    Anyway, when you get the file count through @ARGV, it is always 0, as the arguments have been shifted out of @ARGV by while(<>).
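
    Applying both of those changes to the posted script might look something like this (an untested sketch; everything else, including keeping the full paragraph as the first array element, is left as in the original):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # take the file count *before* while (<>) shifts the names out of @ARGV
    my $file_count = @ARGV;

    my %seen;
    $/ = "";    # paragraph mode

    while (<>) {
        chomp;
        my ($key, $value) = split /\t/, $_;
        my @lines = split /\n/, $key;
        my $key1  = $lines[1];
        $seen{$key1} //= [ $key ];
        push @{ $seen{$key1} }, $value;
    }

    foreach my $key1 ( sort keys %seen ) {
        if ( @{ $seen{$key1} } >= $file_count ) {
            # the total is only computed for keys that will actually be printed
            my $tot = 0;
            for my $val ( @{ $seen{$key1} } ) {
                $tot += ( split /:/, $val )[0];
            }
            print join( "\t", @{ $seen{$key1} } );
            print "\tcount:" . $tot . "\n\n";
        }
    }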

    You can speed up split a little by limiting the number of times it splits. Either with the LIMIT argument, or by writing the output to a list with a defined length. Ex: (undef, my $key1) = split "\n", $key; and $tot += ( split ':', $val, 2)[0]. For the latter, I wouldn't be surprised if the subscript already limited the number of times splitting occurs, but it is not explicitly stated in the doc. I'm not sure this will be a significant increase in speed, but your script is simple enough that there's not much that can be done.
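
    For instance, with a made-up record and value in the question's shape (just a sketch to show the two forms, not a drop-in replacement):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # made-up record and value in the same shape as the question's data
    my $key = "\@NS500278\nAGATCNGAAG\n+\n=CCGGGCGGG";
    my $val = "3 :data1.txt";

    # list assignment: perl supplies a LIMIT of one more than the number of
    # variables, so the rest of $key is not split any further
    (undef, my $key1) = split "\n", $key;

    # explicit LIMIT: stop scanning $val after the first ':'
    my $count = ( split ':', $val, 2 )[0];

    print "$key1 => $count\n";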

    I'm a little surprised to see that you write $key as the first value in your value array, but without sample data, what you are trying to parse with your script is not very clear.

    Maybe you can check the list of files with something like:

    die "File $_ does not exist" for grep { not -e } @ARGV; # make sure @A +RGV only contains filenames. warn '@ARGV is empty, the program will read from STDIN' unless @ARGV;
    The second line will warn you if you ever forget to pass the file list in the parameters, as the script would seem to freeze when it is actually waiting on STDIN.

Re: modification of the script to consume less memory with higher speed
by Laurent_R (Canon) on Jul 29, 2016 at 12:26 UTC
    If your files are really large, then they may exceed your available memory when you try to store data in the %seen hash. In that case, it might either crash or become painfully slow.

    Please provide an estimate of your files' sizes. Right now you appear to store each file twice in memory; you could at least reduce this to only once, and that might be sufficient to get rid of the problem.
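
    If you want to see how big %seen actually gets while the script runs, one quick way to measure it (my suggestion; it relies on the CPAN module Devel::Size, which is not part of the original script) is:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Devel::Size must be installed from CPAN; total_size() walks a
    # structure and reports the bytes it really occupies
    use Devel::Size qw( total_size );

    my %seen = (
        AGATCNGAAG => [ "\@NS500278\nAGATCNGAAG\n+\n=CCGG", '1 :data1.txt' ],
    );

    printf "%%seen currently occupies about %d bytes\n", total_size( \%seen );

    Printing that figure every few hundred thousand records would show quickly whether the hash is on course to outgrow the machine's memory.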

Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 29, 2016 at 13:04 UTC

    Start by describing the problem. How many files, how many paragraphs? What is the significance of '\t' in your paragraphs? Must the paragraphs match character-for-character or is only the second line important? Actually, scratch that.

    Give us a sample of your "paragraph". Describe what you are wanting to do, not how.

      I have multiple fastq files in the following format, giving the reads and the number of times each read occurs in a file, separated by a tab:
      data1.txt

      @NS500278
      AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
      +
      =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G    1 :data1.txt

      @NS500278
      CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
      +
      CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ    3 :data1.txt

      @NS500278
      TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
      +
      CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ    2 :data1.txt
      data2.txt

      @NS500278
      AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
      +
      AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE    1 :data2.txt

      @NS500278
      CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
      +
      AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE    2 :data2.txt

      @NS500278
      TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
      +
      AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE    2 :data2.txt
      I want to sum the occurrences of a read across all the files if the second line of each read matches, i.e. the output for the above two files should look like this:
      @NS500278
      AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
      +
      =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G    1 :data1.txt    1 :data2.txt    count:2

      @NS500278
      CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
      +
      CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ    3 :data1.txt    2 :data2.txt    count:5

      @NS500278
      TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
      +
      CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ    2 :data1.txt    2 :data2.txt    count:4
      My code works with files up to 10GB, but if the files exceed this size it hangs. I want my script to run on files of any size. Any help will be appreciated.
        My code works with files up to 10GB, but if the files exceed this size it hangs.

        Is that 10GB, all the files together; or just one of the files?

        How much memory does your machine have?


        Hi,

        Since scalability is one of your top priorities, consider using a key-value database like Redis, MemcacheDB or ...
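
        As a rough illustration of the idea (my own sketch: it assumes a Redis server running on localhost and the CPAN Redis client module, neither of which is part of the original script), the tallies could be pushed into Redis instead of a Perl hash:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        use Redis;

        my $redis = Redis->new;                 # 127.0.0.1:6379 by default

        $/ = "";                                # paragraph mode, as in the question
        while (<>) {
            chomp;
            my ($key, $value) = split /\t/, $_, 2;
            my $key1    = ( split /\n/, $key, 3 )[1];
            my ($count) = $value =~ /^(\d+)/;

            # the tallies live in the Redis server, not in a Perl hash,
            # so the script's own memory use stays flat
            $redis->incrby( "total:$key1", $count );
            $redis->incr( "seen:$key1" );
        }

        The totals then sit in the Redis server's memory and can be read back out afterwards, at the cost of one network round trip per record.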

        You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?

        Now, the remaining question is, do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?

        If any output order will do, then the simplest way to process your job is to divide it up in parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temp file individually, appending the output to the final result. Questions?
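
        A rough sketch of that bucketing step (my own illustration, with an arbitrary four-letter prefix and a *.tmp naming scheme):

        #!/usr/bin/env perl
        use strict;
        use warnings;

        # append each record to a bucket file named after the first four
        # bases of its sequence line (AGAT.tmp, CATT.tmp, ...), so matching
        # records always land in the same, much smaller, file
        my %fh;

        $/ = "";                                # paragraph mode, as in the question
        while ( my $para = <> ) {
            chomp $para;
            my $seq    = ( split /\n/, $para, 3 )[1];
            my $bucket = substr $seq, 0, 4;

            unless ( $fh{$bucket} ) {
                open $fh{$bucket}, '>>', "$bucket.tmp"
                    or die "Cannot open $bucket.tmp: $!";
            }
            print { $fh{$bucket} } $para, "\n\n";   # keep the paragraph format
        }
        close $_ for values %fh;

        Each bucket then holds only records that could possibly share a second line, so the original script can be run over the *.tmp files one at a time and its output appended to the final result.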

Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Aug 02, 2016 at 06:11 UTC
    Greetings, Anonymous Monk.

    Sorting may be the reason why the script is failing, but I am not sure at which point the script hangs on your machine. Have you tried not sorting the hash?

    #!/usr/bin/env perl
    use strict;
    use warnings;
    no warnings qw( numeric );

    my $file_count = @ARGV;
    my %seen;

    $/ = "";

    if ( @ARGV == 0 ) {
        print "usage: perl $0 file1.txt file2.txt ...\n";
        exit 1;
    }

    while ( my $key = <> ) {
        chomp $key;

        # obtain "\t" position
        my $tabPos = rindex $key, "\t";

        # extract value
        my $value = substr $key, $tabPos + 1;

        # obtain key1
        my @lines = split /\n/, $key, 3;
        my $key1 = $lines[1];

        $seen{$key1} //= do {
            # trim "\t" and value from key
            substr $key, $tabPos, length($value) + 1, '';
            [ $key ];
        };

        push @{ $seen{$key1} }, $value;
    }

    my $tot;

    # sorting requires 2x memory allocation and may exhaust
    # available memory, hang, or crash
    #
    # foreach my $key1 ( sort keys %seen ) {
    #     if ( @{ $seen{$key1} } >= $file_count ) {
    #         $tot = 0;
    #         for my $val ( @{ $seen{$key1} } ) {
    #             # $tot += ( split /:/, $val )[0];
    #             $tot += $val;   # Perl ignores the string after the number
    #         }
    #         print join "\t", @{ $seen{$key1} };
    #         print "\tcount:". $tot."\n\n";
    #     }
    # }

    # try this instead for less memory consumption
    while ( my ( $key1, $aref ) = each %seen ) {
        if ( @{ $aref } >= $file_count ) {
            $tot = 0;
            for my $val ( @{ $aref } ) {
                # $tot += ( split /:/, $val )[0];
                $tot += $val;   # Perl ignores the string after the number
            }
            print join "\t", @{ $aref };
            print "\tcount:". $tot."\n\n";
        }
    }

    Regards.

      For this particular use case, I saw less memory consumption using the following for output. Your mileage may vary. This is without sorting.

      my $tot;

      foreach my $key1 ( keys %seen ) {
          if ( @{ $seen{$key1} } >= $file_count ) {
              $tot = 0;
              for my $val ( @{ $seen{$key1} } ) {
                  # $tot += ( split /:/, $val )[0];
                  $tot += $val;   # Perl ignores the string after number
              }
              print join "\t", @{ $seen{$key1} };
              print "\tcount:". $tot."\n\n";
          }
      }

      Regards.
