Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I have written a script which compares multiple files and gives the number of occurrences of each paragraph in each file.
The script works fine with smaller files, but when applied to large files it gets stuck with no output.
I need some help modifying the script so that it can run on any file, however large.
My script:
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
$/ = "";    # paragraph mode
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t/, $_;
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];
    $seen{$key1} //= [ $key ];
    push @{ $seen{$key1} }, $value;
}

foreach my $key1 ( sort keys %seen ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}
Please help me as soon as possible.
Re: modification of the script to consume less memory with higher speed
by Eily (Monsignor) on Jul 29, 2016 at 09:57 UTC
If you're not going to use $tot when the value array is smaller than the number of input files, you do not need to compute it. You can move the inner for loop inside the if ( @{ $seen{$key1} } >= $file_count) block.
Anyway, when you get the file count through @ARGV, it is always 0, as the arguments have been shifted out of @ARGV by while(<>).
You can speed split up a little by limiting the number of times it splits, either with the LIMIT argument or by assigning the output to a list of defined length. Ex: (undef, my $key1) = split "\n", $key; and $tot += ( split ':', $val, 2 )[0]. For the latter, I wouldn't be surprised if the subscript already limited the number of times splitting occurs, but it is not explicitly stated as such in the doc. I'm not sure this will be a significant increase in speed, but your script is simple enough that there's not much more that can be done.
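Putting those suggestions together, here is a rough, untested sketch (the sub tally_records and the record format are my own inference from this thread): the file count is taken before <> can empty @ARGV, every split carries a LIMIT, and the tally is only computed for keys that pass the file-count test.

```perl
use strict;
use warnings;

# Sketch only: tally_records() takes the file count and a list of already
# chomped paragraph records of the form "line1\nread\n+\nquality\tN :file".
sub tally_records {
    my ( $file_count, @paragraphs ) = @_;
    my %seen;
    for my $para (@paragraphs) {
        my ( $key, $value ) = split /\t/, $para, 2;    # LIMIT 2: stop at first tab
        ( undef, my $key1 ) = split /\n/, $key, 3;     # only line 2 is needed
        $seen{$key1} //= [$key];
        push @{ $seen{$key1} }, $value;
    }
    my @out;
    for my $key1 ( sort keys %seen ) {
        next if @{ $seen{$key1} } < $file_count;       # skip before summing
        my $tot = 0;
        # element 0 holds the stored record, so sum the values only
        $tot += ( split /:/, $_, 2 )[0]
            for @{ $seen{$key1} }[ 1 .. $#{ $seen{$key1} } ];
        push @out, join( "\t", @{ $seen{$key1} } ) . "\tcount:$tot";
    }
    return @out;
}

# In a real script: my $file_count = @ARGV;  # BEFORE the while (<>) loop
```

The sub form is just to keep the sketch self-contained; in the OP's script the same changes would be made inline.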
I'm a little surprised to see that you write $key as the first value in your value array, but without sample data, what you are trying to parse with your script is not very clear.
Maybe you can check the list of files with something like:
die "File $_ does not exist" for grep { not -e } @ARGV;    # make sure @ARGV only contains filenames
warn '@ARGV is empty, the program will read from STDIN' unless @ARGV;
The second line will warn you if you ever forget to pass the file list in the parameters, since the script would otherwise seem to freeze when it is actually waiting on STDIN.
Re: modification of the script to consume less memory with higher speed
by Laurent_R (Canon) on Jul 29, 2016 at 12:26 UTC
If your files are really large, they may exceed your available memory when you try to store the data in the %seen hash. In that case, the script might either crash or become painfully slow.
Please provide an estimate of your files' sizes. Right now you appear to store each file twice in memory; you could at least reduce this to only once, and that might be sufficient to get rid of the problem.
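One such reduction, sketched below with made-up two-record sample data: the read sequence is currently held twice, once as the %seen hash key and once inside the stored record, so it can be stripped from the record and spliced back in only at print time.

```perl
use strict;
use warnings;

# Sketch: store each record without its second line (the read), since the
# read already serves as the hash key. Sample records are invented.
my %seen;
for my $rec ( "\@NS500278\nAGATCN\n+\nIIII\t2 :data1.txt",
              "\@NS500278\nAGATCN\n+\nJJJJ\t3 :data2.txt" ) {
    my ( $key, $value ) = split /\t/, $rec, 2;
    my ( $header, $key1, $rest ) = split /\n/, $key, 3;
    $seen{$key1} //= [ "$header\n$rest" ];    # record minus the duplicate read
    push @{ $seen{$key1} }, $value;
}

for my $key1 ( sort keys %seen ) {
    my ( $trimmed, @values ) = @{ $seen{$key1} };
    my ( $header, $rest ) = split /\n/, $trimmed, 2;
    # splice the read back in between the header and the "+"/quality lines
    print join( "\t", "$header\n$key1\n$rest", @values ), "\n";
}
```

This saves roughly one line in four per stored record; whether that is enough depends on the answer to the size question above.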
Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 29, 2016 at 13:04 UTC
Start by describing the problem. How many files, how many paragraphs? What is the significance of '\t' in your paragraphs? Must the paragraphs match character-for-character or is only the second line important? Actually, scratch that.
Give us a sample of your "paragraph". Describe what you want to do, not how.
I have multiple fastq files in the following format, giving the reads and the number of times each read occurs in a file, separated by a tab:
data1.txt
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt
data2.txt
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
I want to sum the occurrences of a read across all the files if the second line of each record matches, i.e. the output for the above two files should look like this:
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
My code works with files up to 10GB, but if the files exceed this size it hangs. I want my script to run on files of any size. Any help will be appreciated.
My code works with files up to 10GB, but if the files exceed this size it hangs.
Is that 10GB, all the files together; or just one of the files?
How much memory does your machine have?
You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?
Now, the remaining question is, do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?
If any output order will do, then the simplest way to process your job is to divide it up into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temp file individually, appending the output to the final result. Questions?
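The bucketing pass could look something like this minimal sketch (the prefix length, file layout, and helper name bucket_fh are all made up for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Route each read to a temp file named after its first four bases; each
# bucket is then small enough to tally in memory with the OP's hash code.
my $dir = tempdir( CLEANUP => 1 );
my %fh;    # prefix => open append filehandle

sub bucket_fh {
    my ($key1) = @_;                    # $key1 = the read (second line)
    my $prefix = substr $key1, 0, 4;    # "AGAT", "CATT", ...
    $fh{$prefix} //= do {
        open my $out, '>>', "$dir/$prefix.tmp" or die "open: $!";
        $out;
    };
    return $fh{$prefix};
}

# first pass: append each record to its bucket (reads are invented here)
for my $key1 (qw( AGATCN CATTGN AGATCN TACAGN )) {
    print { bucket_fh($key1) } "$key1\n";
}
close $_ for values %fh;

# second pass would then process $dir/*.tmp one bucket at a time
my @buckets = sort map { s{.*/}{}r } glob "$dir/*.tmp";
print "@buckets\n";
```

In a real run, the whole tab-joined record would be written to the bucket rather than just the read, so the existing tallying code can be reused per bucket unchanged.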
Re: modification of the script to consume less memory with higher speed
by Anonymous Monk on Aug 02, 2016 at 06:11 UTC
Greetings, Anonymous Monk.
Sorting may be the reason why the script is failing, but I am not sure at which point the script hangs on your machine. Have you tried not sorting the hash?
#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my $file_count = @ARGV;
my %seen;
$/ = "";    # paragraph mode

if ( @ARGV == 0 ) {
    print "usage: perl $0 file1.txt file2.txt ...\n";
    exit 1;
}

while ( my $key = <> ) {
    chomp $key;

    # obtain "\t" position
    my $tabPos = rindex $key, "\t";

    # extract value
    my $value = substr $key, $tabPos + 1;

    # obtain key1
    my @lines = split /\n/, $key, 3;
    my $key1  = $lines[1];

    $seen{$key1} //= do {
        # trim "\t" and value from key
        substr $key, $tabPos, length($value) + 1, '';
        [ $key ];
    };
    push @{ $seen{$key1} }, $value;
}

my $tot;

# sorting requires 2x the memory and may exhaust available
# memory, hang, or crash
# foreach my $key1 ( sort keys %seen ) {
#     if ( @{ $seen{$key1} } >= $file_count ) {
#         $tot = 0;
#         for my $val ( @{ $seen{$key1} } ) {
#             # $tot += ( split /:/, $val )[0];
#             $tot += $val;    # Perl ignores the string after the number
#         }
#         print join "\t", @{ $seen{$key1} };
#         print "\tcount:" . $tot . "\n\n";
#     }
# }

# try this instead for lower memory consumption
while ( my ( $key1, $aref ) = each %seen ) {
    if ( @{ $aref } >= $file_count ) {
        $tot = 0;
        for my $val ( @{ $aref } ) {
            # $tot += ( split /:/, $val )[0];
            $tot += $val;    # Perl ignores the string after the number
        }
        print join "\t", @{ $aref };
        print "\tcount:" . $tot . "\n\n";
    }
}
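A quick self-contained demonstration of the numeric-coercion trick used above ($tot += $val on strings like "3 :data1.txt"); the sample values are invented and this is an illustration only, not part of the original post:

```perl
use strict;
use warnings;
no warnings 'numeric';    # silence "isn't numeric" from the coercion below

# Perl's string-to-number conversion reads leading digits and stops at the
# first non-numeric character, so "3 :data1.txt" numifies to 3.
my @values = ( '1 :data1.txt', '3 :data1.txt', '2 :data2.txt' );
my $tot = 0;
$tot += $_ for @values;
print "count:$tot\n";    # count:6
```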
Regards.
my $tot;

foreach my $key1 ( keys %seen ) {
    if ( @{ $seen{$key1} } >= $file_count ) {
        $tot = 0;
        for my $val ( @{ $seen{$key1} } ) {
            # $tot += ( split /:/, $val )[0];
            $tot += $val;    # Perl ignores the string after the number
        }
        print join "\t", @{ $seen{$key1} };
        print "\tcount:" . $tot . "\n\n";
    }
}
Regards.