Request to detect the mistake in a perl script for finding inter-substring distance from a large text file

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I am a beginner in Perl programming. I have written a Perl script that reads a small text file and gives correct results for the inter-substring distance in cmd on Windows XP. But cmd reports "Out of memory!" when I try to analyze a large text file with 219475005 letters. The program counts the number of each letter in the file correctly within 2 minutes, but then fails to find the inter-substring distance. I think this could be due to incorrect reading of the file.

So I have given the initial part of the script and the results of cmd screen below. I am seeking your suggestions to rectify the mistake in the script for analyzing a large file.

Furthermore, I need the syntax at the initial part to assign the large input file to an array variable like my @lines so that I can assign this array to a scalar variable like my $string = "@lines"; for use in a later part of the script.

#!/usr/bin/perl -w
print "\n\nPlease type the filename: ";
$DNAfilename = <STDIN>;
chomp $DNAfilename;
# open the large file
unless ( open(DNAFILE, $DNAfilename) ) {
print "Cannot open file \"$DNAfilename\"\n\n";
exit;
} 
my @lines = <DNAFILE>; 
while (<DNAFILE>) {
  say $_;
} 
close DNAFILE;
$DNA = join( '', @lines);
# Remove whitespace
$DNA=~ s/\s//g;
# Count number of bases
$b=length($DNA);
print "\nNumber of bases: $b.";
# Count number of each base and nonbase
$A=0;$T=0;$G=0;$C=0;$e=0; 
while($DNA=~ /A/ig){$A++}
while($DNA=~ /T/ig){$T++}
while($DNA=~ /G/ig){$G++}
while($DNA=~ /C/ig){$C++}
while($DNA=~ /[^ATGC]/ig){$e++}
. . . .

Command Prompt Results:

C:\Documents and Settings\user\Desktop>m3.pl

Please type the filename of the DNA sequence data: chr1.txt

Number of bases: 219475005.

A=63473407; T=63582431; G=45425056; C=45435903; Errors(N)=1558208.

Enter a motif to count nt between two such motifs: GAATTCCT

I found the motif!

Out of memory!

C:\Documents and Settings\user\Desktop>

Thanks to Perl Monks for their quick reply in solving perl problems.


Replies are listed 'Best First'.
Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file by rovf (Priest) on Jan 24, 2012 at 10:02 UTC
`my @lines = <DNAFILE>; while (<DNAFILE>) { say $_; }`

This piece of code doesn't make sense. First, you read the whole file into memory (storing it in `@lines`), and then you try to read another line (your while loop), which is, of course, not possible because the filehandle is already at end of file. Your loop won't be executed; you can remove it without harm.

But the main problem is that you read the whole file into memory and process it from there. No wonder that your memory gets exhausted sooner or later (try to pour a whole bottle of beer into a coffee cup; unless the cup is really huge, you will spill some beer).

Maybe Tie::File will help you as a start. It allows you to treat the whole file as an array, without slurping it into memory. Be aware that the runtime of your application may increase.

-- Ronald Fischer <ynnor@mm.st>
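A minimal sketch of the Tie::File suggestion, for illustration only. The temporary file and its three short lines are made-up demo data standing in for the real sequence file, and the variable names are hypothetical. The file is tied read-only, and each record is fetched on demand instead of being loaded up front.

```perl
use strict;
use warnings;
use Tie::File;
use Fcntl 'O_RDONLY';
use File::Temp qw(tempfile);

# Hypothetical demo data; in practice $filename would be the real DNA file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "GATTACA\nCCGGTTAA\nNNACGT\n";
close $tmp_fh or die "Cannot close $filename: $!";

# Tie the file to an array without reading it all into memory.
# Tie::File removes the line ending from each record by default.
tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
    or die "Cannot tie $filename: $!";

my $line_count = scalar @lines;
my $bases      = 0;
for my $line (@lines) {
    (my $seq = $line) =~ s/\s//g;    # work on a copy, never on the tied element
    $bases += length $seq;
}
untie @lines;

print "$line_count lines, $bases bases\n";
```

Whether this is fast enough for a file of about 219 million letters would have to be measured; a plain `while (my $line = <$fh>)` loop is usually the simpler choice when the file is only read once from start to end.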
Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file by johngg (Canon) on Jan 24, 2012 at 10:25 UTC
A better way to count your bases would be to use tr.

`knoppix@Microknoppix:~$ perl -Mstrict -wE '
> my $dna = q{CCATGNGTTATGNGTTACACGTNGTNTACG};
> my $b = length $dna;
> my $A = $dna =~ tr{A}{};
> my $C = $dna =~ tr{C}{};
> my $G = $dna =~ tr{G}{};
> my $T = $dna =~ tr{T}{};
> my $e = $b - ( $A + $C + $G + $T );
> say qq{Number of bases - $b};
> say qq{A = $A; C = $C; G = $G; T = $T; err = $e};'
Number of bases - 30
A = 5; C = 5; G = 7; T = 9; err = 4
knoppix@Microknoppix:~$`

I hope this is helpful.

Cheers,

JohnGG
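As a hedged sketch, the tr approach can also be applied one line at a time, so that the whole sequence never has to be held in a single string. The sub name `count_bases` and the in-memory sample data are assumptions made for this illustration. Both upper and lower case letters are counted, to mirror the `/i` flag in the original script.

```perl
use strict;
use warnings;

# Count bases from an open filehandle, one line at a time.
# Returns a hash reference with keys A, C, G, T and other.
sub count_bases {
    my ($fh) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0, other => 0);
    while (my $line = <$fh>) {
        $line =~ s/\s//g;                       # drop newlines and spaces
        my $n_a = ($line =~ tr/Aa//);           # tr with an empty replacement only counts
        my $n_c = ($line =~ tr/Cc//);
        my $n_g = ($line =~ tr/Gg//);
        my $n_t = ($line =~ tr/Tt//);
        $count{A} += $n_a;
        $count{C} += $n_c;
        $count{G} += $n_g;
        $count{T} += $n_t;
        $count{other} += length($line) - ($n_a + $n_c + $n_g + $n_t);
    }
    return \%count;
}

# Hypothetical sample data read through an in-memory filehandle.
my $data = "ACGTN\nacgtn\nGGCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_bases($fh);
close $fh;

print "A=$count->{A} C=$count->{C} G=$count->{G} T=$count->{T} other=$count->{other}\n";
# prints "A=2 C=4 G=4 T=2 other=2"
```

For a real file, the same sub can be handed a filehandle opened on the file name instead of the in-memory string.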
Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file by lune (Pilgrim) on Jan 24, 2012 at 12:09 UTC
As far as I can see, there is no need to a) read in the whole file at once, or b) paste the lines together. Your count won't change if you do the counting line by line, which would solve your memory problem. So it would be worth looking at the part of your program that needs the whole file as a string, to see whether this could be changed too. If not, there is the suggestion about using "tie" already.

Then, you can search for all valid characters at once and use a hash to collect and count them. This would be a possible solution:

`#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;
use Data::Dumper;

my $filename = "dna.txt";
open(my $fh, "<", $filename) || die "could not open $filename: $!\n";

my %bases;
my $cnt_errors = 0;

while (<$fh>) {
    # strip spaces
    s/\s+//g;
    # collect results
    my @results = ($_ =~ /[ACGT]/ig);
    map { $bases{$_}++ } @results;
    $cnt_errors += ( length($_) - scalar @results );
}

# pass a reference so that Dumper prints one hash structure
print Dumper(\%bases);
print "Errors: $cnt_errors\n";`
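The part of the original script that searches for the motif is not shown, so the following is only a sketch of how such a search could work line by line. It makes several assumptions: the sub name `motif_gaps` is invented here, matches are taken as non-overlapping, and the "distance" is taken to be the number of letters between the end of one match and the start of the next. To catch a motif that straddles a line break, only the last `length($motif) - 1` characters are carried over from one line to the next.

```perl
use strict;
use warnings;

# Return the gaps (in letters) between consecutive, non-overlapping
# occurrences of $motif, reading the sequence one line at a time.
sub motif_gaps {
    my ($fh, $motif) = @_;
    $motif = uc $motif;
    my $mlen = length $motif;
    die "Motif must not be empty\n" if $mlen == 0;

    my $buf       = '';    # carried-over tail plus the current line
    my $buf_start = 0;     # offset of $buf's first letter in the whole sequence
    my $prev_end;          # offset just past the previous match, if any
    my @gaps;

    while (my $line = <$fh>) {
        $line =~ s/\s//g;
        $buf .= uc $line;

        my $from = 0;
        while ((my $hit = index($buf, $motif, $from)) >= 0) {
            my $start = $buf_start + $hit;
            push @gaps, $start - $prev_end if defined $prev_end;
            $prev_end = $start + $mlen;
            $from     = $hit + $mlen;    # continue after this match
        }

        # Keep only what could still begin a match: the last $mlen - 1
        # letters, but nothing already consumed by a match.
        my $keep_from = length($buf) - ($mlen - 1);
        $keep_from = $from if $keep_from < $from;
        $buf_start += $keep_from;
        $buf = substr($buf, $keep_from);
    }
    return @gaps;
}

# Hypothetical sample data; the first match straddles a line break.
my $data = "AAGAAT\nTCCCCGA\nATTCTTGAATTC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @gaps = motif_gaps($fh, 'GAATTC');
close $fh;

print "Gaps: @gaps\n";    # prints "Gaps: 3 2"
```

Memory use here depends on the longest line and the motif length rather than on the file size. If the input file has no line breaks at all, reading fixed-size chunks with `read` would be needed instead of the line-by-line loop.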
