how to read input from a file, one section at a time?

davi54 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to write a perl script which reads multiple fasta entries from one input file and give me the number of each amino acid for each entry. But for some reason, perl is reading all the contents of the file as one entry and is giving me the total number of each amino acids in the file as a whole. I am attaching my scripts and files below. Please help.

----------------------------------------------------------------------input file------------------------------------------------------------------

>sp|P09153|TFAE_ECOLI Prophage tail fiber assembly protein homolog Tfa
+E OS=Escherichia coli (strain K12) OX=83333 GN=tfaE PE=1 SV=2
MHKAILNSDLIATKAGDVTVYNYDGETREYISTSNEYLAVGVGIPACSCLDAPGTHKAGY
AICRSADFNSWEYVPDHRGETVYSTKTGESKEIKAPGDYPENTTTIAPLSPYDKWDGEKW
VTDTEAQHSAAVDAAEAQRQSLIDAAMASISLIQLKLQAGRKLTQAETTRLNAVLDYIDA
VTATDTSTAPDVIWPELPEA

>sp|P20605|FIC_ECOLI Probable protein adenylyltransferase Fic OS=Esche
+richia coli (strain K12) OX=83333 GN=fic PE=1 SV=1
MSDKFGEGRDPYLYPGLDIMRNRLNIRQQQRLEQAAYEMTALRAATIELGPLVRGLPHLR
TIHRQLYQDIFDWAGQLREVDIYQGDTPFCHFAYIEKEGNALMQDLEEEGYLVGLEKAKF
VERLAHYYCEINVLHPFRVGSGLAQRIFFEQLAIHAGYQLSWQGIEKEAWNQANQSGAMG
DLTALQMIFSKVVSEAGESE

>sp|P0ADH5|FIMB_ECOLI Type 1 fimbriae regulatory protein FimB OS=Esche
+richia coli (strain K12) OX=83333 GN=fimB PE=3 SV=1
MKNKADNKKRNFLTHSEIESLLKAANTGPHAARNYCLTLLCFIHGFRASEICRLRISDID
LKAKCIYIHRLKKGFSTTHPLLNKEVQALKNWLSIRTSYPHAESEWVFLSRKGNPLSRQQ
FYHIISTSGGNAGLSLEIHPHMLRHSCGFALANMGIDTRLIQDYLGHRNIRHTVWYTASN
AGRFYGIWDRARGRQRHAVL
[download]

---------------------------------------------------------------------------end of input file-------------------------------------------------------------------------

----------------------------------------------------------------------------------perl script---------------------------------------------------------------

print "This script will count the number of amino acids\n\n";
use strict;
use warnings;

#variables
my $A=0;
my $C=0;
my $D=0;
my $E=0;
my $F=0;
my $G=0;
my $H=0;
my $I=0;
my $K=0;
my $L=0;
my $M=0;
my $N=0;
my $P=0;
my $Q=0;
my $R=0;
my $S=0;
my $T=0;
my $V=0;
my $W=0;
my $Y=0;
my @prot;
my $prot_filename;
my $line;
my $sequence;
my $aa;
open (my $out_file, '>', 'aa_report.txt');

print "PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: ";
chomp($prot_filename=<STDIN>);

open(PROTFILE,$prot_filename) or die "unable to open the file";
@prot=<PROTFILE>;
close PROTFILE;


foreach $line (@prot) {

# discard blank line
if ($line =~ /^\s*$/) {
next;

# # discard comment line
} elsif($line =~ /^\s*#/) {
next;

# discard fasta header line
} elsif($line =~ /^>/) {
next;

# keep line, add to sequence string
} else {
$sequence .= $line;
}
}

# remove non-sequence data (in this case, whitespace) from $sequence s
+tring
$sequence =~ s/\s//g;
@prot=split("",$sequence); #splits string into an array
print " \nThe original PROTEIN file is:\n$sequence \n";

while(@prot){
$aa = shift (@prot);
if($aa =~/[A]/ig){
$A++;
}
if($aa=~/[C]/ig){
$C++;
}
if($aa=~/[D]/ig){
$D++;
}
if($aa=~/[E]/ig){
$E++;
}
if($aa=~/[F]/ig){
$F++;
}
if($aa=~/[G]/ig){
$G++;
}
if($aa=~/[H]/ig){
$H++;
}
if($aa=~/[I]/ig){
$I++;
}
if($aa=~/[K]/ig){
$K++;
}
if($aa=~/[L]/ig){
$L++;
}
if($aa=~/[M]/ig){
$M++;
}
if($aa=~/[N]/ig){
$N++;
}
if($aa=~/[P]/ig){
$P++;
}
if($aa=~/[Q]/ig){
$Q++;
}
if($aa=~/[R]/ig){
$R++;
}
if($aa=~/[S]/ig){
$S++;
}
if($aa=~/[T]/ig){
$T++;
}
if($aa=~/[V]/ig){
$V++;
}
if($aa=~/[W]/ig){
$W++;
}
if($aa=~/[Y]/ig){
$Y++;
}
}

print "\n";
print $out_file "A=$A C=$C D=$D E=$E F=$F G=$G H=$H I=$I K=$K L=$L M=$
+M N=$N P=$P Q=$Q R=$R S=$S T=$T V=$V W=$W Y=$Y ";
print "\n";
print "done";
[download]

---------------------------------------------------------------end of perl script----------------------------------------------------------------------------

------------------------------------------------------------------------output generated in aa_report.txt------------------------------------------------------------

A=61 C=10 D=31 E=40 F=18 G=41 H=23 I=39 K=28 L=57 M=11 N=24 P=21 Q=27 
+R=37 S=36 T=35 V=23 W=11 Y=27
[download]

-----------------------------------------------------------------------end of output generated in aa_report.txt------------------------------------------------------

I want it to read the three entries in the input file as 3 separate entries and generate separate output for them in the output file. I am guessing it will be a loop or something like that but I'm not sure. Please help.

Thanks in advance

2019-02-25 Athanasius added code tags

Comment on how to read input from a file, one section at a time?
Select or Download Code

Replies are listed 'Best First'.
Re: how to read input from a file, one section at a time? by hippo (Bishop) on Feb 21, 2019 at 22:33 UTC
Please help. You will likely receive substantially more help if you take the time to enclose each datafile and program in its own pair of `<code>` tags. See Markup in the Monastery for more help on that. Even without that, code which looks like this sets alarm bells ringing: `if($aa =~/A/ig){ $A++; } if($aa=~/C/ig){ $C++; } if($aa=~/D/ig){ $D++; + } if($aa=~/E/ig){ $E++; } if($aa=~/F/ig){ $F++; } if($aa=~/G/ig){ $G +++; } if($aa=~/H/ig){ $H++; } if($aa=~/I/ig){ $I++; } if($aa=~/K/ig){ + $K++; } if($aa=~/L/ig){ $L++; } if($aa=~/M/ig){ $M++; } if($aa=~/N/i +g){ $N++; } if($aa=~/P/ig){ $P++; } if($aa=~/Q/ig){ $Q++; } if($aa=~/ +R/ig){ $R++; } if($aa=~/S/ig){ $S++; } if($aa=~/T/ig){ $T++; } if($aa +=~/V/ig){ $V++; } if($aa=~/W/ig){ $W++; } if($aa=~/Y/ig){ $Y++; }` [download] Why not use a hash instead?	[reply] [d/l] [select]
Re: how to read input from a file, one section at a time? by jwkrahn (Abbot) on Feb 21, 2019 at 22:49 UTC
This may work: (UNTESTED) open my $out_file, '>', 'aa_report.txt' or die "Cannot open 'aa_report +.txt' because: $!"; print 'PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: '; chomp( my $prot_filename = <STDIN> ); open my $PROTFILE, '<', $prot_filename or die "Cannot open '$prot_file +name' because: $!"; $/ = ''; # Set paragraph mode while ( my $para = <$PROTFILE> ) { # Remove fasta header line $para =~ s/^>.//m; # Remove comment line(s) $para =~ s/^\s#.*//mg; my %prot; $para =~ s/([A-Z])/ ++$prot{ $1 } /eg; print $out_file join( ' ', map "$_=$prot{$_}", sort keys %prot ), +"\n"; } [download]	[reply] [d/l]
Re^2: how to read input from a file, one section at a time? by davi54 (Sexton) on Feb 21, 2019 at 23:38 UTC
Hi, thank you so much. It works. I have a following question. Once I have the output in the output file, I want to read the number of alphabets present in the output file for each sequence. Example: for the output: "A=27 C=3 D=16 E=14 F=1 G=11 H=4 I=12 K=10 L=13 M=2 N=6 P=10 Q=6 R=6 S=14 T=20 V=11 W=4 Y=10" I want it to count the alphabets in each result and return how many alphabets are present. In the above example, 20 alphabets are present. So I want my final output to look something like: number of alphabets = 20. How can I do it?	[reply]
Re^3: how to read input from a file, one section at a time? by Cristoforo (Curate) on Feb 22, 2019 at 00:48 UTC
The line below will give you those counts. `print $out_file "Number of proteins = ", scalar keys %prot, "\n";` [download]	[reply] [d/l]
Re^4: how to read input from a file, one section at a time? by davi54 (Sexton) on Feb 22, 2019 at 21:10 UTC
Re^5: how to read input from a file, one section at a time? by poj (Abbot) on Feb 23, 2019 at 11:05 UTC
Some notes below your chosen depth have not been shown here
Re^3: how to read input from a file, one section at a time? by BillKSmith (Monsignor) on Feb 22, 2019 at 03:06 UTC
`>type davi54.pl use strict; use warnings; my $string = "A=27 C=3 D=16 E=14 " . "F=1 G=11 H=4 I=12 " . "K=10 L=13 M=2 N=6 " . "P=10 Q=6 R=6 S=14 " . "T=20 V=11 W=4 Y=10" ; print $string =~ tr/A-Z/A-Z/; >perl davi54.pl 20` [download] Bill	[reply] [d/l]
Re: how to read input from a file, one section at a time? by davi54 (Sexton) on Mar 28, 2019 at 16:43 UTC
Hello everyone, In continuation to my previous question, I now want to count how many times a variable is absent. Ex: if in any given file multiple entries don't have a W, I want the script to give me the output with the number of entries that don't have a W and similarly for other alphabets. How can that be done?	[reply]
Re^2: how to read input from a file, one section at a time? by marto (Cardinal) on Mar 29, 2019 at 07:09 UTC
You could show us what you tried, and tell us how it failed. Consider if you could adapt the existing code you've been given to achieve this. Tutorials->Getting Started with Perl, https://learn.perl.org.	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks