Re^7: how to read input from a file, one section at a time?

in reply to Re^6: how to read input from a file, one section at a time?
in thread how to read input from a file, one section at a time?

Declare a hash to hold the counts before the loop

my %absent=();

Count those missing inside the loop

for ('A'..'Z'){
  ++$absent{$_} unless exists $prot{$_};
}
[download]

print the results after the loop

# print absent counts
for (sort keys %absent){
  printf "%s=%d\n",$_,$absent{$_};
};
[download]

#!/usr/bin/perl
use strict;
use warnings;

my $report_name = 'aa_report.txt';
open my $out_file, '>', $report_name 
     or die "Cannot open '$report_name' because: $!";

print 'PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: ';
chomp( my $prot_filename = <STDIN> );

open my $PROTFILE, '<', $prot_filename 
  or die "Cannot open '$prot_filename' because: $!";

$/ = ''; # Set paragraph mode

my @count=();
my %absent=();
my $name;
while ( my $para = <$PROTFILE> ) {
    # Remove fasta header line
    if ( $para =~ s/^>(.*)//m ){
      $name = $1;
    };
    # Remove comment line(s)
    $para =~ s/^\s*#.*//mg;

    my %prot;
    $para =~ s/([A-Z])/ ++$prot{ $1 } /eg;
    
    my $num = scalar keys %prot;
    push @count,[$num,$name];
    printf "Counted %d for %s ..\n",$num,substr($name,0,50);
    
    print $out_file "$name\n";
    print $out_file join( ' ', map "$_=$prot{$_}", sort keys %prot ), 
+"\n";
    printf $out_file "Number of proteins = %d\n\n",$num ;
    
    # count absent
    for ('A'..'Z'){
      ++$absent{$_} unless exists $prot{$_};
    };
     
};

# sort names by count in ascending order to get lowest
my @sorted = sort { $a->[0] <=> $b->[0] } @count;
my $lowest = $sorted[0]->[0];

# maybe more than 1 lowest
printf $out_file "Least number of proteins is %d in these entries\n",$
+lowest;
my @lowest = grep { $_->[0] == $lowest } @sorted;
print $out_file "$_->[1]\n" for @lowest;

# show all results
print $out_file "\nAll results in ascending count\n";
for (@sorted){
  printf $out_file "%d  %s\n",@$_;
};
close $out_file;
print "Results in $report_name\n";

# print absent counts
for (sort keys %absent){
  printf "%s=%d\n",$_,$absent{$_};
};
[download]

poj

Comment on Re^7: how to read input from a file, one section at a time? Select or Download Code

Replies are listed 'Best First'.
Re^8: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 01, 2019 at 21:59 UTC
Hi Poj, Thanks again for your prompt help. I really appreciate it. The script works perfect. Although I have a small issue. Actually my input file has multiple duplicate entries. Is there any way to get rid of duplicate entries from the file before starting with the actual analysis that this script does? I was thinking if there is a way to compare the fasta headers before getting rid of them to check if there are duplicate entries? It can be a separate script (which can be run before this one) or can be a part of this script. Again, thank you so much for your help and time.	[reply]
Re^9: how to read input from a file, one section at a time? by AnomalousMonk (Archbishop) on Apr 01, 2019 at 23:31 UTC
From poj's code: `my $name; while ( my $para = <$PROTFILE> ) { # Remove fasta header line if ( $para =~ s/^>(.)//m ){ $name = $1; }; ... }` [download] A quick and dirty and UNTESTED* modification to do what I think you want: `my $name; my %name_seen; # fasta headers seen so far FASTA_RECORD: while ( my $para = <$PROTFILE> ) { # Remove fasta header line if ( $para =~ s/^>(.)//m ){ $name = $1; next FASTA_RECORD if $name_seen{ $name }++; }; ... }` [download] Warning:* The requirement to "... get rid of duplicate entries ..." is ambiguous. If there is more than one entry with the same header (i.e., `$name`), which is (or are, if there are more than two) the duplicate(s)? The first one? The last one? Etc. The code modification above ignores all entries with a given `$name` after the first one. Also, it might be wise to trim all leading/trailing whitespace from `$name` before any further processing whatsoever (also untested): `$name = $1;` `$name =~ s{ \A \s+ \| \s+ \z }{}xmsg;` Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^10: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 15:32 UTC
Hi, My apologies for not being clear. Just to let you know, multiple proteins can have different header sequences but identical sequence information. When I say duplicate entries, I mean the actual sequence (and not the header). I want the script to read the input file and identify if there are more than one entries with the same sequence information and print them. Does that help? Again, sorry for the confusion and thank you for your help.	[reply]
Re^11: how to read input from a file, one section at a time? by poj (Abbot) on Apr 02, 2019 at 15:43 UTC
Re^12: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 16:02 UTC
Some notes below your chosen depth have not been shown here
Re^8: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 21:30 UTC
Hi Poj, I'll try not to ask any further questions. Actually, after I ran the cleanup script, I tried to run the above count script (for counting existing and missing alphabets) that we discussed earlier. But now it has started giving me errors such as: Use of uninitialized value in printf at present_and_absent.pl line 58, <$PROTFILE> chunk 1426. Use of uninitialized value $name in printf at present_and_absent.pl line 33, <$PROTFILE> chunk 1178. Use of uninitialized value $name in concatenation (.) or string at present_and_absent.pl line 35, <$PROTFILE> chunk 1178. Can you please help me with this?	[reply]
Re^9: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 22:12 UTC
Nevermind, I figured it out. There is no ">" in the cleaned headers which is why the script is getting errors. Thanks again!	[reply]

In Section Seekers of Perl Wisdom