http://qs321.pair.com?node_id=1230426



Hi, Thank you. It works perfectly.

One last thing I would like my script to do is to look at those "number of proteins" values and tell me which entry in the original input file has the smallest number of proteins. So the output should look like the following:

A=11 D=5 E=12 F=1 G=5 I=6 K=3 L=7 M=2 N=4 P=2 Q=9 R=10 T=4 V=10

Number of proteins = 15

Entry ">sp|Q2M7X4|YICS_ECOLI Uncharacterized protein YicS OS=Escherichia coli (strain K12) OX=83333 GN=yicS PE=4 SV=1" has the least number of proteins


Re^5: how to read input from a file, one section at a time?
by poj (Abbot) on Feb 23, 2019 at 11:05 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $report_name = 'aa_report.txt';
    open my $out_file, '>', $report_name
      or die "Cannot open '$report_name' because: $!";

    print 'PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: ';
    chomp( my $prot_filename = <STDIN> );

    open my $PROTFILE, '<', $prot_filename
      or die "Cannot open '$prot_filename' because: $!";

    $/ = '';    # Set paragraph mode
    my @count = ();
    my $name;
    while ( my $para = <$PROTFILE> ) {

        # Remove fasta header line
        if ( $para =~ s/^>(.*)//m ) {
            $name = $1;
        }

        # Remove comment line(s)
        $para =~ s/^\s*#.*//mg;

        my %prot;
        $para =~ s/([A-Z])/ ++$prot{ $1 } /eg;
        my $num = scalar keys %prot;
        push @count, [ $num, $name ];
        printf "Counted %d for %s ..\n", $num, substr( $name, 0, 50 );

        print $out_file "$name\n";
        print $out_file join( ' ', map "$_=$prot{$_}", sort keys %prot ), "\n";
        printf $out_file "Number of proteins = %d\n\n", $num;
    }

    # sort names by count in ascending order to get lowest
    my @sorted = sort { $a->[0] <=> $b->[0] } @count;
    my $lowest = $sorted[0]->[0];

    # maybe more than 1 lowest
    printf $out_file "Least number of proteins is %d in these entries\n", $lowest;
    my @lowest = grep { $_->[0] == $lowest } @sorted;
    print $out_file "$_->[1]\n" for @lowest;

    # show all results
    print $out_file "\nAll results in ascending count\n";
    for (@sorted) {
        printf $out_file "%d %s\n", @$_;
    }

    close $out_file;
    print "Results in $report_name\n";
    poj
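The tie-handling step above (find the lowest count, then list every entry that shares it) can also be tried in isolation. The following is only a sketch under stated assumptions: the helper name lowest_entries and the sample [count, name] pairs are made up for illustration, and it uses min from the core List::Util module in place of the full sort used in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of the script above): given a list of
# [count, name] pairs, return the lowest count followed by the names
# of every entry that has that count.
sub lowest_entries {
    my @count = @_;

    # Smallest count across all entries
    my $lowest = min( map { $_->[0] } @count );

    # Keep every entry tied for the lowest count, in input order
    my @names = map { $_->[1] } grep { $_->[0] == $lowest } @count;

    return ( $lowest, @names );
}

# Made-up sample data, for illustration only
my @count = ( [ 15, 'entry_a' ], [ 18, 'entry_b' ], [ 15, 'entry_c' ] );

my ( $lowest, @names ) = lowest_entries(@count);
printf "Least number of proteins is %d in these entries\n", $lowest;
print "$_\n" for @names;
```

With this sample data both entry_a and entry_c are reported, since they share the count 15. Sorting, as in the script, is still the better fit when the full ascending list is wanted as well.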
      Thank you so much. This is exactly what I was looking for. I really appreciate your help.
      In the script above, how can I make it print the length of the sequence that is being read? After the line  printf $out_file "Number of proteins = %d\n\n",$num ; I tried  printf $out_file "string length = length($num) ; but nothing happens. What am I doing wrong?

        You need to provide a value to printf, for example

        printf $out_file "string length = %d\n",length($num) ;
        But that gives you the length of the count value, not of the sequence. You need to calculate the sequence length before $para is changed by the counting regex $para =~ s/([A-Z])/ ++$prot{ $1 } /eg;

        Try making these changes

        # Remove comment line(s) and white space
        $para =~ s/^\s*#.*//mg;
        $para =~ s/\s//g;                  # add
        my $seq_length = length($para);    # add
        print "[$para]\n";                 # optional
        .
        .
        printf $out_file "Number of proteins = %d\n", $num;
        printf $out_file "String length = %d\n\n", $seq_length;    # add
        poj
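To see why the order matters, here is a small self-contained sketch of the same idea for a single entry. The helper name summarize_entry and the sample entry are assumptions made for illustration; the three substitutions and the counting regex are the ones from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the sequence
# length and the number of distinct letters for one FASTA-style paragraph.
sub summarize_entry {
    my ($para) = @_;

    $para =~ s/^>.*//m;        # remove the fasta header line
    $para =~ s/^\s*#.*//mg;    # remove comment line(s)
    $para =~ s/\s//g;          # remove newlines and other white space

    # Measure the length now, while $para holds only the sequence letters
    my $seq_length = length($para);

    # The counting regex replaces each letter with its running count,
    # so length($para) is no longer the sequence length after this line
    my %prot;
    $para =~ s/([A-Z])/ ++$prot{$1} /eg;

    return ( $seq_length, scalar keys %prot );
}

# Made-up sample entry, for illustration only
my $entry = ">sample header\nMKKA\nLLA\n";

my ( $len, $num ) = summarize_entry($entry);
printf "Number of proteins = %d\n", $num;
printf "String length = %d\n",      $len;
```

For the sample entry the sequence is MKKALLA, so this prints a count of 4 distinct letters and a string length of 7.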
      Hello Poj, continuing from my previous question, I now want to count how many entries are missing a given letter. For example, if multiple entries in a file don't have a W, I want the script to output the number of entries that don't have a W, and similarly for the other letters. So, for instance, if 20 entries out of 100 in a file don't have a W, the output should be W=20. How can that be done?

        Declare a hash to hold the counts before the loop

        my %absent=();

        Count those missing inside the loop

        for ('A'..'Z'){
            ++$absent{$_} unless exists $prot{$_};
        }

        Print the results after the loop

        # print absent counts
        for (sort keys %absent){
            printf "%s=%d\n",$_,$absent{$_};
        }
        poj
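The three pieces above can be tried together on a few in-memory sequences before wiring them into the main script. This is a minimal sketch, assuming the sequences have already had headers, comments and white space removed; the helper name count_absent and the sample sequences are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of cleaned sequences and returns a
# hash reference mapping each letter to the number of sequences that
# do not contain it.
sub count_absent {
    my @sequences = @_;
    my %absent;

    for my $seq (@sequences) {

        # Per-entry letter counts, like %prot in the main loop
        my %prot;
        $prot{$_}++ for $seq =~ /[A-Z]/g;

        # Count this entry as missing every letter it does not contain
        for my $letter ( 'A' .. 'Z' ) {
            ++$absent{$letter} unless exists $prot{$letter};
        }
    }
    return \%absent;
}

# Made-up sample sequences, for illustration only
my $absent = count_absent( 'MKWA', 'MKA', 'AAL' );

# print absent counts
printf "%s=%d\n", $_, $absent->{$_} for sort keys %$absent;
```

Two things to keep in mind with this approach. A letter that occurs in every entry never gets a key in %absent, so it is simply left out of the output rather than printed as 0 (A in the sample above). And because the loop runs over all of 'A'..'Z', letters that are not among the 20 standard amino-acid codes (B, J, O, U, X, Z) will usually be reported as absent from most or all entries; restricting the list of letters is one way to avoid that, if it matters for your data.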