Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Reading file into a hash

by PerlSufi (Friar)
on May 28, 2014 at 14:47 UTC ( [id://1087659]=perlquestion: print w/replies, xml ) Need Help??

PerlSufi has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: Solved and removed if(defined $seq) line
Hello monks, I have been trying a bio perl challenge site- so meta hints are welcome, too. I am having trouble reading a file into a hash properly. My data will be a fasta file in the format of:
>sequence_5849 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG >sequence_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC >sequence_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT
my code is:
#! /usr/bin/perl use strict; use warnings; use Data::Dumper; use feature qw(say); my $file = 'file.txt'; open (my $fh, '<', $file) or die "Could not open file '$file' $!"; my (%sequence_hash, $header, $seq, $count); while ( my $line = <$fh> ) { chomp($line); if ( $line =~ m/^>(.*)/ ) { if ( $seq ) { say $seq; $sequence_hash{$header} = $seq; } $header = $1; $seq = ''; } else { $seq .= $line; } } close $fh; print Dumper(\%sequence_hash);
My problem is that I am not getting the header and sequence. I have a feeling that its because of clearing out the $seq variable, but I am not sure how else to get the header and sequences in the hash. Any insight would be highly appreciated :)

Replies are listed 'Best First'.
Re: Reading file into a hash
by davido (Cardinal) on May 28, 2014 at 16:56 UTC

    Your records always start with a '>'. But you don't have to look at that character as the start of a record, you can look at it as a record separator. Then if you set $/ = '>';, your file reads will not read lines, but records. That (in my mind) simplifies the process. Here's one way to do it:

    use strict; use warnings; use Data::Dumper; local $/ = '>'; my %hash; while( <DATA> ) { if( my($k,$v) = / ^(\N+)\n # Capture the key. ([CTGA]+)$ # Capture the value. /mx ) { $hash{$k}=$v; } } print Dumper \%hash; __DATA__ >sequence_5849 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAAT +AATTCTGAGG >sequence_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTT +GTCAGCAGACACGC >sequence_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTG +CGGGCAGTAGGTGGAAT

    Update: My solution assumes that there aren't newlines embedded in the CTG.... sequences. If there are, your regex should probably accept them, and then use tr/// to filter them out again. Also, since you're modifying the input record separator, it's wise to constrain its effects to as narrow a scope as possible. The following does that:

    print Dumper sub { local $/ = '>'; my( $fh, %recs ) = shift; while( <$fh> ) { if( my($k,$v) = / ^(\N+)\n # Capture the key. ([CTGA]+)$ # Capture the value. /mx ) { $recs{$k}=$v; } } return \%recs; }->(\*DATA);

    Dave

      Very Nice, davido. I have not used $/ much so I will keep that in mind for future reference :)
Re: Reading file into a hash
by hippo (Bishop) on May 28, 2014 at 14:51 UTC

    I think that the problem is that you are setting the value of $seq to be '' but then testing it via defined, which (other than the first time through) it will be. Change your test from if ( defined $seq ) to just if ($seq) and see what difference that makes.

      Hi hippo,
      Thanks for the response. I changed it and still did not get the last header and sequence
        Thats a very common problem, you are trying to add a record only when the successor is to be parsed, but the last record has no successor (sic ;)!

        Most people try to solve by repeating code to add the last record after the loop.

        But it's much cleaner this way (avoiding a posteriori state logic)

        use strict; use warnings; use Data::Dump; my $header; my %sequence; while ( my $line = <DATA> ){ chomp $line; if ( $line =~ /^>(.*)/ ) { $header = $1; } else { $sequence{$header} .= $line; } } dd \%sequence; __DATA__ >sequence_5849 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG >sequence_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC >sequence_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT
        output
        { sequence_0808 => "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGAC +CAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT", sequence_5849 => "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTC +CGGCCTTCCCTCCCACTAATAATTCTGAGG", sequence_5959 => "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCG +CCGAAGGTCTATATCCATTTGTCAGCAGACACGC", }

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        update

        a *general pattern* to solve such problems while staying DRY is to use references

        if ( $line =~ /^(HEAD_PATTERN)/ ) { $data = \ $deeply{nested}{structure}{$1}; # reference data + } else { $$data .= $line; # derefrence data }

        like this you don't need to repeat the path of a deeply nested data structure, which might vary in multiple dimensions

        update
        added some explanation
Re: Reading file into a hash
by LanX (Saint) on May 28, 2014 at 14:53 UTC
    at first glance¹ this doesn't seem to make much sense:

    if ( defined $seq ) { # ... $seq = '';

    empty strings are defined, you may either want to assign undef or just use that '' is false.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    ¹) that is w/o trying to understand the full intended logic

      Hi Lanx,
      I tried that, too. I also updated my post to reflect not using  if ( defined $seq) anymore

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1087659]
Front-paged by GotToBTru
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (3)
As of 2024-04-19 17:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found