http://qs321.pair.com?node_id=960496

anonym has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a multifasta file and I have used an awk command in my Perl script to split it into separate fasta files in the working directory. However, I have a question about reading all those files and storing each sequence in a hash.

my $genome_file = $ARGV[0];
open(MULTIFASTA, "<", $genome_file) or die "Can't open $genome_file!\n";
my $header;
my $sequence;
$/ = "\n";
my %hash;
# run awk to split the multifasta into numbered files (1.fasta, 2.fasta, ...) in the current directory
my @fasta = `awk '/^>/{f=++d".fasta"} {print > f}' $genome_file`;

Then I want to read each individual fasta file generated in my directory and store the sequence in a hash. I really appreciate any help. Thanks

Replies are listed 'Best First'.
Re: reading files in directory
by GrandFather (Saint) on Mar 20, 2012 at 00:09 UTC

    Why split the file with awk if you are then going to process the data in memory anyway? Read and process the original file, perhaps writing the parts to new files along the way if they are needed for some other purpose.
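
    For example, a minimal sketch of that direct approach (the variable and hash names here are illustrative, not taken from the original post) could look like this:

    use strict;
    use warnings;

    my $genome_file = $ARGV[0] or die "Usage: $0 multifasta_file\n";
    open my $in, '<', $genome_file or die "Can't open $genome_file: $!\n";

    my %sequence_for;    # header (without the '>') => concatenated sequence
    my $current;
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {
            $current = $1;                     # start a new record
            $sequence_for{$current} = '';
        }
        elsif (defined $current) {
            $sequence_for{$current} .= $line;  # sequence may span several lines
        }
    }
    close $in;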

    See readdir for the Perl tool used to scan directory contents.
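
    If you do want to read the split files instead, a rough readdir sketch (the current directory and the .fasta suffix are assumptions taken from the awk command in the question) looks like this:

    use strict;
    use warnings;

    # collect the .fasta pieces written by the awk step
    opendir my $dh, '.' or die "Can't open current directory: $!\n";
    my @fasta_files = grep { /\.fasta$/ && -f } readdir $dh;
    closedir $dh;

    for my $file (@fasta_files) {
        open my $fh, '<', $file or die "Can't open $file: $!\n";
        # ... build the hash from $fh here ...
        close $fh;
    }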

    There's not much help we can give in reply to "I want to store data in a hash" because it simply doesn't tell us enough about why you want to do that, what you want to store, or how you want to access it.

    True laziness is hard work
Re: reading files in directory
by Riales (Hermit) on Mar 19, 2012 at 23:51 UTC

    Which part are you having trouble with, exactly? Opening the fasta files for reading? Parsing the files? Using a hash?

    What does a 'fasta' file look like? You say you want to store a sequence in a hash. What would you use as a key?

      The file format is

      >chr1
      NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGCNNNNNNNNNNNNNNNNGCAT........
      >chr2
      ATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.....
      EOF

      I have split this multifasta into 25 pieces, which are present in my current working directory. However, I am confused as to how I can read all these fasta files and store the header line as the hash key and the sequence as the value. Thanks

        Well, the first thing you'll need to do is tell Perl what the relevant files are. The simplest way to do that is to pass it in as a command-line argument. Look for @ARGV in perldoc perlvar for that.

        If the files are all in the same directory and there are no other files in the directory (and you don't want to type out 25 separate filenames), you could use a loop on STDIN to receive the filenames piped into your script:

        use Data::Dumper;

        my @filenames;
        push @filenames, $_ while (<STDIN>);
        chomp @filenames;
        print Dumper(@filenames);

        That way, you can get all your filenames into your script by simply doing:

        ls | perl my_script.pl

        As for actually opening each file for reading, you've already done that. What's confusing about reading the files? Have you checked perldoc -f open?

        As for hashes, they're pretty simple:

        my %hash;                                  # Declare your hash.
        $hash{$key} = $sequence;                   # Assign $sequence to the hash with key $key.
        my $sequence_im_looking_for = $hash{$key}; # Get the sequence by looking it up with its key $key.

        It looks like you'll be alternating lines; line1 is the key for the sequence in line2, line3 is the key for the sequence in line4, etc.
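
        Under that assumption (each record is exactly one header line followed by one sequence line; if a sequence wraps over several lines you would concatenate them instead), a sketch of the whole loop might be:

        use strict;
        use warnings;

        my %hash;
        for my $file (@ARGV) {                 # or the @filenames gathered from STDIN above
            open my $fh, '<', $file or die "Can't open $file: $!\n";
            while (my $header = <$fh>) {
                my $sequence = <$fh>;          # the very next line is the sequence
                last unless defined $sequence; # guard against a stray trailing header
                chomp($header, $sequence);
                $hash{$header} = $sequence;    # e.g. '>chr1' => 'NN...ATGC...'
            }
            close $fh;
        }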

Re: reading files in directory
by Sinistral (Monsignor) on Mar 20, 2012 at 13:44 UTC

    Don't reinvent the wheel.

    Perl seems to be the language of choice in the biological sciences, especially genetics. I don't know if you've seen it, but make sure you've checked out the many well-thought-out solutions available at the BioPerl site. In particular, I found a reference to the FASTA format.
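
    For instance, a short Bio::SeqIO sketch (assuming BioPerl is installed; the file name genome.fasta is only an example) that builds the header-to-sequence hash straight from the multifasta file:

    use strict;
    use warnings;
    use Bio::SeqIO;    # requires BioPerl

    my %hash;
    my $io = Bio::SeqIO->new(-file => 'genome.fasta', -format => 'fasta');
    while (my $seq = $io->next_seq) {
        $hash{ $seq->id } = $seq->seq;    # e.g. 'chr1' => 'NN...ATGC...'
    }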