PerlMonks |
I currently load a FASTA formatted file, which contains DNA sequences separated by headers denoted by '>', i.e.:

>header1
ATCGGATC...
>header2
GCTAGCTA...
via BioPerl (Bio::SeqIO), storing the entire FASTA file in memory as a hash keyed by header.
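The hash-building step described above can also be done without Bio::SeqIO; the manual route mentioned below is likely quicker. A minimal sketch, assuming each header is a unique, whitespace-delimited ID (the `load_fasta` name and the assumption that wrapped sequence lines should be concatenated are mine):

```perl
use strict;
use warnings;

# Manual FASTA loader: slurps every record into a hash keyed by header.
# Setting the input record separator to "\n>" makes each read return
# one complete FASTA record instead of one line.
sub load_fasta {
    my ($fh) = @_;
    my %seq;
    local $/ = "\n>";                       # one FASTA record per read
    while ( my $record = <$fh> ) {
        chomp $record;                      # strip the trailing "\n>"
        $record =~ s/^>//;                  # strip '>' on the first record
        my ( $header, @lines ) = split /\n/, $record;
        ($header) = split /\s+/, $header;   # keep only the ID part
        $seq{$header} = join '', @lines;    # join wrapped sequence lines
    }
    return \%seq;
}
```

The same `local $/ = "\n>"` trick is what makes a streaming, one-record-at-a-time reader possible later, since each `<$fh>` read naturally stops at the next header.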
Alternatively I could code it manually, which is likely quicker given the amount of overhead in Bio::SeqIO.

Based on a second input file ($IN2), I then subsample the FASTA sequences and do some processing. Each line of $IN2 includes the header of its associated FASTA sequence. This works fine, and on modern computers the memory usage isn't too much of a problem even for larger genomes (human etc.).

Nevertheless, I'd like to take advantage of the fact that $IN2 is always sorted by header, so that, for example, the first 5000 lines of it only need the FASTA sequence from >header1. This could save me having to load the entire FASTA file into memory; instead I would only load the sequence currently being used. I'm having a tough time figuring out how to actually do this, so any suggestions or code would be greatly appreciated.

In reply to Loading only portions of a file at any given time by TJCooper
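One way to sketch the "only load the sequence currently being used" idea is a lazy iterator that returns one (header, sequence) pair per call; because $IN2 is sorted by header, the caller only advances the iterator when $IN2 moves on to a new header. This is a sketch under my own assumptions (the `fasta_iterator` name is mine, and it assumes headers appear in the FASTA file in the same order as in $IN2):

```perl
use strict;
use warnings;

# Returns a closure that yields one (header, sequence) pair per call,
# so only the current record is ever held in memory.
sub fasta_iterator {
    my ($fh) = @_;
    return sub {
        local $/ = "\n>";                   # one FASTA record per read
        my $record = <$fh>;
        return unless defined $record;      # empty list at EOF
        chomp $record;
        $record =~ s/^>//;
        my ( $header, @lines ) = split /\n/, $record;
        ($header) = split /\s+/, $header;   # keep only the ID part
        return ( $header, join '', @lines );
    };
}
```

In the main loop over $IN2, the idea would then be: extract the header from the current $IN2 line (however that file is formatted), and while it differs from the header currently held, call the iterator again to discard the old sequence and load the next one. Sorted input guarantees each FASTA record is read at most once and never needed again after the loop moves past it.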