Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re: Extract multiple lists od Identifiers from a FASTA file

by kcott (Archbishop)
on Aug 04, 2022 at 23:24 UTC ( [id://11145959]=note: print w/replies, xml ) Need Help??


in reply to Extract multiple lists od Identifiers from a FASTA file

G'day joluito,

Welcome to the Monastery.

Yes, some dummy input and output files would have probably been useful. You should keep these small but still representative of your real data.

I see a variety of comments on your code. I won't revisit these but I will make two points:

  • When you change a special variable, such as $/, you should localise it and restrict the change to the smallest possible scope. See perlvar and local.
  • Avoid reading all data into an array; then reading it all again from that array. There are occasions when this is useful, but generally it isn't: this is particularly true with biological data which, as I'm sure you're aware, can be huge. Reading and processing data one piece at a time with, for example, while and for loops, will reduce CPU cycles, memory requirements, and the overall time your program takes to run.

The following code, fasta_munge.pl, shows how I might have handled your requirements.

#!/usr/bin/env perl use strict; use warnings; use autodie; my $dir = '.'; opendir(my $dh, $dir); for (readdir $dh) { next unless /^(.+?)\.fasta/; my ($id_path, $fasta_path) = ("$dir/$1.txt", "$dir/$_"); next unless -e $id_path; my %ids = map +($_ => 1), split ' ', do { local $/; open my $fh, '<', $id_path; <$fh>; }; { open my $fh, '<', $fasta_path; local $/ = '>'; while (<$fh>) { chomp; if (/\A[^|]+\|([^)]+)/) { print { _get_out_fh($dir, $1) } "$/$_" if $ids{$1}; } } } } _close_all_out_fhs(); closedir $dh; { my %out_fh_for; sub _get_out_fh { my ($dir, $id) = @_; unless (exists $out_fh_for{$id}) { open $out_fh_for{$id}, '>', "$dir/$id.fasta"; } return $out_fh_for{$id}; } sub _close_all_out_fhs { close $_ for values %out_fh_for; return; } }

I put together some dummy data for testing. Here's a full run with all the details.

ken@titan ~/tmp/pm_11145943 $ ls -l total 8 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt ken@titan ~/tmp/pm_11145943 $ cat one.fasta >one:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >one:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat one.txt WP_002208793 NP_490509 ken@titan ~/tmp/pm_11145943 $ cat two.fasta >two:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000032(gb|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat two.txt WP_002208792 NP_490509 ken@titan ~/tmp/pm_11145943 $ ./fasta_munge.pl ken@titan ~/tmp/pm_11145943 $ ls -l total 11 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 224 Aug 5 08:31 NP_490509.fasta -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208792.fasta -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208793.fasta ken@titan ~/tmp/pm_11145943 $ cat NP_490509.fasta >one:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV >two:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat WP_002208792.fasta >two:VFG000032(gb|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $ cat WP_002208793.fasta >one:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $

— Ken

Replies are listed 'Best First'.
Re^2: Extract multiple lists od Identifiers from a FASTA file
by joluito (Novice) on Aug 05, 2022 at 07:32 UTC

    Hi Ken,

    Thank you very much for your help!

    I totally agree with you! But in this case, since this particular data is not too big to handle easily, I didn't care very much about computational requirements... my fault.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11145959]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2024-04-25 07:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found