Dear all,
I am trying to split the records in a file into two separate files, depending on whether the header line contains "GENEID" or "PROTID".
Input file:
>data_1 GENEID_8 1_exons 87028 - 87375 348 bp, chain -
ATGCCCAAATTAGTCAACATATTGATCACTACGGAGGAAATCTTGAAGAGTTCAAGGGGC
TGTCCATTTTACTTGAAGAGCCTAAAGATCAAAAAGGGTGATAATAAATCTTTAGAAGAT
ATGCTCATAATTGAATCTAACCTTACGATTTCTTCTACTTCTAATTGA
>data_1 PROTID_8 1_exons 87028 - 87375 115 aa, chain -
KLVNILITTEEILKSSRGIVLTVEQTSSIKRKFGWKKKKVKSAKKQKRESKPKKDGPK
AAEAKGKYFHYDADGHWRRNCPFYLKSLKIKKGDNKSLEDMLIIESNLTISSTSN
>data_2 GENEID_12 2_exons 121021 - 121590 486 bp, chain -
ATGTGGCACAACCGCCTAGGCCACATGGGTGACAAGGGGCTGAGGGAGTTGAGCAGGAGA
AGACACTTCTCAGTTAAGGGGACTCCACAGCAGAATGGGATGGCCGAGAGGATGAATAGA
ACACTTTTGGAAAAAGGCTCGATGCATGAGGCTGTAGGCAGAGCTTCCAAAGGCATTCTG
GGTTGA
>data_2 PROTID_12 2_exons 121021 - 121590 161 aa, chain -
LVHTDIYFMREKSEVFTKFKIWRAEVEKEQGRSVKCLRSDNGREYTSREFQDYCEECGIR
RHFSVKGTPQQNGMAERMNRTLLEKGSMHEAVGRASKGILG
Program written so far:
#!/usr/bin/perl
open(OUT1, ">GENEID.out") or die "can not create new file";
open(OUT2, ">PROTID.out") or die "can not create new file";
open(FILE, "input.txt") or die "can not open file";
while ($line = <FILE>) {
    $hit1 = $line =~ /^(>data_\d+\s+GENEID_\d+.*\n.*)/s;
    print OUT1 "$hit1\n";
    $hit2 = $line =~ /^(>data_\d+\s+PROTID_\d+.*\n.*)/s;
    print OUT2 "$hit2\n";
}
Desired output:
File GENEID.out:
>data_1 GENEID_8 1_exons 87028 - 87375 348 bp, chain -
ATGCCCAAATTAGTCAACATATTGATCACTACGGAGGAAATCTTGAAGAGTTCAAGGGGC
TGTCCATTTTACTTGAAGAGCCTAAAGATCAAAAAGGGTGATAATAAATCTTTAGAAGAT
ATGCTCATAATTGAATCTAACCTTACGATTTCTTCTACTTCTAATTGA
>data_2 GENEID_12 2_exons 121021 - 121590 486 bp, chain -
ATGTGGCACAACCGCCTAGGCCACATGGGTGACAAGGGGCTGAGGGAGTTGAGCAGGAGA
AGACACTTCTCAGTTAAGGGGACTCCACAGCAGAATGGGATGGCCGAGAGGATGAATAGA
ACACTTTTGGAAAAAGGCTCGATGCATGAGGCTGTAGGCAGAGCTTCCAAAGGCATTCTG
GGTTGA
File PROTID.out:
>data_1 PROTID_8 1_exons 87028 - 87375 115 aa, chain -
KLVNILITTEEILKSSRGIVLTVEQTSSIKRKFGWKKKKVKSAKKQKRESKPKKDGPK
AAEAKGKYFHYDADGHWRRNCPFYLKSLKIKKGDNKSLEDMLIIESNLTISSTSN
>data_2 PROTID_12 2_exons 121021 - 121590 161 aa, chain -
LVHTDIYFMREKSEVFTKFKIWRAEVEKEQGRSVKCLRSDNGREYTSREFQDYCEECGIR
RHFSVKGTPQQNGMAERMNRTLLEKGSMHEAVGRASKGILG
My output contains only the header lines (the lines that start with >) and not the sequence lines that follow them. Can anyone please point out which line of my code is wrong? Thank you.
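For reference, two things in the program above look like likely causes. First, `<FILE>` returns one line per iteration, so `$line` never holds a header together with its sequence lines and the `\n.*` part of each pattern has nothing to match. Second, `$hit1 = $line =~ /.../` evaluates the match in scalar context, so `$hit1` receives the success flag (1 or an empty string) rather than the captured text. Below is a minimal sketch of one way around both, assuming every record starts with a `>` header and that the tag always appears as `GENEID_<number>` or `PROTID_<number>` in the second field. The helper names `classify_header` and `split_records`, the script name in the comment, and passing the input file on the command line are my own choices for illustration, not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide which output a header line belongs to.
# Returns 'GENEID' or 'PROTID', or undef if neither tag is present.
# (hypothetical helper, named for this sketch only)
sub classify_header {
    my ($header) = @_;
    return $1 if $header =~ /^>\S+\s+(GENEID|PROTID)_\d+/;
    return;
}

# Copy each record (header plus the sequence lines after it) to the
# filehandle registered for its tag. Reading stays line by line; the
# "current" handle is remembered until the next header appears.
sub split_records {
    my ($in, $out_for) = @_;    # input handle, hashref: tag => output handle
    my $current;                # handle for the record being copied, if any
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            my $tag = classify_header($line);
            $current = defined $tag ? $out_for->{$tag} : undef;
        }
        print {$current} $line if defined $current;
    }
    return;
}

# Runs only when an input file is named on the command line, e.g.
#   perl split_fasta.pl input.txt      (script name is an assumption)
if (@ARGV) {
    open(my $in,   '<', $ARGV[0])     or die "cannot open $ARGV[0]: $!";
    open(my $gene, '>', 'GENEID.out') or die "cannot create GENEID.out: $!";
    open(my $prot, '>', 'PROTID.out') or die "cannot create PROTID.out: $!";
    split_records($in, { GENEID => $gene, PROTID => $prot });
    close $_ for $in, $gene, $prot;
}
```

Remembering the current output handle avoids having to match a multi-line record with a single regular expression. Records whose header carries neither tag are skipped in this sketch.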