I don't see any need to create a hash table.
Why do you think a hash table is necessary?
Yes, I think your regex is flawed because of this substr "window". What's up with that?
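I downloaded your data file; the code below shows one way to parse it into records.
Each record comes back as a single string: the ">" header line is dropped and the sequence lines are joined together.
The first record in the file is so long that I omitted it from the sample data.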
Please take one of these records and show us the desired output.
#!/usr/bin/perl
use strict;
use warnings;

get_record();    # skips the first ">" line

my $record;
while ( defined ( $record = get_record() ) )
{
    process_record ($record);
}
###########
sub process_record
{
    my $record = shift;
    print "$record\n";
    # please explain more about how to process the record
    # the spec from the prof would be appropriate
}
sub get_record    # works with Perl DATA segment
{
    my $line;
    my $record = undef;
    while ( defined ($line = <DATA>) and $line !~ /^>/ )
    {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        $record .= $line;
    }
    return $record;    # could return a reference, but let's get
                       # the process_record() logic right first
}
__DATA__
>211000022278617 type=golden_path_region; loc=211000022278617:1..1037; ID=211000022278617; dbxref=GB:DS485855,GB:DS485855,REFSEQ:NW_001847144; MD5=8fd22b4969d1f92433b80ee3837e69bc; length=1037; release=r6.18; species=Dmel;
ATGACGAAAATTTCGTTTGTAAATATCAACATTTTTGCAGAGTCTGTTTTTCCAAATTTCGGGTCATCAAATAATCATTT
ATTTTGCCACAACATAAAAAATAATTGTCTGAATATGGAATGTCATACCTCACTGAGCTCGTAATAAAATTTCCAATCAA
ACTGTGTTCAAAAATGGAAATTAAATTTTTTGGCCATATTTTGCAAATTTTGATGACCACCCCTCCTTACAAAAAATTCG
AAAATTGATCCAAAAATTAATTTCCTAAATCCTTCAAAAAGTAATAGGGATCGTTAGCACTGGTAATTAGCTGCTCAAAA
CAGTTATTCTTACATCTATGTGACTATTTTTAGCCAAGTTATGACGAAAATTTCGTTTGTAAATATCAACATTTTTGCAG
AGTCTGTTTTACCAAATTTCGGTCATCAAATAATCATTTATTTTGCCACAACATAAAAAATAAGTGTCTGAATATGGAAT
GTCATACCTCACCGATCTCGTAATAAAATTTCCAATAAAACTGTGTTCAAAAGAGGAAATTAAATTTGTTGGCCATATTT
TGCAAATTTTGATGACCCCCCTCCTTACAAAAAATGCGAAAATTGATCCAAAAATTAATTTCCCTAAATCCTTCAAAAAG
TAATAGGGATCGTTTGCACTGGTAATTAGCTGCTCAAAACAGTTATTCTTACATCTATGTGACCATTTTTAGCCAAGTTA
TAACGAAAATTTCGTTTGTAAATATCAACATTTTTGCAGAGTCTGTTTTTCCAAAATTTGGTCATCAAATAATCATTTAT
TTTGCCACAACATTAAAAATAATTGTCAGAATATGGAATGTTATATTTCACTGAGCTCGTAATAAAATTTCCAATCAAAC
TGTGATCAAAAATGGAAATTAAATTTTTTGGCCATATTTTGCAAATTTTGATGACCCTCCTCCTTACAGAAAATGCGAAA
ATTGATCCAAAAATAAGTTTTCTAAATCCTTCAAAAAGTAATAGGGATCGTTAGCACTGGTAATTAGCTGCTCAAAA
>211000022278618 type=golden_path_region; loc=211000022278618:1..1452; ID=211000022278618; dbxref=GB:DS484811,GB:DS484811,REFSEQ:NW_001846100; MD5=28b88781b6ffbac76cf0ccd6f47258a1; length=1452; release=r6.18; species=Dmel;
ATTTTGAGCAGCTAATTATCAGTGCTAACGATCCCTATTACTTTTTGAAGGATTTAGGGAAATTATTTTTTGGATCAATT
TTCGCATTTTTTGTAAGGAAGGGGGTCATCAAAATTTGCCAAATATGGCCAAAAAATTCAATTTCTATTTTTGAACACAG
TTTGATTGGATATTTTATTACGAGCTCAGTGAGGTATGACATTCCATGTTCAGACAATTATTTTTTATGTTGTGGCAAAA
TAAATGATTATTTGATGACCAAAATTTGGAAAAACAGATTCTGCAAAATGTAATATTTACAAACGAAATTTTCGTCATAA
CTTGGTTAAAAATGGTCACATAGATGTAAGAATAACTGTTTTGAGCAGCTAATAACCAGTGCTAACGATCCCTATTACTT
TTTGAAGGATTTAGGGAAATTAATTTTTGGATCAATTTTCGCATTTTATGTAAGGAGGGGGGTCATCAAAATTTGCAAAA
TTATGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGGTATGACCTT
CCATATTCAGACAATTATTTTTTATGTTGTGGCAAAATAAATGATTATTTGATGACCGAAATTTGGAAAAACAGATTCTG
CCAAAGAAGTAGATATTTACAAACGAAATTTTCGTCATAACTTGGTTAAAAATGGTCACATAGATGTAAGAATAACTGTT
TTGAGCAGCTAATTATCAGTGCAAACGATCCCTATTACTTTTTGAAGGATTTAGGGAAATTAATTTTTGGATCAATTTTC
GCATTTTATGTAAGGAGGGGGGTCATCAAAATTTGCAAAATATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTT
GATTGGAAATTTTATTACGAGCTCAGTGAGGTATGACATTCCATATTCAGACAATTATTTTTTATGTTGTGGCAAAATAA
ATGATTATTTGATGACCAAAATTTGGAAAAACAGACTCTGCAAAAATGTAGATATTTACAAACGAAATTTTCGTTATAAC
TTGGCTAAAAATGGTCACATAGATGTAAGAATAACTGTTTTGAGCAGCTAATAACCAGTGCTAACGATCCCTATTACTTT
TTGAAGGATTTAGGGAAATTAATTTTTGGATCAATTTTCGCATTTTATGTAAGGAGGGGGGTCATCAAAATTTGCAAAAT
ATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGGTATGACATTC
CATATTCAGACAATTATTTTTTATGTTGTGGCAAAATAAATGATTATTTGATGACCAAAATTTGGAAAAACAGACTCTGC
AAAAATGTAGATATTTACAAACGAAATTTTCGTTATAACTTGGCTAAAAATGGTCACATAGATGTAAGAATAACTGTTTG
AGCAGCTAAAAC
>211000022278619 type=golden_path_region; loc=211000022278619:1..1986; ID=211000022278619; dbxref=GB:DS484504,GB:DS484504,REFSEQ:NW_001845793; MD5=35ebd37874cb5c55b4804e9786d17bfe; length=1986; release=r6.18; species=Dmel;
ACAGTTATTCTTACATCTATGTGACAATTTTTAGCCAAGTTATAACGAAAATTTCGTTTGTAAATATCATTACTTTGGCA
GAATCTGTTTTTCCACATTTCGGTCTTCAAATATCATTTATTTTGCCACAACATTAAAAATAATTGTCTGAATATGGAAT
GTCATACCTCACTGAGCTTGTAGTAAAATTTCCAATCAAACTGTGTTCAAAAATGGAATTAAATTTTTTGGCCATATTTT
GCAAATTTTGATGACCTTCTTCCAAAAATTGCAAAAATTGATCTAAAAATTAGTTTCCCTAAATCCTTCAAAAAGTAATA
GGGATCGTTAGCACTGGTAATTAGCTGCTCAAAACAGTTATTCTTAGATCTATGTGACCATTTTTAGCCAAGTTATAACG
AAAATTTCGTTTGTAAATATCAATATTTTGGCAGAATCTGTTTTTCCAAATTTCGGTCAAAAAATAATGATTTATTTTGC
CACAACATAAAAAATAATTGTCTGAATATGGAATGTCATACCTCACTGAGCTTGTAATAAAATTTCCAATCAAACTGTGT
TCAAAAATGGAAATTAAATTTTTTGGCCATATTTTGCAAATTTTGATGACCCCCGTCGTTACAAAAAATGTGAAAATTGA
TCGAAAAATTAATTTCCCTAAGTCCTTCAAAAAGTAATAGCGATCGTTAGCACTGGTAATTAGCTGCTCAAAACAGTTAT
TCTTACATCCACGTGACAGTTTTTAGCCAAGTTATAACGAAAATTTCGTTTGTAAATATCAACATTTTTGCAGAGTCTGT
TTTTCCAAAATTCGGTCATCAAATAATCATTTATTTTGCCACATTAAAAATAATTGTCAGAATATAGAATGTCATACTTC
ACTGACTCGGAATAAAATTTCCAATCAAACTGGGTTCAAAAAATGGAAATTAAACTTTTTGGCCCTATATTACAAATTTT
GATGACCTCCCTCCTTCCCAAAAATGTGAAAATTGATCTAAAAATTAATTTTCCTAAATCCTTCAAAAAGAAATAGCGAT
CATTAGCACTGGTAATTAGCTGCTCAAAACAGTTATTCTTACATCTATGTGACAATTTTTAGCCAAGTTATAATGAAAAT
TTCGTTTGTAAATATCATTACTTTGGAAGAATCTGTTTTTCCACATTTCGGTCTTCAAATAATCATTTATTTTGCCGCAA
CATTAAAAATTATTGTCAGAATATAGAATGTCATACTTCACTGAGCTCATTATAAAATTTCCAATCAAACTGTATTCAAA
AATGGAAATTAAATTTTTTGGCCATATTTTGCAAATTTTGATGACCCCCAACTTCCAAAAATTGTGAAAATTGATCCGAA
AATTAATTTCCCTAAATCCTTCAAAAAGAAATAGCGATCGTTAGCACTGGTAATTAGCTGCTCAAAACAGTTATTCTTAC
ATCTATGTGACAATTTTTAGCCAAGTTATAACGAAAATTTCGTTTGTAAATATCATTACTTTGGCAGAATCTGTTTTTCC
ACATTTCGGTCTTCAAATATCATTTATTTTGCCACAACATTAAAAATAATTGTCTGAATATGGAATGTCATACCTCACTG
AGCTTGTAGTAACATTTCCAATCAAACTGTGTTCAAAAAATGGAAATTACATTTTTTGGTCATATTTTGCAAATTTTGAT
GACCCCCGTCCTTATAAAAAATGTGAAAATTGTTCGAAAAATTAATTTCCCTAAATCCTTCAAAAAGTAAAAGCGATCGT
TAGCACTGGTAATTAGCTGCTCAAAACAGTTATTCTTACATCTATGTGACAATTTTTAGCCAAGTTATAACGAAAATTTC
GTTTGTAAATATCATTACTTTGGCAGAATCTGTTTTTCCACATTTCGGTCTTCAAATATCATTTATTTTGCCACAACATT
AAAAATAATTGTCTGAATATGGAATGTCATACCTCACTGAGCTTGTAGTAAAATTTCCAATCAAAC
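
If the header line turns out to matter for the desired output, here is a rough sketch of another way to split the file into records: set Perl's input record separator `$/` to "\n>" so that each read returns one whole record. This is only an illustration, not a replacement for the spec. The sub name `read_fasta_records` and the tiny sample string are made up for the example, and it assumes plain "\n" line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [header, sequence] pairs.
# Assumes every record starts with ">" at the beginning of a line.
sub read_fasta_records
{
    my ($fh) = @_;
    my @records;
    local $/ = "\n>";                 # a record ends where the next header begins
    while ( defined( my $chunk = <$fh> ) )
    {
        chomp $chunk;                 # strips a trailing "\n>" if present
        $chunk =~ s/^>//;             # only the first chunk still has its ">"
        my ($header, @seq_lines) = split /\n/, $chunk;
        next unless defined $header;  # nothing but a separator was read
        my $sequence = join '', grep { /\S/ } @seq_lines;   # drop blank lines
        push @records, [ $header, $sequence ];
    }
    return @records;
}

# Made-up sample data, NOT taken from the real file
my $sample = ">rec1 length=8;\nACGT\nACGT\n\n>rec2 length=3;\nTTT\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
for my $rec ( read_fasta_records($fh) )
{
    my ($header, $sequence) = @$rec;
    printf "%s => %s (%d bases)\n", $header, $sequence, length $sequence;
}
close $fh;
```

With the real data you would pass `\*DATA` (or a handle opened on the file) instead of the in-memory handle. Comparing `length $sequence` against the `length=` field in each header is a cheap sanity check that the records were split correctly.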