PerlMonks  

splitting data advice requested

by Angharad (Pilgrim)
on May 13, 2009 at 14:05 UTC

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that looks like this
    >SEQ1
    -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
    -E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
    IF---------------L-----GINGPVF------------------------------
    >SEQ2
    -MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    >SEQ3
    --KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    ---AE--D------GG---A---------------------------------------I
The lines starting with '>' are the headers and the series of letters and dashes underneath each header is the 'sequence' associated with that header. I would like to split this data so that the header is one element in the array and the sequence data underneath is in another so - for example
    element[0] = >SEQ1
    element[1] = -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
    -E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
    IF---------------L-----GINGPVF------------------------------
At first sight it seems a fairly easy split operation, and I initially thought I could just split on newline, but I can't, because the sequence data runs over several lines in the file. Any advice or thoughts on how I might be able to do this would be much appreciated. Thanks in advance.

Replies are listed 'Best First'.
Re: splitting data advice requested
by kennethk (Abbot) on May 13, 2009 at 14:44 UTC
    Combining bloodnok's suggestion for a hash with almut's suggestion for an approach to splitting combined with split's limit argument and a positive look-ahead assertion, I give you:

    use strict;
    use warnings;

    local $/;
    my %hash = map split(/\n/, $_, 2), split /\n(?=>)/, <DATA>;
    s/\n//g foreach values %hash;

    __DATA__
    >SEQ1
    -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
    -E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
    IF---------------L-----GINGPVF------------------------------
    >SEQ2
    -MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    >SEQ3
    --KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    ---AE--D------GG---A---------------------------------------I
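For anyone puzzling over the one-liner, here is a rough step-by-step sketch of what the two splits do. The shortened sample string and the variable names $text and @records are made up for illustration; they are not from the post above.

```perl
use strict;
use warnings;

# Shortened sample data (an assumption, for illustration only).
my $text = ">SEQ1\n--A-\n-B--\n>SEQ2\n-C--\n";

# Split at every newline that is followed by '>'. The look-ahead (?=>)
# matches without consuming the '>', so each header keeps it.
my @records = split /\n(?=>)/, $text;

# Limit 2: split each record at its first newline only, giving
# (header, rest-of-record) pairs that populate the hash.
my %hash = map { split /\n/, $_, 2 } @records;

# Remove the remaining newlines from each sequence (values are aliased).
s/\n//g for values %hash;

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Note that this assumes every header is followed by at least one sequence line; a header with nothing after it would yield a single field and upset the key/value pairing.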

      Another way of getting rid of the newlines would be to lose all of them with the second split, passing the fields out in an anonymous array, and then map out the key and sequence using shift and join.

      use strict;
      use warnings;

      use Data::Dumper;

      my %hash =
          map { ( shift @$_, join q{}, @$_ ) }
          map { [ split m{\n} ] }
          map { split m{\n(?=>)} }
          do { local $/; <DATA> };

      print Data::Dumper->Dumpxs( [ \ %hash ], [ qw{ *hash } ] );

      __DATA__
      >SEQ1
      -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
      -E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
      IF---------------L-----GINGPVF------------------------------
      >SEQ2
      -MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
      -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
      IL---------------I-----G----TS-----------------GP-VV--------
      >SEQ3
      --KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
      -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
      IL---------------I-----G----TS-----------------GP-VV--------
      ---AE--D------GG---A---------------------------------------I

      The output.

      %hash = (
                '>SEQ1' => '-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A--E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------KIF---------------L-----GINGPVF------------------------------',
                '>SEQ3' => '--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A--E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------SIL---------------I-----G----TS-----------------GP-VV-----------AE--D------GG---A---------------------------------------I',
                '>SEQ2' => '-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A--E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------SIL---------------I-----G----TS-----------------GP-VV--------'
              );

      I hope this is of interest.

      Cheers,

      JohnGG

Re: splitting data advice requested
by VinsWorldcom (Prior) on May 13, 2009 at 14:13 UTC
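    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    open(my $in, '<', 'in.txt') or die "Cannot open in.txt: $!";

    my @element;
    my $data = '';
    while (<$in>) {
        chomp;
        if (/^>/) {
            # New header: first store the sequence collected for the previous one.
            if ($data ne '') {
                push @element, $data;
                $data = '';
            }
            push @element, $_;
        }
        else {
            # Sequence line: append it, keeping the line breaks.
            $data .= $_ . "\n";
        }
    }
    close $in;

    # Store the sequence that belongs to the last header.
    push @element, $data if $data ne '';

    print Dumper \@element;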
Re: splitting data advice requested
by bichonfrise74 (Vicar) on May 13, 2009 at 21:00 UTC
    How about this?
    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    local $/ = ">";    # read one '>'-terminated record at a time
    my %hash;
    while (<DATA>) {
        s/>//;         # drop the trailing record separator
        # Store the record only if it starts with a header. Testing the
        # substitution itself avoids picking up a stale $1 from an earlier match.
        $hash{">$1"} = $_ if s/^(SEQ\d+)//;
    }
    print Dumper(\%hash);

    __DATA__
    >SEQ1
    -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
    -E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
    IF---------------L-----GINGPVF------------------------------
    >SEQ2
    -MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    >SEQ3
    --KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
    -E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
    IL---------------I-----G----TS-----------------GP-VV--------
    ---AE--D------GG---A---------------------------------------I
      I like your approach, because it avoids the problem with the "full file slurping" options suggested above (i.e. they won't scale to very large input files), and it recognises that '>' is a handy delimiter here. A small suggested improvement, so that it removes the line breaks as per the OP and is a bit shorter:

      ...
      while (<DATA>) {
          $hash{">$1"} = $2 if s/[>\n]//g && /^(SEQ\d+)(.*)/;
      }
      ...
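Since the OP asked for an array rather than a hash, the same record-at-a-time idea could be sketched roughly as follows. This is only one possible take: the helper name read_elements and the shortened sample data are invented for illustration, and it uses "\n>" as the record separator (rather than ">" alone) so that only a '>' at the start of a line ends a record.

```perl
use strict;
use warnings;

# Hypothetical helper: reads records from an open filehandle and returns
# a flat list (header, sequence, header, sequence, ...).
sub read_elements {
    my ($fh) = @_;
    local $/ = "\n>";    # a record ends where the next header line begins
    my @element;
    while ( my $record = <$fh> ) {
        chomp $record;        # removes a trailing "\n>" separator, if present
        $record =~ s/^>//;    # only the first record still carries its '>'
        my ( $header, $sequence ) = split /\n/, $record, 2;
        next unless defined $header && length $header;
        $sequence = '' unless defined $sequence;
        $sequence =~ s/\n//g;    # join the sequence lines into one string
        push @element, ">$header", $sequence;
    }
    return @element;
}

# Demo on an in-memory filehandle with shortened, made-up data.
my $sample = ">SEQ1\n--A-\n-B--\n>SEQ2\n-C--\n";
open( my $fh, '<', \$sample ) or die "Cannot open in-memory file: $!";
my @element = read_elements($fh);
close $fh;

print "element[$_] = $element[$_]\n" for 0 .. $#element;
```

Only one record is held in memory at a time, so the scaling argument above still applies. Drop the s/\n//g line if the line breaks inside each sequence should be kept.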
Re: splitting data advice requested
by Bloodnok (Vicar) on May 13, 2009 at 14:13 UTC
