Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Size of sequences in fastafile

by hippo (Bishop)
on Feb 29, 2020 at 14:08 UTC ( [id://11113576]=note: print w/replies, xml ) Need Help??


in reply to Size of sequences in fastafile

It is difficult to know quite what data structure and form you want from the description (See I know what I mean. Why don't you?). I've assumed you want an array formed with the even lines appended to the odd lines with a space between. Here's an SSCCE to demonstrate one technique for this.

use strict; use warnings; use Test::More tests => 1; my @seq; my @want = ( '>NM_001 Homo sapiens ADA2 (CECR1) GATCCAA', '>NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC', ); while (<DATA>) { chomp; if (/^>/) { push @seq, "$_ "; } else { $seq[-1] .= $_; } } is_deeply \@seq, \@want; __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC

See also How to ask better questions using Test::More and sample data. HTH.

Replies are listed 'Best First'.
Re^2: Size of sequences in fastafile
by Sofie (Acolyte) on Feb 29, 2020 at 16:17 UTC
    Thanks for the reply. But the code was not what I was looking for. Basically I just want to figure out the length of each sequence in a multifasta file. The files contains many sequences, each sequence has new lines inside, and I want to know how many nucleotides each sequence is and then extract only the sequences with more than x number of nt. Not sure if that makes sense? How to figure out the length of a sequence from a multifasta sequence file? Thanks!

      It's still not very clear precisely what you expect but here's my final stab at this.

      use strict; use warnings; use Test::More tests => 1; # Present this SSCCE as a test my @seq; # Sequences longer than $min are stored here my @want = ( 'GGAGGTCTTTAGCTTTAGGGAAACCC', ); # These are the sequeneces we expect for the given data set my $min = 15; # Minimum length of any sequence to be considered my $this = ''; while (<DATA>) { chomp; if (/^>/) { push @seq, $this if $min <= length $this; $this = ''; } else { $this .= $_; } } push @seq, $this if $min <= length $this; is_deeply \@seq, \@want; # Check that our algorithm has worked __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC
        This is my data:
        >NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG AGGTGAGATCCACACAGCTCAGACCAGCTGGATCTTGCTCAGTCTCTGTCAGAGGAAGATCCCTTGGAGG AGGCCCCGCAGCGACATGGAGGGAGCTGCTTTGCTGAAAATCTTTGTCGTCTGCATCTGGAACCAAAATC >NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C memb +er 8 (CYP2C8) ACATGTCAAAGAGACACACACTAAATTAGCAGGGAGTGTTATAAAAACTTTGGAGTGCAAGCTCACAGCT GTCTTAATAAGAAGAGAAGGCTTCAATGGAACCTTTTGTGGTCCTGGTGCTGTGTCTCTCTTTTATGCTT CTCTTTTCACTCTGGAGACAGAGCTGTAGGAGAAGGAAGCTCCCTCCTGGCCCCACTCCTCTTCCTATTA >NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA CCGGGCCCCTGTGAGCATCTTACCGGACAGTGCTGGATTTCCCAGCTTGACTCTAACACTGTCTGGTAAC GATGTTCAAAGGTGACCCGC >AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA AAAAAAAGNNNNNNNNNNAAACTGAATTTCTATTAAGCATCTATTTATAGAAGAGAGTAAACACCCCGTG AATAAAAGACAGAGAATTGTAGCAGCCCGAAGTCCCTTTTCTCTCCTCCCAAGCATTTGGCTCTGGTCCA AATTCACATATCCTGCTCCGTAAAACAAAGTGCCTTGGTTAACCTAACGTTATTCCTTGAACAGTAGTTT AGTGATCAACTAGTTTTTGTTGTTGTTGTTGTTTGAGACAGAGTCTCACTCTGTCGCCCAGGCTGGAGTG CAGTGGCGAGATCTCAGCTCACTGCAACCTCTGCTGCCCAGGTTCAAGGGATTCTCCTGCCTCAGCCTCC CAAGTAGCTGGTATTACAGGCACCTGCCACCGCGCCTGGCTAATTTTTTTTTTTTTTTTTTTTTGTATTT
        The question is, how can I find out the sequence length of each sequence and extract those of a specific length? I have tried both to put the data in an array, which puts each line in one element, and also to put it in a scalar and tried to use the > to separate each seq, but I am stuck.
        Thank you very much for your efforts. I am obviously not able to explain very well what I am trying to do. This seems to put all the sequences in one single element of the @seq array, and the length includes the header ( not just the seq data).

        I just want to extract all the sequences with a minimal length. But the sequences need to still be in fasta format. How can I use the regex for > to extract the sequence in between the headers and check the length? Since each line of sequence has a newline in between.

        Thanks again and sorry for not being able to explain. /Sofie

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11113576]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2024-03-28 20:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found