Parsing and Modifying a flat file in perl

by ad23 (Acolyte)
on Jun 23, 2010 at 17:08 UTC

ad23 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

I am a Perl newbie, and joined Perl Monks recently. This is my first post.

I am trying to parse a flat file using Perl and I have to modify a few things in it. Here is a part of my input file:

>sequence1 123.3
ATGACGTAGACGATGAGTAGACGATAGCAGTGACAGGTGAGTG\n
ATGACGATGAGTAGAGACGGGGGTAGAGGGGGATAGATAGAGANNNNNNNN\n
ATAGACAGATAANNNNNNNNNNNNNNNNNAGATGAGACAGATANNNNNNN
>sequence2 143.5
ATGCGATGCNNNNNNCGTAGCTGANNNNNNCGATGCTGATGCTC\n
CGTAGTCTGCTAGCTAGTCNNNNNNCGTAGTCGATCGATCGANNNNNNCGTGCATGC\n
CGATGCTACGGATNNNNNCGATCGATCGATCGACNNNNNCGATCAGCTAG\n
CCCCGCTAGTCANNNNN
>sequence3 132.3
ATGCTGATCAGCTACGCTAGCNNNNNCGATCGATCGATCGACTAGCNNNNNNCGATCCGAGCT\n
CGATCGATCGATCGATCGANNNNNCGATCGATCGACTAGCNNNNNCGATCGATCGA\n
CGATCGATCGA
>C1132423 123.4
ATCGTGCATGCATCGATCGACTACGCTGCTACGATCGACTGCTAGCTACGCTAC\n
CGTCGATCGATCGACTACGCTGACTGACTAGCTAG
>C1123234 176.4
GCTAGCGATCGCACCGATCGATCGTACGCTACGATCGATCGATCGATCGACTGT\n
CGATCGATCGATCGATCGATCGA
>C1123546 531.1
CGTAGCTACGATCGATCGATCGACTAGCTACGATCGATCGACTAGCTAGCTAGCTAG

Note: '\n' marks where the sequence lines are separated by a newline.

I am modifying this file (both the header and the sequence data).
The example output for this file should be:

>count1.1
ATGACGTAGACGATGAGTAGACGATAGCAGTGACAGGTGAGTGATGACGATGAGTAGAGACGGGGGTAGAGGGGGATAGATAGAGA
>count1.2
ATAGACAGATAA
>count1.3
AGATGAGACAGATA
>count2.1
ATGCGATGC
>count2.2
CGTAGCTGA
>count2.3
CGATGCTGATGCTCCGTAGTCTGCTAGCTAGTC
>count2.4
CGTAGTCGATCGATCGA
>count2.5
CGTGCATGCCGATGCTACGGAT
>count2.6
CGATCGATCGATCGACCGATCAGCTAGCCCCGCTAGTCA
>count3.1
ATGCTGATCAGCTACGCTAGC
>count3.2
CGATCGATCGATCGACTAGC
>count3.3
CGATCCGAGCTCGATCGATCGATCGATCGA
>count3.4
CGATCGATCGACTAGC
>count3.5
CGATCGATCGACGATCGATCGA
>count4.1
ATCGTGCATGCATCGATCGACTACGCTGCTACGATCGACTGCTAGCTACGCTACCGTCGATCGATCGACTACGCTGACTGACTAGCTAG
>count5.1
GCTAGCGATCGCACCGATCGATCGTACGCTACGATCGATCGATCGATCGACTGTCGATCGATCGATCGATCGATCGA
>count6.1
CGTAGCTACGATCGATCGATCGACTAGCTACGATCGATCGACTAGCTAGCTAGCTAG

Can someone please help me with this?
I would really appreciate it!!!

Thanks in advance.

Re: Parsing and Modifying a flat file in perl
by kennethk (Abbot) on Jun 23, 2010 at 17:25 UTC
    What have you tried? Why didn't it work? Please read How do I post a question effectively?. In particular, input and expected output should be wrapped in code tags to maintain formatting. In addition, the mapping from your input to your output is not entirely obvious to me, and so you should explain that. Effort is appreciated around here.

    The following code does something like what you need. Read it, consider it, and understand it. Post specific questions following site guidelines if anything is unclear.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $buffer = "";
    my $series = 1;
    $_ = <DATA>;    # Skip first line
    while (<DATA>) {
        if (/>/) {
            # A new header: emit the pieces of the record just finished
            my @elements = split /N+/, $buffer;
            for my $i (1 .. @elements) {
                print ">Count$series.$i\n$elements[$i-1]\n";
            }
            $buffer = "";
            $series++;
        }
        else {
            # A sequence line: join it onto the current record
            chomp;
            $buffer .= $_;
        }
    }
    # Emit the last record, which has no header after it
    my @elements = split /N+/, $buffer;
    for my $i (1 .. @elements) {
        print ">Count$series.$i\n$elements[$i-1]\n";
    }
    __DATA__
    >sequence1 123.3
    ATGACGTAGACGATGAGTAGACGATAGCAGTGACAGGTGAGTG
    ATGACGATGAGTAGAGACGGGGGTAGAGGGGGATAGATAGAGANNNNNNNN
    ATAGACAGATAANNNNNNNNNNNNNNNNNAGATGAGACAGATANNNNNNN
    >sequence2 143.5
    ATGCGATGCNNNNNNCGTAGCTGANNNNNNCGATGCTGATGCTC
    CGTAGTCTGCTAGCTAGTCNNNNNNCGTAGTCGATCGATCGANNNNNNCGTGCATGC
    CGATGCTACGGATNNNNNCGATCGATCGATCGACNNNNNCGATCAGCTAG
    CCCCGCTAGTCANNNNN
    >sequence3 132.3
    ATGCTGATCAGCTACGCTAGCNNNNNCGATCGATCGATCGACTAGCNNNNNNCGATCCGAGCT
    CGATCGATCGATCGATCGANNNNNCGATCGATCGACTAGCNNNNNCGATCGATCGA
    CGATCGATCGA
    >C1132423 123.4
    ATCGTGCATGCATCGATCGACTACGCTGCTACGATCGACTGCTAGCTACGCTAC
    CGTCGATCGATCGACTACGCTGACTGACTAGCTAG
    >C1123234 176.4
    GCTAGCGATCGCACCGATCGATCGTACGCTACGATCGATCGATCGATCGACTGT
    CGATCGATCGATCGATCGATCGA
    >C1123546 531.1
    CGTAGCTACGATCGATCGATCGACTAGCTACGATCGATCGACTAGCTAGCTAGCTAG
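    One detail of split /N+/ may be worth knowing: Perl drops trailing empty fields, but a record that begins with N would yield an empty first piece. A minimal sketch of one way to guard against that, assuming empty pieces should simply be skipped (the helper name split_on_n and the sample string are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence on runs of N and drop empty pieces.
sub split_on_n {
    my ($seq) = @_;
    # split keeps a leading empty field when the string starts with N;
    # grep { length } filters it out.
    return grep { length } split /N+/, $seq;
}

my @parts = split_on_n("NNNACGTNNNNTTGANN");
print join(",", @parts), "\n";    # prints "ACGT,TTGA"
```

    None of the sample records starts with N, so the code above behaves the same on the posted data either way.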

      I will keep in mind the things you mentioned above (sorry about the formatting; this was my first post).

      I was trying something like this (just a snippet of my code):

      $scafSeq = $ARGV[0];
      open (IN, "< $scafSeq");
      while ( $line = <IN> ) {
          chomp $line;
          $line =~ s/^\s+//g;
          $line =~ s/sequence/count/g;
          next if $line eq "";
          if(substr($line,0,1) eq ">") {
              ($scaff) = $line =~ /^>(\S+)/;
          }
          else {
              $scaffData -> {$scaff} .= $line;
          }
      }

      And then I sort the keys and split each sequence on N. Although this approach worked for the >sequence1, etc. data, it did not work for the >C1113456... data.

      Your code is short and does the job effectively. Thanks!
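      A possible reason the hash-and-sort approach breaks (an assumption, since the full script is not shown): s/sequence/count/ only renames the "sequence" headers, and a default string sort places keys such as C1132423 before count1, so file order is lost. A rough sketch that keeps records in an array and numbers them in input order instead; the helper name renumber_records and the sample data are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA-style records from a filehandle and return
# output lines, numbering records in the order they appear in the input.
sub renumber_records {
    my ($fh) = @_;
    my @records;    # an array, not a hash, so input order is preserved
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        if ($line =~ /^>/) {
            push @records, '';        # new record; the header text is discarded
        }
        elsif (@records) {
            $records[-1] .= $line;    # join wrapped sequence lines
        }
    }
    my @out;
    my $n = 0;
    for my $seq (@records) {
        $n++;
        my $i = 0;
        for my $piece (grep { length } split /N+/, $seq) {
            $i++;
            push @out, ">count$n.$i", $piece;
        }
    }
    return @out;
}

# Small demo using an in-memory filehandle (the sample data is made up).
my $sample = ">sequence1 123.3\nACGTNNNN\nTTGA\n>C1132423 123.4\nGGCC\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for renumber_records($fh);
close $fh;
```

      Because the counter comes from position rather than from the header name, the >C... records are numbered the same way as the >sequence... ones.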

Re: Parsing and Modifying a flat file in perl
by MidLifeXis (Monsignor) on Jun 23, 2010 at 18:06 UTC

      In particular OP might want to check out Bio::SeqIO::fasta...

      Just a something something...
Re: Parsing and Modifying a flat file in perl
by eighty-one (Curate) on Jun 23, 2010 at 17:34 UTC

    I can't quite make out how you got the output from the input you provided. Maybe I'm just not seeing something obvious, but the rules that applied to the input to produce the output aren't clear to me. So I'll just offer some links that may be of help to a self-described newbie.

    The perlretut Perl regular expression tutorial might be of some help. Also, I'm not sure how new a newbie you are, but perlopentut should help you get the file open, if you need that. Once you have the file open, you can go through its contents, use regular expressions to match text based on the rules you provide, and manipulate the text as needed.

    Those articles are both at http://perldoc.perl.org - you might find the rest of the site helpful as well if you need a good reference.
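
    As a rough illustration of the open-then-match workflow described above, here is a small sketch that opens a file with the three-argument form of open and uses a regular expression to pick apart the header lines. The helper name read_headers, the temporary file, and its contents are made up for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return [name, number] for each ">name number" header line.
sub read_headers {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my @headers;
    while (my $line = <$in>) {
        # Capture the non-space name after '>' and the number that follows it.
        if ($line =~ /^>(\S+)\s+([\d.]+)/) {
            push @headers, [$1, $2];
        }
    }
    close $in;
    return @headers;
}

# Demo on a throwaway file with made-up contents.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} ">sequence1 123.3\nACGT\n>C1132423 123.4\nGGCC\n";
close $tmp;
printf "%s => %s\n", @$_ for read_headers($name);
```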

    Also, depending on where you work or study, you might have access to Safari, an online service with electronic copies of a ton of good tech books. The O'Reilly Perl books are quite good, and might be helpful. 'Programming Perl' was very useful and sat on my desk for a long time while I was learning Perl.

    Sorry I couldn't offer anything more specific. Hopefully those will help.

      Thanks a lot for all the information. I appreciate it.

Re: Parsing and Modifying a flat file in perl
by planetscape (Chancellor) on Jun 23, 2010 at 23:01 UTC
