PerlMonks  

How to extract lines starting with new names/words

by sm2004 (Acolyte)
on Mar 13, 2008 at 07:31 UTC ( [id://673918]=perlquestion )

sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that looks like:
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A03.f1 760 42572233 ubq
MA01001A1A04.f1 300 15232924 ubq
MA01001A1A04.f1 300 145334669 DNA
MA01001A1B22.f1 580 77745475 ra
MA01001A1B22.f1 580 30409730 ra
How do I write a Perl script that extracts only the lines whose first word has not appeared earlier in the list? The extracted list should contain only:
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A04.f1 300 15232924 ubq
MA01001A1B22.f1 580 77745475 ra
Any tips on how to get this done are very much appreciated.

Replies are listed 'Best First'.
Re: How to extract lines starting with new names/words
by moritz (Cardinal) on Mar 13, 2008 at 07:37 UTC
    You can keep these first words in a hash and check if they have already been stored:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen_words;
    while (<DATA>){
        if (!m/^(\S+)/){
            die "Invalid line: $_";
        }
        my $first_word = $1;
        if (!$seen_words{$first_word}){
            print;
            $seen_words{$first_word} = 1;
        }
    }
    __DATA__
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1A04.f1 300 145334669 DNA
    MA01001A1B22.f1 580 77745475 ra
    MA01001A1B22.f1 580 30409730 ra

    This can be written a little more compactly:

    while (<DATA>){
        if (!m/^(\S+)/){
            die "Invalid line: $_";
        }
        print unless $seen_words{$1}++;
    }

    But the first one is easier to read for the beginner ;-)

      Thanks so much! That was perfect. Exactly what I wanted it to do. I spent several days trying to do this; I just learned Perl two weeks ago. Thanks again.
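The hash-based approach above can also be wrapped in a small function. The following is only a sketch: the name `first_seen_lines` and the use of `split` instead of a regexp are assumptions for illustration, not part of moritz's reply. Unlike the original script, it silently skips blank lines instead of dying on them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the lines whose first
# whitespace-separated field has not been seen on an earlier line.
sub first_seen_lines {
    my @lines = @_;
    my %seen;
    return grep {
        my ($first) = split ' ', $_;          # first field; leading space ignored
        defined $first && !$seen{$first}++;   # true only on the first occurrence
    } @lines;
}

# Sample data copied from the question.
my @sample = (
    "MA01001A1A03.f1 760 5640111 ad1\n",
    "MA01001A1A03.f1 760 42572233 ubq\n",
    "MA01001A1A04.f1 300 15232924 ubq\n",
    "MA01001A1A04.f1 300 145334669 DNA\n",
    "MA01001A1B22.f1 580 77745475 ra\n",
    "MA01001A1B22.f1 580 30409730 ra\n",
);

print first_seen_lines(@sample);
```

The post-increment in `!$seen{$first}++` is the same trick as in the compact version: the test sees the old count (zero on first sight), and the counter is bumped afterwards.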
Re: How to extract lines starting with new names/words
by Thilosophy (Curate) on Mar 13, 2008 at 08:47 UTC
    I believe moritz's script will do what you want, but your expected output is confusing:
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1B22.f1 580 77745475 ra
    Should not the second line list the first occurrence of ubq? And what happened to DNA?
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 145334669 DNA
    MA01001A1B22.f1 580 77745475 ra

    Update: Ah, yes...
    As moritz points out (in more polite words) below, I am an idiot. Or tired. I repent, expect swift and adequate punishment, and demand this node be voted down to about -5. (but not much more. please)

      but your expected output is confusing:

      I found that the expected output matches the description very well.

      Should not the second line list the first occurrence of ubq?

      No, because the first two lines both start with MA01001A1A03.f1

      And what happened to DNA?
      It starts with the same word as the third line (MA01001A1A04.f1).
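The two readings discussed in this sub-thread differ only in which column is used as the hash key. As a rough sketch (the helper name `unique_by_column` is hypothetical and not from any reply), keying on the first column gives the output the question asks for, while keying on the last column gives the four-line list from the objection:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep the first line seen for each distinct value in
# the given column (zero-based; -1 means the last column).
sub unique_by_column {
    my ($column, @lines) = @_;
    my %seen;
    return grep {
        my @fields = split ' ', $_;
        my $key = $fields[$column];
        defined $key && !$seen{$key}++;
    } @lines;
}

# Sample data copied from the question.
my @sample = (
    "MA01001A1A03.f1 760 5640111 ad1\n",
    "MA01001A1A03.f1 760 42572233 ubq\n",
    "MA01001A1A04.f1 300 15232924 ubq\n",
    "MA01001A1A04.f1 300 145334669 DNA\n",
    "MA01001A1B22.f1 580 77745475 ra\n",
    "MA01001A1B22.f1 580 30409730 ra\n",
);

my @by_first = unique_by_column(0,  @sample);   # what the question asks for
my @by_last  = unique_by_column(-1, @sample);   # the reading behind the objection

print "Keyed on first word:\n", @by_first;
print "Keyed on last word:\n",  @by_last;
```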
Re: How to extract lines starting with new names/words
by poolpi (Hermit) on Mar 13, 2008 at 10:11 UTC

    If some lines begin with a comment, for example, you will need a different regexp:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line;
    while (<DATA>) {
        next unless /\A (\w+[.]\w+) \s+ (.+) \z/xms;
        print unless $line->{ $1 }++;
    }
    __DATA__
    # Log file 13/3/2008
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1A04.f1 300 145334669 DNA
    #
    MA01001A1B22.f1 580 77745475 ra
    MA01001A1B22.f1 580 30409730 ra
    MA01001A1A03.f1 760 5640111 foo
    MA01001A1A04.f1 300 15232924 bar
    # End of log
    Output:
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1B22.f1 580 77745475 ra
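An alternative way to handle comments, sketched here under the assumption that a comment is any line whose first non-blank character is `#`, is to skip such lines explicitly before looking at the first word. The helper name and the sample data are made up for illustration. Without such a skip, a pattern like `^(\S+)` would treat `#` as an ordinary first word, so the first comment line would be printed and all later ones suppressed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: like the hash-based filter, but ignoring comment
# lines (first non-blank character is '#') and blank lines.
sub first_seen_skipping_comments {
    my @lines = @_;
    my %seen;
    my @kept;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;          # comment or blank line
        my ($first) = split ' ', $line;          # first whitespace-separated field
        push @kept, $line unless $seen{$first}++;
    }
    return @kept;
}

# Made-up sample in the style of the log above.
my @log = (
    "# Log file 13/3/2008\n",
    "MA01001A1A03.f1 760 5640111 ad1\n",
    "MA01001A1A03.f1 760 42572233 ubq\n",
    "\n",
    "# a comment in the middle\n",
    "MA01001A1A04.f1 300 15232924 ubq\n",
    "MA01001A1A03.f1 760 5640111 foo\n",
    "# End of log\n",
);

my @kept = first_seen_skipping_comments(@log);
print @kept;
```

Unlike the regexp in the reply above, this sketch does not require the first word to have the `name.suffix` shape, so it accepts any non-comment line.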

    hth,

    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb

    Update : for -> while, thanks johngg ;)

      Your for (<DATA>) would be better written as while (<DATA>). Using for will have the effect of reading the entire file into memory rather than processing a line at a time as with while. Not a problem, perhaps, with small data sets but it's not a good habit to get into.

      Cheers,

      JohnGG
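The difference JohnGG describes can be observed directly. In this minimal sketch (the in-memory file and the helper name are assumptions chosen to keep the example self-contained), `eof` is checked during the first pass through each kind of loop: with `for`, the handle is already exhausted because the readline ran in list context; with `while`, only one line has been read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";

# Hypothetical helper: report whether the filehandle is already at end of
# file when the loop body runs for the first time.
sub exhausted_on_first_iteration {
    my ($style) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $at_eof;
    if ($style eq 'for') {
        for my $line (<$fh>) {        # list context: all lines are read up front
            $at_eof = eof($fh) ? 1 : 0;
            last;
        }
    }
    else {
        while (my $line = <$fh>) {    # scalar context: one line per iteration
            $at_eof = eof($fh) ? 1 : 0;
            last;
        }
    }
    close $fh;
    return $at_eof;
}

for my $style (qw(for while)) {
    printf "%-5s input fully read before first iteration? %s\n",
        $style, exhausted_on_first_iteration($style) ? 'yes' : 'no';
}
```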

      Thanks a lot. I could use the idea for another file from which I need to extract data. I'm new to Perl and all your input helped a lot.

Node Type: perlquestion [id://673918]
Approved by moritz