PerlMonks  

How to match duplicate lines in a text file and extract only one of those lines to a new file

by danica (Initiate)
on Apr 04, 2012 at 11:13 UTC ( [id://963412]=perlquestion )

danica has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am a complete Perl novice, so forgive me in advance. I have been searching the web and haven't been able to find a solution. I have a really big text file called "HGDP.txt" that looks like this:

1 51 Brahui A C A A T

1 51 Brahui A C A G T

3 51 Brahui A C A G C

3 51 Brahui A C G A T

5 51 Brahui A C G A T

5 51 Brahui A C G G C

7 51 Brahui A C G A T

7 51 Brahui A C G G T

9 51 Brahui A C G G T

9 51 Brahui A C G G T

Except that the total number of columns is 2,841. I want to use Perl to generate an output file in which, whenever two lines have the same value in the first column (i.e. duplicates of 1, 3, 5, etc.), the two lines are merged column by column, with a "/" character as the delimiter between the paired values. For example:

1 51 Brahui A/A C/C A/A A/G T/T

3 51 Brahui A/A C/C A/G G/A C/T

Is there a way to do that?


Replies are listed 'Best First'.
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by nemesdani (Friar) on Apr 04, 2012 at 11:41 UTC
    There is a way, yes:)
    However, the solution has dependencies: Are all the leading numbers duplicated? If so, are the numbers that follow always equal as well? Etc. The solution can be very specific or very generic, depending on the answers.
    If you actually try something, post the code and we can step further.

    I'm too lazy to be proud of being impatient.
      Yes, the numbers in the first column are duplicates. The first column is a unique ID number given to an individual. As you probably guessed, the file contains DNA sequences, and each individual is allocated 2 rows to represent their alternative alleles. However, I need to transform the data so that each individual has only 1 row instead of 2.
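Given that clarification (exactly two rows per individual, same ID on both), the merge of a single pair can be sketched as below. This is only an illustration: `merge_pair` is a hypothetical helper name, and the assumption that the first 3 columns are copied unchanged comes from the sample data, not from any stated file specification.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the two allele rows of one individual.
# $fixed is the number of leading columns copied once (assumed to be 3
# for the sample data: ID, population number, population name).
sub merge_pair {
    my ($line_a, $line_b, $fixed) = @_;
    my @first  = split ' ', $line_a;
    my @second = split ' ', $line_b;

    # Refuse to merge rows that do not belong together.
    die "ID mismatch: $first[0] vs $second[0]\n" unless $first[0] eq $second[0];
    die "Column count mismatch for ID $first[0]\n" unless @first == @second;

    # Pair up every allele column as "a/b".
    my @merged = map { "$first[$_]/$second[$_]" } $fixed .. $#first;
    return join ' ', @first[0 .. $fixed - 1], @merged;
}

print merge_pair('1 51 Brahui A C A A T', '1 51 Brahui A C A G T', 3), "\n";
# prints: 1 51 Brahui A/A C/C A/A A/G T/T
```

Dying on a mismatch is a deliberate choice here: with 2,841 columns, a silently misaligned pair would be hard to spot afterwards.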
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by ww (Archbishop) on Apr 04, 2012 at 11:34 UTC
    1. Yes.
    2. We encourage you to search the solutions already posted here
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by tobyink (Canon) on Apr 04, 2012 at 11:42 UTC

    This does the job:

    my %data;
    while (<DATA>) {
        chomp;
        my ($firstnum, $secondnum, $thingy, @bits) = split /\s/;
        my $key = sprintf("%s\x00%s\x00%s", $firstnum, $secondnum, $thingy);
        for my $i (0 .. $#bits) {
            $data{$key}[$i] = [] unless exists $data{$key}[$i];
            push @{ $data{$key}[$i] }, $bits[$i];
        }
    }
    foreach my $key (sort keys %data) {
        print join q[ ], split "\x00", $key;
        print q[ ];
        print join q[ ], map { join '/', @$_ } @{ $data{$key} };
        print "\n";
    }
    __DATA__
    1 51 Brahui A C A A T
    1 51 Brahui A C A G T
    3 51 Brahui A C A G C
    3 51 Brahui A C G A T
    5 51 Brahui A C G A T
    5 51 Brahui A C G G C
    7 51 Brahui A C G A T
    7 51 Brahui A C G G T
    9 51 Brahui A C G G T
    9 51 Brahui A C G G T

    But don't just copy that as-is. Try to understand how it works. What you want to look at is:

    • "I/O Operators" in perlop.
    • split, join and map - see perlfunc.
    • perllol to teach you about nested data structures.
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
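For readers following those pointers, here is a minimal sketch of how split, join, map and a nested array (as described in perllol) fit together for this problem. `columns_of` is a hypothetical helper name, and only the allele columns are shown; it is not a replacement for the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn several rows into per-column lists.
# Each row is an array reference; the result is an array of arrays,
# one inner array per column.
sub columns_of {
    my @rows = @_;
    my @cols;
    for my $row (@rows) {
        # Autovivification creates each inner array on first use.
        push @{ $cols[$_] }, $row->[$_] for 0 .. $#$row;
    }
    return @cols;
}

# split turns each line into a list of fields.
my @rows = map { [ split ' ', $_ ] } ('A C A A T', 'A C A G T');
my @cols = columns_of(@rows);

# The inner join glues one column with "/", the outer join glues the row.
my $line = join ' ', map { join '/', @$_ } @cols;
print "$line\n";
# prints: A/A C/C A/A A/G T/T
```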
      Hiya, thank you so much for your help. I tried to run your code just to see how it works. One thing I noticed when I looked at the output is that the first column doesn't seem to get transformed. Some duplicates also seem to have been missed.

      Like so:

      1 Brahui A C/C A/A A/G T/T

      100 Hazara A C G A T C C

      100 Hazara G C A A T C T

      102 Hazara A C/C G/G A/G

        In your original sample data, every line began with two integers and then a text string. Now you seem to be running it on lines that begin with a single integer and a text string, so his code is picking up the first allele as part of the duplicated section.

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.
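One way to cope with files whose number of leading columns differs is to make that count a parameter. The sketch below assumes that rows with the same ID are adjacent in the file; `merge_groups` and its `$fixed` argument are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: merge consecutive rows that share the same first column.
# $fixed is the number of leading columns that are copied, not joined
# (3 for the original sample, 2 for the "100 Hazara ..." rows).
sub merge_groups {
    my ($fixed, @lines) = @_;
    my (@out, @group);

    # Emit the rows collected so far as one merged line.
    my $flush = sub {
        return unless @group;
        my @head = @{ $group[0] }[0 .. $fixed - 1];
        my @tail = map {
            my $col = $_;
            join '/', map { $_->[$col] } @group;
        } $fixed .. $#{ $group[0] };
        push @out, join ' ', @head, @tail;
        @group = ();
    };

    for my $line (@lines) {
        my @fields = split ' ', $line;
        next unless @fields;    # skip blank lines
        # A new ID in the first column closes the current group.
        $flush->() if @group && $group[0][0] ne $fields[0];
        push @group, \@fields;
    }
    $flush->();
    return @out;
}

my @merged = merge_groups(2,
    '100 Hazara A C G A T C C',
    '100 Hazara G C A A T C T',
);
print "$_\n" for @merged;
# prints: 100 Hazara A/G C/C G/A A/A T/T C/C C/T
```

Because only the current group is held in memory, this approach should also suit a very large file when the lines are fed in one at a time rather than passed as a list.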

Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by Sinistral (Monsignor) on Apr 04, 2012 at 13:22 UTC

    If this analysis and file combination is a standard procedure in the genetics domain, then someone has probably come up with a solution on the BioPerl site. I tried using their search engine for HGDP, but that didn't give me any search results. If nothing else, you should consult the site so that, for future projects, you know what work has already been done.

Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by Anonymous Monk on Apr 04, 2012 at 16:10 UTC
    Here is my solution:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $input_file  = 'HGDP.txt';
    my $output_file = 'output.txt';

    open my $fh, '<', $input_file
        or die "Unable to open for read $input_file: $!";
    open my $out_fh, '>', $output_file
        or die "Unable to open for write $output_file: $!";

    local $, = q{ };
    local $\ = "\n";

    my @rows;
    my $static_i = 3; # number of first unjoinable columns

    sub print_rows {
        print {$out_fh} @{$rows[0]}[0 .. $static_i - 1],
            map {
                my @columns;
                foreach my $x (0 .. $#rows) {
                    push @columns, $rows[$x][$_];
                }
                join q{/}, @columns;
            } $static_i .. $#{$rows[0]};
    }

    while (defined(my $line_1 = <$fh>)) {
        my ($x) = $line_1 =~ /^(\d+)/ or next;
        push @rows, [split q{ }, $line_1];
        while (defined(my $line_2 = <$fh>)) {
            next unless $line_2 =~ /^\d/;
            if ($line_2 =~ /^$x\b/) {
                push @rows, [split q{ }, $line_2];
            }
            else {
                print_rows();
                @rows = [split q{ }, $line_2];
                last;
            }
        }
    }
    print_rows();

    close $fh;
    close $out_fh;
    __END__

    Output from your example:

    1 51 Brahui A/A C/C A/A A/G T/T
    3 51 Brahui A/A C/C A/G G/A C/T
    5 51 Brahui A/A C/C G/G A/G T/C
    7 51 Brahui A/A C/C G/G A/G T/T
    9 51 Brahui A/A C/C G/G G/G T/T

    Note that the code is not very efficient and not very well written, but you can try to improve it. Good luck!
      Thank you very much! Though you said it needs improving, I felt like I understood your code!

        Hi guys, I am new to Perl. I have a situation very similar to this one: my input rows are given below, and I have to find the duplicates in the first column.

        green apple
        green grapes
        blue blueberries
        orange pappaya
        orange orange

        Output:

        green apple/grapes
        blue blueberries
        orange pappaya/orange

        can one of you guys please explain this code... Thanks
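For this two-column variant, a hash of arrays keyed on the first column may be easier to follow than the code above. The following is a rough sketch, not an explanation of that code: `group_by_first` is a hypothetical helper name, and it assumes each line holds exactly one key and one value separated by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the second column under each first-column key,
# keeping keys in the order they first appear.
sub group_by_first {
    my @lines = @_;
    my (%values, @order);

    for my $line (@lines) {
        my ($key, $value) = split ' ', $line;
        next unless defined $value;              # skip blank or one-word lines
        push @order, $key unless exists $values{$key};
        push @{ $values{$key} }, $value;
    }

    my @out;
    for my $key (@order) {
        push @out, "$key " . join('/', @{ $values{$key} });
    }
    return @out;
}

my @grouped = group_by_first(
    'green apple',
    'green grapes',
    'blue blueberries',
    'orange pappaya',
    'orange orange',
);
print "$_\n" for @grouped;
# prints:
# green apple/grapes
# blue blueberries
# orange pappaya/orange
```

The separate `@order` array is there because a Perl hash does not remember insertion order; without it the output lines could come out in any order.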
