How to match duplicate lines in a text file and extract only one of those lines to a new file

danica has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am a complete perl novice so forgive me in advance. I have been searching the web and haven't been able to find a solution. I have a really big text file called "HGDP.txt" that looks like this:

1 51 Brahui A C A A T

1 51 Brahui A C A G T

3 51 Brahui A C A G C

3 51 Brahui A C G A T

5 51 Brahui A C G A T

5 51 Brahui A C G G C

7 51 Brahui A C G A T

7 51 Brahui A C G G T

9 51 Brahui A C G G T

Except that the total number of columns is 2,841. I want to use Perl to generate another output file in which, whenever two lines have the same value in the first column (i.e. the duplicates of 1, 3, 5, etc.), the two lines are merged into one, with a "/" character as the delimiter between the paired values. For example:

1 51 Brahui A/A C/C A/A A/G T/T

3 51 Brahui A/A C/C A/G G/A C/T

Is there a way to do that?


Replies are listed 'Best First'.
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file by nemesdani (Friar) on Apr 04, 2012 at 11:41 UTC
There is a way, yes :) However, the solution depends on a few things: Are all the leading numbers duplicated? If so, are the numbers that follow always equal? And so on. The solution can be very specific or very generic, depending on the answers. If you actually try something, post the `code` and we can take it from there.
I'm too lazy to be proud of being impatient.
Re^2: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 04, 2012 at 13:04 UTC
Yes, the numbers in the first column are duplicates. The first column is a unique ID number given to an individual. As you probably guessed, the file contains DNA sequences and each individual is allocated 2 rows to represent their alternative alleles. However, I need to transform the data so that each individual has only 1 row instead of 2.
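Given that description (exactly two consecutive rows per individual), one option is to read the file two rows at a time and never hold more than a pair in memory. The following is only a minimal sketch under that assumption: the names `merge_pair` and `merge_stream` are made up for illustration, and the count of 3 leading columns is taken from the sample rows, not from the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge the two allele rows of one individual into a single row.
# $fixed is the number of leading columns copied through unchanged
# (assumed to be 3 for the sample: ID, number, population name).
sub merge_pair {
    my ($row_a, $row_b, $fixed) = @_;
    my @a = split ' ', $row_a;
    my @b = split ' ', $row_b;
    die "ID mismatch: $a[0] vs $b[0]\n"       unless $a[0] eq $b[0];
    die "Column count differs for ID $a[0]\n" unless @a == @b;
    return join ' ', @a[0 .. $fixed - 1],
                     map { "$a[$_]/$b[$_]" } $fixed .. $#a;
}

# Read rows two at a time from $in, skipping blank lines, and print
# one merged row per pair to $out.
sub merge_stream {
    my ($in, $out, $fixed) = @_;
    while (defined(my $first = <$in>)) {
        next unless $first =~ /\S/;    # skip blank lines
        my $second;
        while (defined($second = <$in>)) {
            last if $second =~ /\S/;
        }
        die "No partner row for ID " . (split ' ', $first)[0] . "\n"
            unless defined $second;
        print {$out} merge_pair($first, $second, $fixed), "\n";
    }
    return;
}

# Demo on the first four sample rows, read from an in-memory filehandle.
my $sample = "1 51 Brahui A C A A T\n"
           . "1 51 Brahui A C A G T\n"
           . "3 51 Brahui A C A G C\n"
           . "3 51 Brahui A C G A T\n";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
merge_stream($in, \*STDOUT, 3);
close $in;
```

For the real file, the in-memory handle would be replaced by one opened on "HGDP.txt". The sketch dies on a mismatched ID rather than guessing, so an individual with a missing second row is reported instead of being silently paired with a neighbour.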
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file by ww (Archbishop) on Apr 04, 2012 at 11:34 UTC
Yes. We encourage you to search the solutions already posted here.
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file by tobyink (Canon) on Apr 04, 2012 at 11:42 UTC
This does the job:

    my %data;
    while (<DATA>) {
        chomp;
        my ($firstnum, $secondnum, $thingy, @bits) = split /\s/;
        my $key = sprintf("%s\x00%s\x00%s", $firstnum, $secondnum, $thingy);
        for my $i (0 .. $#bits) {
            $data{$key}[$i] = [] unless exists $data{$key}[$i];
            push @{ $data{$key}[$i] }, $bits[$i];
        }
    }

    foreach my $key (sort keys %data) {
        print join q[ ], split "\x00", $key;
        print q[ ];
        print join q[ ], map { join '/', @$_ } @{ $data{$key} };
        print "\n";
    }

    __DATA__
    1 51 Brahui A C A A T
    1 51 Brahui A C A G T
    3 51 Brahui A C A G C
    3 51 Brahui A C G A T
    5 51 Brahui A C G A T
    5 51 Brahui A C G G C
    7 51 Brahui A C G A T
    7 51 Brahui A C G G T
    9 51 Brahui A C G G T
    9 51 Brahui A C G G T

But don't just copy that as-is. Try to understand how it works. What you want to look at is: "I/O Operators" in perlop; `split`, `join` and `map` (see perlfunc); and perllol, to teach you about nested data structures.

`perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
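One detail worth knowing about the code above: `sort keys %data` compares keys as strings, so an ID of 100 sorts before an ID of 3. If numeric order by ID matters, the keys can be sorted on their first field instead. This is just a sketch: `sort_keys_by_id` is a hypothetical helper, the IDs 10 and 100 in the demo are invented, and it assumes NUL-joined keys whose first field is an integer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return composite keys ordered numerically by their first field.
# Keys are assumed to be NUL-joined, e.g. "1\x0051\x00Brahui".
# Ties on the ID fall back to a plain string comparison of the whole key.
sub sort_keys_by_id {
    my @keys = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ (split /\x00/, $_)[0], $_ ] } @keys;
}

# Demo with made-up IDs; a plain "sort" would put 10 and 100 before 3.
my @keys = ("100\x0051\x00Brahui", "3\x0051\x00Brahui",
            "10\x0051\x00Brahui",  "1\x0051\x00Brahui");
print join(' ', split /\x00/, $_), "\n" for sort_keys_by_id(@keys);
```

In the reply's code this would mean writing `foreach my $key (sort_keys_by_id(keys %data))` in place of `foreach my $key (sort keys %data)`.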
Re^2: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 04, 2012 at 13:26 UTC
Hiya, thank you so much for your help. I tried to run your code just to see how it works. One thing I noticed when I looked at the output is that the first column doesn't seem to get transformed. Some duplicates also seem to have been missed. Like so:

    1 Brahui A C/C A/A A/G T/T
    100 Hazara A C G A T C C
    100 Hazara G C A A T C T
    102 Hazara A C/C G/G A/G
Re^3: How to match duplicate lines in a text file and extract only one of those lines to a new file by aaron_baugher (Curate) on Apr 04, 2012 at 14:37 UTC
In your original sample data, every line began with two integers and then a text string. Now you seem to be running it on lines that begin with a single integer and a text string, so his code is picking up the first allele as part of the duplicated section.
Aaron B. My Woefully Neglected Blog, where I occasionally mention Perl.
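To make that concrete, here is a small sketch of what happens to the grouping key when the number of leading columns is assumed wrongly. `split_row` is a hypothetical helper, and the two rows for individual 100 are taken from the output quoted earlier in the thread, with the line breaks assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a row into its key (the first $fixed columns, joined by a space)
# and a reference to the remaining allele columns.
sub split_row {
    my ($line, $fixed) = @_;
    my @fields  = split ' ', $line;
    my $key     = join ' ', @fields[0 .. $fixed - 1];
    my @alleles = @fields[$fixed .. $#fields];
    return ($key, \@alleles);
}

# Rows with only two leading columns (ID and population name).
my @rows = ('100 Hazara A C G A T C C', '100 Hazara G C A A T C T');

# Count the distinct keys produced under each assumption.
for my $fixed (3, 2) {
    my %seen;
    $seen{ (split_row($_, $fixed))[0] }++ for @rows;
    printf "fixed=%d: %d distinct key(s): %s\n",
        $fixed, scalar(keys %seen),
        join(', ', map { "'$_'" } sort keys %seen);
}
```

With three leading columns assumed, the first allele (A or G) becomes part of the key, so the two rows land in different groups and are never merged. With two, they share the key '100 Hazara'.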
Re^4: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 05, 2012 at 09:26 UTC
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file by Sinistral (Monsignor) on Apr 04, 2012 at 13:22 UTC
If this analysis and file combination is a standard procedure in the genetics domain, then someone has probably come up with a solution on the BioPerl site. I tried using their search engine for HGDP, but that didn't give me any search results. If nothing else, you should consult the site so that, for future projects, you know what work has already been done.
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file by Anonymous Monk on Apr 04, 2012 at 16:10 UTC
Here is my solution:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $input_file  = 'HGDP.txt';
    my $output_file = 'output.txt';

    open my $fh, '<', $input_file
        or die "Unable to open for read $input_file: $!";
    open my $out_fh, '>', $output_file
        or die "Unable to open for write $output_file: $!";

    local $, = q{ };
    local $\ = "\n";

    my @rows;
    my $static_i = 3; # number of first unjoinable columns

    sub print_rows {
        print {$out_fh} @{$rows[0]}[0 .. $static_i - 1], map {
            my @columns;
            foreach my $x (0 .. $#rows) {
                push @columns, $rows[$x][$_];
            }
            join q{/}, @columns;
        } $static_i .. $#{$rows[0]};
    }

    while (defined(my $line_1 = <$fh>)) {
        my ($x) = $line_1 =~ /^(\d+)/ or next;
        push @rows, [split q{ }, $line_1];
        while (defined(my $line_2 = <$fh>)) {
            next unless $line_2 =~ /^\d/;
            if ($line_2 =~ /^$x\b/) {
                push @rows, [split q{ }, $line_2];
            }
            else {
                print_rows();
                @rows = [split q{ }, $line_2];
                last;
            }
        }
    }
    print_rows();

    close $fh;
    close $out_fh;

    __END__

Output from your example:

    1 51 Brahui A/A C/C A/A A/G T/T
    3 51 Brahui A/A C/C A/G G/A C/T
    5 51 Brahui A/A C/C G/G A/G T/C
    7 51 Brahui A/A C/C G/G A/G T/T
    9 51 Brahui A/A C/C G/G G/G T/T

Note that the code is not very efficient and not very well written, but you can try to improve it. Good luck!
Re^2: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 05, 2012 at 09:27 UTC
Thank you very much! Though you said it needs improving, I felt like I understood your code!
Re^3: How to match duplicate lines in a text file and extract only one of those lines to a new file by perlnewbie012215 (Novice) on Aug 19, 2015 at 06:25 UTC
Hi guys, I am new to Perl. I have a situation very similar to this one: my input rows are given below, and I have to find the duplicates in the first column.

    green apple
    green grapes
    blue blueberries
    orange pappaya
    orange orange

Output:

    green apple/grapes
    blue blueberries
    orange pappaya/orange

Can one of you guys please explain this code? Thanks
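For this simpler two-column case, a hash of arrays plus a separate list that remembers the order in which keys first appear is one way to get the output shown. This is a rough sketch rather than a definitive answer: `group_by_first` is a made-up name, and it assumes whitespace-separated input where everything after the first column is the value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group lines by their first column, joining the remaining text of each
# line with "/" and keeping keys in the order they first appear.
sub group_by_first {
    my @lines = @_;
    my (%values, @order);
    for my $line (@lines) {
        my ($key, @rest) = split ' ', $line;
        next unless defined $key;                       # skip blank lines
        push @order, $key unless exists $values{$key};  # remember first sighting
        push @{ $values{$key} }, "@rest";
    }
    return map { "$_ " . join('/', @{ $values{$_} }) } @order;
}

print "$_\n" for group_by_first(
    'green apple', 'green grapes', 'blue blueberries',
    'orange pappaya', 'orange orange',
);
```

Unlike the pair-at-a-time approach, this keeps every row in memory and does not require duplicates to be adjacent, which suits a small file like this one better than a very large one.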
Re^4: How to match duplicate lines in a text file and extract only one of those lines to a new file by 1nickt (Canon) on Aug 19, 2015 at 10:01 UTC
Re^5: How to match duplicate lines in a text file and extract only one of those lines to a new file by perlnewbie012215 (Novice) on Aug 20, 2015 at 04:23 UTC

