Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by nemesdani (Friar) on Apr 04, 2012 at 11:41 UTC
|
| [reply] [d/l] |
|
Yes, the numbers in the first column are duplicates. The first column is a unique ID number given to an individual. As you probably guessed, the file contains DNA sequences, and each individual is allocated 2 rows to represent their alternative alleles. However, I need to transform the data so that each individual has only 1 row instead of 2.
| [reply] |
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by ww (Archbishop) on Apr 04, 2012 at 11:34 UTC
|
- Yes.
- We encourage you to search the solutions already posted here
| [reply] |
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by tobyink (Canon) on Apr 04, 2012 at 11:42 UTC
|
my %data;
while (<DATA>)
{
    chomp;
    my ($firstnum, $secondnum, $thingy, @bits) = split /\s/;
    my $key = sprintf("%s\x00%s\x00%s", $firstnum, $secondnum, $thingy);
    for my $i (0 .. $#bits)
    {
        $data{$key}[$i] = [] unless exists $data{$key}[$i];
        push @{ $data{$key}[$i] }, $bits[$i];
    }
}
foreach my $key (sort keys %data)
{
    print join q[ ], split "\x00", $key;
    print q[ ];
    print join q[ ], map { join '/', @$_ } @{ $data{$key} };
    print "\n";
}
__DATA__
1 51 Brahui A C A A T
1 51 Brahui A C A G T
3 51 Brahui A C A G C
3 51 Brahui A C G A T
5 51 Brahui A C G A T
5 51 Brahui A C G G C
7 51 Brahui A C G A T
7 51 Brahui A C G G T
9 51 Brahui A C G G T
9 51 Brahui A C G G T
But don't just copy that as-is. Try to understand how it works. What you want to look at is:
- "I/O Operators" in perlop.
- split, join and map - see perlfunc.
- perllol to teach you about nested data structures.
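As a minimal sketch of how those pieces fit together: split breaks each line into fields, an array of arrays holds the rows, and map with join glues each allele column back together. The sub name merge_pair and the two sample rows are made up for illustration and assume three leading fields that are identical in both rows.

```perl
use strict;
use warnings;

# merge_pair(@lines): combine the allele columns of lines that share the
# same three leading fields. Returns one space-separated string.
sub merge_pair {
    my @rows  = map { [ split ' ', $_ ] } @_;   # array of arrays (see perllol)
    my @fixed = @{ $rows[0] }[0 .. 2];          # the three leading fields
    my @merged = map {
        my $col = $_;
        join '/', map { $_->[$col] } @rows;     # one "x/y" entry per column
    } 3 .. $#{ $rows[0] };
    return join ' ', @fixed, @merged;
}

# Two hypothetical rows for the same individual (illustrative data only).
print merge_pair("1 51 Brahui A C", "1 51 Brahui G C"), "\n";
# prints: 1 51 Brahui A/G C/C
```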
| [reply] [d/l] [select] |
|
In your original sample data, every line began with two integers and then a text string. Now you seem to be running it on lines that begin with a single integer and a text string, so his code is picking up the first allele as part of the duplicated section.
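One way to guard against that kind of mismatch is to make the number of leading key columns a parameter instead of hard-coding three. A rough sketch follows. The names group_rows and $n_key are hypothetical, the sample lines are invented, and the input is assumed to be whitespace-separated.

```perl
use strict;
use warnings;

# group_rows($n_key, @lines): treat the first $n_key fields of each line as
# the key and join the remaining columns of matching lines with '/'.
# Returns the merged lines in order of first appearance.
sub group_rows {
    my ($n_key, @lines) = @_;
    my (%cols, @order);
    for my $line (@lines) {
        my @f = split ' ', $line;
        next if @f <= $n_key;                    # skip blank or short lines
        my $key = join ' ', @f[0 .. $n_key - 1];
        push @order, $key unless exists $cols{$key};
        my @bits = @f[$n_key .. $#f];
        push @{ $cols{$key}[$_] }, $bits[$_] for 0 .. $#bits;
    }
    return map {
        my $key = $_;
        join ' ', $key, map { join '/', @$_ } @{ $cols{$key} };
    } @order;
}

# Lines that start with a single integer and a name: two key columns.
print "$_\n" for group_rows(2, "1 Brahui A C", "1 Brahui A G");
# prints: 1 Brahui A/A C/G
```

Passing 3 instead of 2 handles lines that start with two integers and a name.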
| [reply] |
|
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by Sinistral (Monsignor) on Apr 04, 2012 at 13:22 UTC
|
If this analysis and file combination is a standard procedure in the genetics domain, then someone has probably come up with a solution on the BioPerl site. I tried using their search engine for HGDP, but that didn't give me any search results. If nothing else, you should consult the site so that, for future projects, you know what work has already been done.
| [reply] |
Re: How to match duplicate lines in a text file and extract only one of those lines to a new file
by Anonymous Monk on Apr 04, 2012 at 16:10 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
my $input_file = 'HGDP.txt';
my $output_file = 'output.txt';
open my $fh, '<', $input_file
or die "Unable to open for read $input_file: $!";
open my $out_fh, '>', $output_file
or die "Unable to open for write $output_file: $!";
local $, = q{ };
local $\ = "\n";
my @rows;
my $static_i = 3; # number of first unjoinable columns
sub print_rows {
    print {$out_fh} @{$rows[0]}[0 .. $static_i - 1], map {
        my @columns;
        foreach my $x (0 .. $#rows) {
            push @columns, $rows[$x][$_];
        }
        join q{/}, @columns;
    } $static_i .. $#{$rows[0]};
}
my $current_id;
while (defined(my $line = <$fh>)) {
    my ($id) = $line =~ /^(\d+)/ or next;    # skip lines without a leading ID
    # A new ID means the previous individual's rows are complete.
    if (defined $current_id && $id ne $current_id) {
        print_rows();
        @rows = ();
    }
    $current_id = $id;
    push @rows, [split q{ }, $line];
}
print_rows() if @rows;    # flush the last individual
close $fh;
close $out_fh;
__END__
Output from your example:
1 51 Brahui A/A C/C A/A A/G T/T
3 51 Brahui A/A C/C A/G G/A C/T
5 51 Brahui A/A C/C G/G A/G T/C
7 51 Brahui A/A C/C G/G A/G T/T
9 51 Brahui A/A C/C G/G G/G T/T
Note that the code is not very efficient or well written, but you can try to improve it. Good luck!
| [reply] [d/l] |
|
Thank you very much! Though you said it needs improving, I felt like I understood your code!
| [reply] |
|
Hi guys,
I am new to Perl. I have a situation very similar to this one: my input rows are given below, and I have to find the duplicates in the first column.
green apple
green grapes
blue blueberries
orange pappaya
orange orange
Output:
green apple/grapes
blue blueberries
orange pappaya/orange
Can one of you guys please explain this code?
Thanks
| [reply] [d/l] [select] |
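For input like the colour/fruit rows above, where the key is a word instead of a number and some keys occur only once, a hash of arrays is one straightforward approach. The sketch below is not the code from the earlier replies. group_by_first is a hypothetical helper name, and it assumes whitespace-separated fields with the key in the first column and a single value in the second.

```perl
use strict;
use warnings;

# group_by_first(@lines): collect the second field of every line under its
# first field, keeping keys in the order they first appear.
sub group_by_first {
    my (%values, @order);
    for my $line (@_) {
        my ($key, $value) = split ' ', $line;
        next unless defined $value;              # skip blank or one-field lines
        push @order, $key unless exists $values{$key};
        push @{ $values{$key} }, $value;
    }
    return map { $_ . ' ' . join('/', @{ $values{$_} }) } @order;
}

print "$_\n" for group_by_first(
    "green apple", "green grapes", "blue blueberries",
    "orange pappaya", "orange orange",
);
```

With those five lines, this prints the three output rows shown in the question.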