Re: Removing repeated lines from file
by broquaint (Abbot) on Jun 24, 2003 at 11:44 UTC
|
Time for one of them nifty one-liners
perl -ni -e 'print if $last ne $_; $last = $_' your_file
See. perlrun for more info.
HTH
_________ broquaint | [reply] [d/l] |
|
I´ve just created a file called repetitive.txt on my Cygwin system and used perl, v5.6.1 to try your one liner but it gave an error:
Can't do inplace edit on repetitive.txt: Permission denied.
And to my surprise the file was deleted. What can be the reason of that behaviour? Did I miss something?
| [reply] |
|
Perl on Windows can't use -i without an extension. I guess the same applies to Cygwin.
Try -ni.bak instead.
--
John.
| [reply] |
|
Re: Removing repeated lines from file
by ant9000 (Monk) on Jun 24, 2003 at 11:47 UTC
|
If you can keep track of lines already read, it's trivial:
%read=();
while(defined($_=<FILE>)){
if(!defined($read_lines{$_})){
print OUTFILE $_;
$read{$_}=1;
}
}
If that's too big for your memory to hold (is it, really?),
you could try to get a unique signature for each line and save that. What about an MD5 hash of it? It's 32 bites per input line, so it could be a good starting point.
Beware, MD5 is not that fast if you have billions of lines! | [reply] [d/l] |
|
perl -ni -e 'print if $seen{$_}; $seen{$_} = 1' your_file
| [reply] [d/l] [select] |
|
Here is another variation:
perl -i.bak -ne 'print unless $h{$_}++' file
However, since these solutions store unique lines in memory they are potentially slow for very large files.
--
John.
| [reply] [d/l] |
|
Ooops... I should never write untested code ;-)
| [reply] |
|
Well, it was already given that storing it all in an array
was taking too much memory. Given that a hash uses even
more overhead than an array, your solution only wins if there
are many duplicates. And while an MD5 hash may only be 32 bits
(4 bytes), a single Perl scalar already takes at least 24 bytes. Plus two handfuls of bytes for being a hash entry.
And the overhead of the hash itself.
Abigail
| [reply] |
|
Although it's unlikely, it is possible that two distinct lines could generate the same MD5 hash. Then you'd erroneously toss one of the lines, thinking it was a duplicate of the first.
| [reply] |
Re: Removing repeated lines from file
by Tomte (Priest) on Jun 24, 2003 at 11:39 UTC
|
You only have to remember the last unique-line, a shot from the hip (untested!):
open(FILE, "<", name.ext") or die("couldn't open file: $!");
open(OUT, ">", new.ext") or die("couldn't open file for writing: $!")
+;
my $cmp = <FILE>; # the first line
print OUT $cmp; # is always unique
while(<FILE>) {
if ($_ ne $cmp) { # current line unlike last one?
print OUT $_; # print it out
$cmp = $_; # current line new reference
} # otherwise ignore
}
regards,
tomte
Hlade's Law:
If you have a difficult task, give it to a lazy person --
they will find an easier way to do it.
| [reply] [d/l] |
Re: Removing repeated lines from file
by BazB (Priest) on Jun 24, 2003 at 11:49 UTC
|
Have a look at the Perl Power Tool (PPT) project's implementation of the UNIX uniq command.
You'll need to sort your data before passing it through Tomte's solution or uniq.
If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
That way everyone learns.
| [reply] |
|
Considering that one of the requirements was to keep the order
of the line, you'd also need to undo the sort.
Abigail
| [reply] |
|
Each line is made up of 1 or two tab delimited lines. Could I just stick this into a MySQL database and remove duplicated rows using SQL?
| [reply] |
Re: Removing repeated lines from file
by diotalevi (Canon) on Jun 24, 2003 at 11:45 UTC
|
| [reply] [d/l] |
Re: Removing repeated lines from file
by Abigail-II (Bishop) on Jun 24, 2003 at 12:50 UTC
|
IMO, the best way is not to use Perl. There's an excellent
utility doing exactly this, it's called uniq, which
comes standard with any flavour of Unix I've encountered,
and is also present in Unix toolkits for Windows.
uniq in_file out_file
Abigail | [reply] [d/l] |
|
| [reply] [d/l] |
|
uniq works on sorted files only. The man page suggests that it does
remowe consecutive duplicates, but the first sentence states that it
removes repeating lines from sorted files.
Then either your man page lies, or your vendor uses a uniq
implementation that's not confirming to the POSIX standard.
From the POSIX 1003.1 standard:
DESCRIPTION
The uniq utility shall read an input file comparing adjacent
lines, and write one copy of each input line on the output. The second
and succeeding copies of repeated adjacent input lines shall not be
written.
Repeated lines in the input shall not be detected if they are not adjacent.
Abigail
| [reply] |
|
My testing show that uniq does only remove duplicate lines from sorted rows.
| [reply] |
Re: Removing repeated lines from file
by Anonymous Monk on Jun 24, 2003 at 20:13 UTC
|
here's a ghetto recipe for this situation where memory is at a premium though, it seems, CPU / time is less so...
1. step thru the document and append a line number to the end of each line. zero pad it to known length for later removal.
2. sort the file.
3. remove the line endings to another file, say, numbers.txt ,maintaining order (they'll be out of order of course, having been shuffled with the sort.
4. step through the sorted file with a uniq-ish algorithm that'll remove consecutive dupes, *but* track the line numbers (as in offset from the begining) being deleted. Delete the corresponding line from the file from 3. i.e. if the 5'th line of text is a dupe and is removed, delete the 5th line in your numbers.txt.
5. when you've gone through all the text, take the numbers.txt file and this time prepend it to the lines in the text file, line for line.
6. resort the file. it should now be in it's original order.
7. remove the line numbers.
it ain't pretty and it's slow and i/o bound, but it won't use much memory. | [reply] |
|
$ cat -n config.log | head
1 This file contains any messages produced by compilers while
2 running configure, to aid debugging if configure makes a mista
+ke.
3
4 It was created by configure, which was
5 generated by GNU Autoconf 2.57. Invocation command line was
6
7 $ ./configure
8
9 ## --------- ##
10 ## Platform. ##
# lines 3,6,8 are the duplicates in this little test.
# fields are tab seperated.
$ cat -n config.log | head | # number the lines
sort -k 2 | # sort on second field
uniq -f 1 | # uniq skipping first field
sort -k 1,1n | # numeric sort on first field
cut -f2 # extract second field
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
It was created by configure, which was
generated by GNU Autoconf 2.57. Invocation command line was
$ ./configure
## --------- ##
## Platform. ##
| [reply] [d/l] |
|
Spoken like a true unix fan. :) Simple tools and pipes are cool.
| [reply] |
|
Of course, in reality the problem is un-solvable. A decision needs to be made in order to amend the problem statement and render a solvable problem. Which one of the N copies of any particular duplicate line should be preserved?
| [reply] |
|
| [reply] |
Re: Removing repeated lines from file
by husker (Chaplain) on Jun 24, 2003 at 14:30 UTC
|
The MD5 hash suggestion got me thinking.
It seems like the two big obstacles are 1) the duplicate lines are not necessarily adjacent, and you cannot sort it to make them so, and 2) there's too much data to be held "in place".
What if we could get around obstacle 2? Perhaps if we used some lossless compression on your input, we could reduce it's storage requirement. If the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from it's compressed image), then if we compress two unique lines, their compressed results should also be unique.
Depending on how much compression you are able to get, you may very well be able to process your input "in memory".
OK I guess it really doesn't solve the storage problem per se, just kind of avoids it. It's possible that even with compression, your input stream is just too big. | [reply] |
|
As my duplicate lines tend to be clustered I have tried this code:
while(<INFILE>){
#print "here\n";
if ($_ =~ /^(.{1,50})\t{0,50}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,
+50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\
+t{0,1}(.{0,50})/){
#print "here_B\n";
$bumb = "";
$gene_id = $1;
if ($2 =~ /.{1,50}/){ push (@alternative_ids, $2)}
else{
push (@alternative_ids, $bumb);
}
if ($3 =~ /.{1,50}/){ push (@alternative_ids, $3)}
else{
push (@alternative_ids, $bumb);
}
if ($4 =~ /.{1,50}/){ push (@alternative_ids, $4)}
else{
push (@alternative_ids, $bumb);
}
if ($5 =~ /.{1,50}/){ push (@alternative_ids, $5)}
else{
push (@alternative_ids, $bumb);
}
if ($6 =~ /.{1,50}/){ push (@alternative_ids, $6)}
else{
push (@alternative_ids, $bumb);
}
if ($7 =~ /.{1,50}/){ push (@alternative_ids, $7)}
else{
push (@alternative_ids, $bumb);
}
if ($8 =~ /.{1,50}/){ push (@alternative_ids, $8)}
else{
push (@alternative_ids, $bumb);
}
if ($9 =~ /.{1,50}/){ push (@alternative_ids, $9)}
else{
push (@alternative_ids, $bumb);
}
print @alternative_id;
foreach (@alternative_ids){
$old_line = $new_line;
$alternative_id = $_;
$new_line = "$gene_id\t$alternative_id\n";
@record_previous_lines = qw/blank/;
if ($new_line ne $old_line){ #discount consecutive repeats of
+identical lines
$switch = 1;
foreach(@record_previous_lines){
if ($_ =~ /$new_line.$old_line/){
$switch = 2;
}
}
if ($switch = 1){ # don't print lines that have previously bee
+n repeated
print OUT_B "$new_line$old_line";
}
if (@record_previous_lines > 200){ # don't let this array get
+bigger than 200
shift @record_previous_lines; # shift deletes the first va
+lue of an array
}
push (@record_previous_lines, $new_line.$old_lines);
}
}
undef @alternative_ids;
}
}
The last bit doesn't seem to be working as I want. Can anyone thow light on this? Sorry I am not being more specific. | [reply] [d/l] |
|
| [reply] [d/l] [select] |
|
| [reply] |
|
|
#
# remember everything, probably uses too much memory.
#
{
my %seen;
sub seen_complete {
return 1 if exists $seen{$_[0]};
$seen{$_[0]} = ();
return 0;
}
}
#
# remember last N lines only.
#
{
my %seen;
my $remember = 200;
my @memory;
sub seen_fixed {
return 1 if exists $seen{$_[0]};
delete $seen{shift(@memory)} if @memory > $remember;
push @memory, $_[0];
$seen{$_[0]} = ();
return 0;
}
}
#
# remember N buckets of lines with X lines per bucket.
#
{
my @bucket = ( {} );
my $numbuckets = 2;
my $bucketsize = 200;
sub seen_bucket {
foreach (@bucket) {
return 1 if exists $_->{$_[0]};
}
if (keys %{$bucket[-1]} >= $bucketsize) {
shift @bucket if @bucket >= $numbuckets;
push @bucket, {};
}
$bucket[-1]->{$_[0]} = ();
return 0;
}
}
i only tested the last one, and only sorta tested it at that.
while (<>) {
print unless seen_bucket($_);
}
__END__
Ten sets of 1..400 should get uniq'd to 1..400
$ perl -le 'for(1..10){for(1..400){print}}' | perl dup.pl | wc -l
400
Ten sets of 1..401 should get uniq'd to (1..401) x 10 because
2 buckets of 200 lines hold upto 400
$ perl -le 'for(1..10){for(1..401){print}}' | perl dup.pl | wc -l
4010
| [reply] [d/l] [select] |
Re: Removing repeated lines from file
by husker (Chaplain) on Jun 25, 2003 at 20:27 UTC
|
Until dataset is 1 line
Split dataset into two halves
Take intersection of sets
Store intersection in duplicate list
Split each dataset into two datasets, and repeat
end
Open original dataset file
Until EOD
read line
compare to list of known duplicates
if in that list
if duplicate flag not marked
emit line to output
mark duplicate as emitted
endif
else
emit line on output
endif
end
I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
Then I realized it should be much easier:
Sort a copy of the datafile
Open sorted copy
Until EOD
Read line
Compare to previous line
If line == previous line
if line not in duplicate table
put line in duplicate table
endif
else
previous line = line
endif
end
Open original data file
Until EOD
read line
if line in duplicate table
if duplicate not marked
emit line on output
mark duplicate line
end
else
emit line on output
endif
end
Both of these have the advantage of only needing to store the duplicate lines. Both have the disadvantage of having to read through the input set multiple times.
Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).
| [reply] [d/l] [select] |
|
| [reply] |
|
It looked to me like they put the entire file contents in a hash. Mine only puts duplicate lines in the hash.
| [reply] |
Re: Removing repeated lines from file
by Anonymous Monk on Jun 25, 2003 at 19:29 UTC
|
PUT EACH LINE IN A HASH, IT'LL ELIM DUPES FOR YOU AUTOMATICALLY.
open(INFH, "<filename.txt");
my @data=<INFH>;
close(INFH);
open(OUTFH, ">outfile.txt");
my %hashdata = ();
foreach my $thisline (@data)
{ $hashdata{$thisline} = 1; }
foreach my $thisline (sort(keys(%hashdata)))
{ print(OUTFH, $thisline) }
close(OUTFH);
| [reply] [d/l] |