Removing repeated lines from file

by matth (Monk)
on Jun 24, 2003 at 11:26 UTC

matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is the best way to go through a text file and remove lines that are repeated? I wish to otherwise maintain the order of the lines, and there is too much data to fit into an array.

Replies are listed 'Best First'.
Re: Removing repeated lines from file
by broquaint (Abbot) on Jun 24, 2003 at 11:44 UTC
    Time for one of them nifty one-liners
    perl -ni -e 'print if $last ne $_; $last = $_' your_file
    See perlrun for more info.
    HTH

    _________
    broquaint

      I've just created a file called repetitive.txt on my Cygwin system and used perl v5.6.1 to try your one-liner, but it gave an error:

      Can't do inplace edit on repetitive.txt: Permission denied.

      And to my surprise the file was deleted. What could be the reason for that behaviour? Did I miss something?

        Perl on Windows can't use -i without an extension. I guess the same applies to Cygwin.

        Try -ni.bak instead.

        --
        John.

Re: Removing repeated lines from file
by ant9000 (Monk) on Jun 24, 2003 at 11:47 UTC
    If you can keep track of lines already read, it's trivial:
    %read = ();
    while (defined($_ = <FILE>)) {
        if (!defined($read_lines{$_})) {
            print OUTFILE $_;
            $read{$_} = 1;
        }
    }
    If that's too big for your memory to hold (is it, really?), you could try to compute a unique signature for each line and save that instead. What about an MD5 hash of it? That's 16 bytes (32 hex characters) per input line, so it could be a good starting point. Beware, MD5 is not that fast if you have billions of lines!
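    A minimal sketch of that signature idea, assuming the Digest::MD5 module and placeholder file names:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Remember a 16-byte digest per unique line instead of the whole line.
    my %seen;
    open my $in,  '<', 'in.txt'  or die "open in.txt: $!";
    open my $out, '>', 'out.txt' or die "open out.txt: $!";
    while (my $line = <$in>) {
        my $sig = md5($line);                     # 16-byte binary digest
        print {$out} $line unless $seen{$sig}++;
    }
    close $in;
    close $out;

    As a reply below notes, two different lines could in principle share a digest, in which case one of them would be dropped by mistake.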
      This is the only solution so far that solves the general problem of repeated lines, not just consecutive repeats ++. Just fix the name of the hash: you currently use two, %read_lines and %read, where there should be one, I think.

      A condensed version (based on broquaint's):

      perl -ni -e 'print if $seen{$_}; $seen{$_} = 1' your_file

        Here is another variation:
        perl -i.bak -ne 'print unless $h{$_}++' file

        However, since these solutions keep every unique line in memory, they can use a great deal of memory on very large files.

        --
        John.

        Ooops... I should never write untested code ;-)
      Well, it was already given that storing it all in an array was taking too much memory. Given that a hash uses even more overhead than an array, your solution only wins if there are many duplicates. And while an MD5 digest may only be 16 bytes, a single Perl scalar already takes at least 24 bytes. Plus two handfuls of bytes for being a hash entry. And the overhead of the hash itself.

      Abigail
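      For a rough feel for those numbers, a quick check with the CPAN module Devel::Size (an assumption here; the exact figures vary with perl version and build):

      use strict;
      use warnings;
      use Devel::Size qw(size total_size);

      my $digest = 'x' x 16;                                  # stand-in for a 16-byte MD5 digest
      my %seen   = map { ('x' x 16) . $_ => 1 } 1 .. 10_000;  # 10_000 digest-like keys

      printf "one short scalar: %d bytes\n", size($digest);
      printf "10_000-key hash:  %d bytes\n", total_size(\%seen);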

      Although it's unlikely, it is possible that two distinct lines could generate the same MD5 hash. Then you'd erroneously toss one of the lines, thinking it was a duplicate of the first.
Re: Removing repeated lines from file
by Tomte (Priest) on Jun 24, 2003 at 11:39 UTC

    You only have to remember the last unique line; a shot from the hip (untested!):

    open(FILE, "<", "name.ext") or die("couldn't open file: $!");
    open(OUT,  ">", "new.ext")  or die("couldn't open file for writing: $!");
    my $cmp = <FILE>;   # the first line
    print OUT $cmp;     # is always unique
    while (<FILE>) {
        if ($_ ne $cmp) {    # current line unlike last one?
            print OUT $_;    # print it out
            $cmp = $_;       # current line new reference
        }                    # otherwise ignore
    }

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Re: Removing repeated lines from file
by BazB (Priest) on Jun 24, 2003 at 11:49 UTC

    Have a look at the Perl Power Tool (PPT) project's implementation of the UNIX uniq command.

    You'll need to sort your data before passing it through Tomte's solution or uniq.


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      Considering that one of the requirements was to keep the order of the lines, you'd also need to undo the sort.

      Abigail

      Each line is made up of one or two tab-delimited fields. Could I just stick this into a MySQL database and remove duplicated rows using SQL?
Re: Removing repeated lines from file
by diotalevi (Canon) on Jun 24, 2003 at 11:45 UTC

    This is a one-liner. If you add in the '-ibck' flag then you can modify the file directly.

    perl -ne 'next if $prev eq $_; print; $prev = $_'
Re: Removing repeated lines from file
by Abigail-II (Bishop) on Jun 24, 2003 at 12:50 UTC
    IMO, the best way is not to use Perl. There's an excellent utility doing exactly this, it's called uniq, which comes standard with any flavour of Unix I've encountered, and is also present in Unix toolkits for Windows.

    uniq in_file out_file

    Abigail

      uniq works on sorted files only. The man page suggests that it does remove consecutive duplicates, but its first sentence states that it removes repeated lines from sorted files.

      Update: As Abigail-II pointed out, according to POSIX the suggestion is right. But in any case it only works for consecutive duplicates.

        uniq works on sorted files only. The man page suggests that it does remove consecutive duplicates, but its first sentence states that it removes repeated lines from sorted files.

        Then either your man page lies, or your vendor uses a uniq implementation that's not conforming to the POSIX standard.

        From the POSIX 1003.1 standard:

        DESCRIPTION
        The uniq utility shall read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines shall not be written.

        Repeated lines in the input shall not be detected if they are not adjacent.

        Abigail

        My testing shows that uniq only removes duplicate lines that are adjacent, i.e. from sorted input.
Re: Removing repeated lines from file
by Anonymous Monk on Jun 24, 2003 at 20:13 UTC
    here's a ghetto recipe for this situation, where memory is at a premium but, it seems, CPU time is less so...

    1. Step through the document and append a line number to the end of each line. Zero-pad it to a known length for later removal.

    2. Sort the file.

    3. Move the appended line numbers out to another file, say numbers.txt, maintaining order (they'll be out of numeric order, of course, having been shuffled by the sort).

    4. Step through the sorted file with a uniq-ish algorithm that removes consecutive dupes, *but* track which lines (as in offsets from the beginning) are being deleted, and delete the corresponding lines from the file from step 3. I.e. if the 5th line of text is a dupe and is removed, delete the 5th line in your numbers.txt.

    5. When you've gone through all the text, take the numbers.txt file and this time prepend its lines to the lines in the text file, line for line.

    6. Re-sort the file. It should now be in its original order.

    7. Remove the line numbers.

    It ain't pretty, and it's slow and I/O-bound, but it won't use much memory; a rough Perl sketch follows.
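    A rough Perl sketch of that recipe, leaning on the external sort(1) so nothing large stays in memory; the file names, the tab separator and the nine-digit zero padding are assumptions for illustration:

    use strict;
    use warnings;

    my ($in, $out) = ('big_input.txt', 'deduped.txt');

    # 1. decorate: append a zero-padded line number to every line
    open my $IN,  '<', $in            or die "open $in: $!";
    open my $NUM, '>', "$in.numbered" or die "open $in.numbered: $!";
    while (my $line = <$IN>) {
        chomp $line;
        printf $NUM "%s\t%09d\n", $line, $.;
    }
    close $IN;
    close $NUM;

    # 2. sort on the text so that duplicates become adjacent
    system("sort $in.numbered > $in.sorted") == 0 or die "sort failed";

    # 3./4. drop adjacent duplicates, keeping each survivor's original number
    open my $SORTED, '<', "$in.sorted" or die "open $in.sorted: $!";
    open my $UNIQ,   '>', "$in.uniq"   or die "open $in.uniq: $!";
    my $prev = "\0no line yet\0";      # sentinel that can't match a real line
    while (<$SORTED>) {
        my ($text, $num) = /^(.*)\t(\d+)$/;
        print $UNIQ "$num\t$text\n" unless $text eq $prev;
        $prev = $text;
    }
    close $SORTED;
    close $UNIQ;

    # 5./6. sort back into the original order and strip the numbers again
    system("sort -n $in.uniq > $in.renumbered") == 0 or die "sort failed";
    open my $REN, '<', "$in.renumbered" or die "open $in.renumbered: $!";
    open my $OUT, '>', $out            or die "open $out: $!";
    while (my $line = <$REN>) {
        $line =~ s/^\d+\t//;
        print $OUT $line;
    }
    close $REN;
    close $OUT;

    Because the text sorts ahead of the appended number, the copy that survives for each duplicate is the one with the lowest line number, i.e. the first occurrence.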

      sigh, ++ to Anonymous

      If you have capable text tools (some sort/uniq/cut implementations are really lame), it's even easier to accomplish:

      $ cat -n config.log | head
           1  This file contains any messages produced by compilers while
           2  running configure, to aid debugging if configure makes a mistake.
           3
           4  It was created by configure, which was
           5  generated by GNU Autoconf 2.57. Invocation command line was
           6
           7    $ ./configure
           8
           9  ## --------- ##
          10  ## Platform. ##

      # lines 3, 6, 8 are the duplicates in this little test.
      # fields are tab separated.
      $ cat -n config.log | head |  # number the lines
        sort -k 2 |                 # sort on second field
        uniq -f 1 |                 # uniq skipping first field
        sort -k 1,1n |              # numeric sort on first field
        cut -f2                     # extract second field
      This file contains any messages produced by compilers while
      running configure, to aid debugging if configure makes a mistake.

      It was created by configure, which was
      generated by GNU Autoconf 2.57. Invocation command line was
        $ ./configure
      ## --------- ##
      ## Platform. ##
        Spoken like a true unix fan. :) Simple tools and pipes are cool.
      Of course, as stated the problem is not quite solvable: a decision needs to be made to amend the problem statement. Which one of the N copies of any particular duplicate line should be preserved?
        Hm yes I was thinking the same thing: if the order is important, so is the decision about which one to throw away. Also, in the sample code that matth provides, he's pulling his input lines apart and categorizing them in some way that makes sense only to him, and might be useful for others to know about.

        Can we see some sample input?

Re: Removing repeated lines from file
by husker (Chaplain) on Jun 24, 2003 at 14:30 UTC
    The MD5 hash suggestion got me thinking.

    It seems like the two big obstacles are 1) the duplicate lines are not necessarily adjacent, and you cannot sort the data to make them so, and 2) there's too much data to hold in memory.

    What if we could get around obstacle 2? Perhaps if we used some lossless compression on your input, we could reduce its storage requirement. If the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from its compressed image), then if we compress two distinct lines, their compressed results must also be distinct.

    Depending on how much compression you are able to get, you may very well be able to process your input "in memory".

    OK I guess it really doesn't solve the storage problem per se, just kind of avoids it. It's possible that even with compression, your input stream is just too big.
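    A minimal sketch of that idea, assuming Compress::Zlib for the per-line compression. Whether it helps at all depends on how long and how compressible the lines are; zlib's own header overhead can make short lines larger, not smaller.

    use strict;
    use warnings;
    use Compress::Zlib qw(compress);

    # Key the "seen" hash on a compressed copy of each line rather than
    # on the raw line itself.
    my %seen;
    while (my $line = <STDIN>) {
        my $key = compress($line);
        print $line unless $seen{$key}++;
    }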

      As my duplicate lines tend to be clustered I have tried this code:
      while (<INFILE>) {
          #print "here\n";
          if ($_ =~ /^(.{1,50})\t{0,50}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})/) {
              #print "here_B\n";
              $bumb = "";
              $gene_id = $1;
              if ($2 =~ /.{1,50}/) { push (@alternative_ids, $2) }
              else                 { push (@alternative_ids, $bumb); }
              if ($3 =~ /.{1,50}/) { push (@alternative_ids, $3) }
              else                 { push (@alternative_ids, $bumb); }
              if ($4 =~ /.{1,50}/) { push (@alternative_ids, $4) }
              else                 { push (@alternative_ids, $bumb); }
              if ($5 =~ /.{1,50}/) { push (@alternative_ids, $5) }
              else                 { push (@alternative_ids, $bumb); }
              if ($6 =~ /.{1,50}/) { push (@alternative_ids, $6) }
              else                 { push (@alternative_ids, $bumb); }
              if ($7 =~ /.{1,50}/) { push (@alternative_ids, $7) }
              else                 { push (@alternative_ids, $bumb); }
              if ($8 =~ /.{1,50}/) { push (@alternative_ids, $8) }
              else                 { push (@alternative_ids, $bumb); }
              if ($9 =~ /.{1,50}/) { push (@alternative_ids, $9) }
              else                 { push (@alternative_ids, $bumb); }
              print @alternative_id;
              foreach (@alternative_ids) {
                  $old_line = $new_line;
                  $alternative_id = $_;
                  $new_line = "$gene_id\t$alternative_id\n";
                  @record_previous_lines = qw/blank/;
                  if ($new_line ne $old_line) {   # discount consecutive repeats of identical lines
                      $switch = 1;
                      foreach (@record_previous_lines) {
                          if ($_ =~ /$new_line.$old_line/) {
                              $switch = 2;
                          }
                      }
                      if ($switch = 1) {   # don't print lines that have previously been repeated
                          print OUT_B "$new_line$old_line";
                      }
                      if (@record_previous_lines > 200) {   # don't let this array get bigger than 200
                          shift @record_previous_lines;     # shift deletes the first value of an array
                      }
                      push (@record_previous_lines, $new_line.$old_lines);
                  }
              }
              undef @alternative_ids;
          }
      }
      The last bit doesn't seem to be working as I want. Can anyone throw light on this? Sorry I am not being more specific.

        In

        if ($switch = 1) {   # don't print lines that have previously been repeated
            print OUT_B "$new_line$old_line";
        }
        should that not be
              if ($switch == 1)

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        OK well if you had said "I expect to find duplicate lines within 200 lines of the original" then we could have helped you sooner!

        :)

        So ... how do you want that "last bit" to work, and how is it really working?

        How clustered are your lines? Here are a couple of tweakable previous-line recognizers.

        #
        # remember everything, probably uses too much memory.
        #
        {
            my %seen;
            sub seen_complete {
                return 1 if exists $seen{$_[0]};
                $seen{$_[0]} = ();
                return 0;
            }
        }

        #
        # remember last N lines only.
        #
        {
            my %seen;
            my $remember = 200;
            my @memory;
            sub seen_fixed {
                return 1 if exists $seen{$_[0]};
                delete $seen{shift(@memory)} if @memory > $remember;
                push @memory, $_[0];
                $seen{$_[0]} = ();
                return 0;
            }
        }

        #
        # remember N buckets of lines with X lines per bucket.
        #
        {
            my @bucket = ( {} );
            my $numbuckets = 2;
            my $bucketsize = 200;
            sub seen_bucket {
                foreach (@bucket) {
                    return 1 if exists $_->{$_[0]};
                }
                if (keys %{$bucket[-1]} >= $bucketsize) {
                    shift @bucket if @bucket >= $numbuckets;
                    push @bucket, {};
                }
                $bucket[-1]->{$_[0]} = ();
                return 0;
            }
        }

        i only tested the last one, and only sorta tested it at that.

        while (<>) {
            print unless seen_bucket($_);
        }
        __END__
        Ten sets of 1..400 should get uniq'd to 1..400

        $ perl -le 'for(1..10){for(1..400){print}}' | perl dup.pl | wc -l
        400

        Ten sets of 1..401 should get uniq'd to (1..401) x 10
        because 2 buckets of 200 lines hold up to 400

        $ perl -le 'for(1..10){for(1..401){print}}' | perl dup.pl | wc -l
        4010
Re: Removing repeated lines from file
by husker (Chaplain) on Jun 25, 2003 at 20:27 UTC
    After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much simpler.

    The first one used set theory and recursion. It went like this:

    Until dataset is 1 line
        Split dataset into two halves
        Take intersection of sets
        Store intersection in duplicate list
        Split each dataset into two datasets, and repeat
    end
    Open original dataset file
    Until EOD
        read line
        compare to list of known duplicates
        if in that list
            if duplicate flag not marked
                emit line to output
                mark duplicate as emitted
            endif
        else
            emit line on output
        endif
    end
    I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.

    Then I realized it should be much easier:

    Sort a copy of the datafile
    Open sorted copy
    Until EOD
        Read line
        Compare to previous line
        If line == previous line
            if line not in duplicate table
                put line in duplicate table
            endif
        else
            previous line = line
        endif
    end
    Open original data file
    Until EOD
        read line
        if line in duplicate table
            if duplicate not marked
                emit line on output
                mark duplicate line
            endif
        else
            emit line on output
        endif
    end

    Both of these have the advantage of only needing to store the duplicate lines. Both have the disadvantage of having to read through the input set multiple times.

    Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).
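    A hedged Perl sketch of that second approach, assuming an external sort(1) and a placeholder file name; only lines that actually occur more than once ever land in the hash:

    use strict;
    use warnings;

    my $file = 'data.txt';

    # Pass 1: scan a sorted copy and record which lines occur more than once.
    system("sort $file > $file.sorted") == 0 or die "sort failed";
    my %dup;
    {
        open my $S, '<', "$file.sorted" or die "open $file.sorted: $!";
        my $prev = "\0no line yet\0";
        while (my $line = <$S>) {
            $dup{$line} = 0 if $line eq $prev;   # 0 = duplicate, not yet emitted
            $prev = $line;
        }
        close $S;
    }

    # Pass 2: re-read the original in order, emitting each duplicate only once.
    open my $IN, '<', $file or die "open $file: $!";
    while (my $line = <$IN>) {
        if (exists $dup{$line}) {
            print $line unless $dup{$line}++;    # first occurrence of a duplicate
        }
        else {
            print $line;                         # unique line, always emitted
        }
    }
    close $IN;

    If duplicates are rare, the %dup hash stays far smaller than a hash of every line.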

      Your second solution is essentially what the proposals using a hash all do.

      Makeshifts last the longest.

        It looked to me like they put the entire file contents in a hash. Mine only puts duplicate lines in the hash.
Re: Removing repeated lines from file
by Anonymous Monk on Jun 25, 2003 at 19:29 UTC
    Put each line in a hash; it'll eliminate dupes for you automatically.
    open(INFH, "<filename.txt");
    my @data = <INFH>;
    close(INFH);
    open(OUTFH, ">outfile.txt");
    my %hashdata = ();
    foreach my $thisline (@data) {
        $hashdata{$thisline} = 1;
    }
    foreach my $thisline (sort(keys(%hashdata))) {
        print OUTFH $thisline;
    }
    close(OUTFH);
