http://qs321.pair.com?node_id=268660


in reply to Removing repeated lines from file

here's a ghetto recipe for this situation, where memory is at a premium but, it seems, CPU / time is less so...

1. step through the document and append a line number to the end of each line. zero pad it to a known length so it can be stripped off again later.

2. sort the file.

3. strip the appended line numbers off the ends of the lines into another file, say, numbers.txt, maintaining order (they'll be out of numeric order of course, having been shuffled by the sort).

4. step through the sorted file with a uniq-ish algorithm that removes consecutive dupes, *but* track the positions (as in offsets from the beginning of the sorted file) of the lines being deleted, and delete the corresponding lines from the file made in step 3. i.e. if the 5th line of text is a dupe and gets removed, delete the 5th line of numbers.txt as well.

5. when you've gone through all the text, take the numbers.txt file and this time prepend its numbers to the lines of the text file, line for line.

6. re-sort the file (the zero padding means a plain sort is enough). it should now be back in its original order.

7. remove the line numbers.

it ain't pretty and it's slow and i/o bound, but it won't use much memory. a rough perl sketch of the whole recipe follows.
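
this is only a sketch: it leans on the external sort(1) so the big sorts stay on disk; the file names, the tab separator and the 10-digit padding are arbitrary choices; and steps 3 through 5 are folded into a single pass over the sorted file instead of keeping a separate numbers.txt.

    #!/usr/bin/perl
    # sketch only: the file names, the tab separator and the 10-digit
    # zero padding are arbitrary choices.  sort(1) does the heavy
    # lifting on disk, so memory use stays small.
    use strict;
    use warnings;

    $ENV{LC_ALL} = 'C';    # byte-wise sort keeps identical lines adjacent

    my $in  = 'input.txt';
    my $out = 'deduped.txt';

    # step 1: append a zero-padded line number to each line
    open my $IN, '<', $in         or die "$in: $!";
    open my $S1, '>', 'step1.txt' or die "step1.txt: $!";
    my $n = 0;
    while (my $line = <$IN>) {
        chomp $line;
        printf {$S1} "%s\t%010d\n", $line, $n++;
    }
    close $IN;
    close $S1;

    # step 2: sort -- identical text lands on consecutive lines, and the
    # padded number breaks ties so the earliest copy sorts first
    system('sort', '-o', 'step2.txt', 'step1.txt') == 0 or die "sort failed";

    # steps 3-5, folded into one pass: drop consecutive dupes and write
    # each survivor back out with its line number moved to the front
    open my $S2, '<', 'step2.txt' or die "step2.txt: $!";
    open my $S5, '>', 'step5.txt' or die "step5.txt: $!";
    my $prev;
    while (my $rec = <$S2>) {
        chomp $rec;
        my ($text, $num) = $rec =~ /^(.*)\t(\d+)$/ or next;
        next if defined $prev && $text eq $prev;    # consecutive dupe: skip
        print {$S5} "$num\t$text\n";
        $prev = $text;
    }
    close $S2;
    close $S5;

    # step 6: a numeric sort on the leading number restores original order
    system('sort', '-n', '-o', 'step6.txt', 'step5.txt') == 0 or die "sort failed";

    # step 7: strip the line numbers back off
    open my $S6,  '<', 'step6.txt' or die "step6.txt: $!";
    open my $OUT, '>', $out        or die "$out: $!";
    while (my $rec = <$S6>) {
        $rec =~ s/^\d+\t//;
        print {$OUT} $rec;
    }
    close $S6;
    close $OUT;

the only per-line state held in memory is $prev, so the footprint stays flat no matter how big the input gets; the two sort(1) runs are where all the time goes.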

Re: Re: Removing repeated lines from file
by zengargoyle (Deacon) on Jun 24, 2003 at 21:00 UTC

    sigh, ++ to Anonymous

    if you have cool text tools (some sort/uniq/cut implementations are really lame), it's even easier to accomplish:

    $ cat -n config.log | head
         1  This file contains any messages produced by compilers while
         2  running configure, to aid debugging if configure makes a mistake.
         3
         4  It was created by configure, which was
         5  generated by GNU Autoconf 2.57.  Invocation command line was
         6
         7    $ ./configure
         8
         9  ## --------- ##
        10  ## Platform. ##

    # lines 3,6,8 are the duplicates in this little test.
    # fields are tab separated.

    $ cat -n config.log | head |  # number the lines
      sort -k 2 |                 # sort on second field
      uniq -f 1 |                 # uniq skipping first field
      sort -k 1,1n |              # numeric sort on first field
      cut -f2                     # extract second field
    This file contains any messages produced by compilers while
    running configure, to aid debugging if configure makes a mistake.

    It was created by configure, which was
    generated by GNU Autoconf 2.57.  Invocation command line was
      $ ./configure
    ## --------- ##
    ## Platform. ##
      Spoken like a true unix fan. :) Simple tools and pipes are cool.
Re: Re: Removing repeated lines from file
by Anonymous Monk on Jun 25, 2003 at 05:23 UTC
    Of course, in reality the problem as stated is under-specified. A decision needs to be made in order to amend the problem statement and render it solvable: which one of the N copies of any particular duplicate line should be preserved?
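
    fwiw, a recipe like the one above ends up keeping the first of the N copies: the zero-padded line number is part of what gets sorted, so the earliest copy sorts to the top of each group and is the one the uniq pass lets through (with GNU sort, whose last-resort comparison falls back to the whole line, the sort/uniq pipeline behaves the same way). when memory isn't the constraint, the classic hash-based one-liner makes the same choice explicitly -- a sketch, with placeholder file names:

        # keeps the first copy of each line; the %seen hash holds every
        # distinct line in memory, which is the cost the recipe above avoids
        perl -ne 'print unless $seen{$_}++' input.txt > deduped.txt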
      Hm, yes, I was thinking the same thing: if the order is important, so is the decision about which copy to throw away. Also, in the sample code that matth provides, he's pulling his input lines apart and categorizing them in some way that, so far, makes sense only to him; it might be useful for the rest of us to know what that's about.

      Can we see some sample input?