http://qs321.pair.com?node_id=268660


in reply to Removing repeated lines from file

here's a ghetto recipe for this situation, where memory is at a premium but, it seems, CPU / time is less so...

1. step thru the document and append a line number to the end of each line. zero-pad it to a known width so it can be stripped off later.

2. sort the file.

3. strip the numbers off the line ends into another file, say numbers.txt, maintaining order (they'll be out of their original order, of course, having been shuffled by the sort).

4. step through the sorted file with a uniq-ish algorithm that'll remove consecutive dupes, *but* track the positions (as in offset from the beginning) of the lines being deleted, and delete the corresponding lines from the file from step 3. e.g. if the 5th line of text is a dupe and is removed, delete the 5th line of your numbers.txt.

5. when you've gone through all the text, take the numbers.txt file and this time prepend its numbers to the lines of the deduped text file, line for line.

6. re-sort the file. it should now be in its original order.

7. remove the line numbers.

it ain't pretty and it's slow and i/o-bound, but it won't use much memory.
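
in case it helps, here's a rough perl sketch of the recipe. the file names (in.txt, out.txt and the intermediates) and the 10-digit counter width are my own placeholders, and it shells out to an external sort(1), since that's what keeps the memory footprint small:

#!/usr/bin/perl
use strict;
use warnings;

# in.txt / out.txt and the intermediate file names are made up;
# a 10-digit counter is assumed to be wide enough for the line count.
my $W = 10;
$ENV{LC_ALL} = 'C';    # byte-order sort, so the padded numbers compare sanely

# step 1: append a zero-padded line number to the end of each line
open my $src, '<', 'in.txt'       or die "in.txt: $!";
open my $out, '>', 'numbered.txt' or die "numbered.txt: $!";
my $n = 0;
while (my $line = <$src>) {
    chomp $line;
    printf $out "%s%0${W}d\n", $line, $n++;
}
close $src;
close $out;

# step 2: sort, shelling out to sort(1) so the sort stays out of our memory
system('sort', '-o', 'sorted.txt', 'numbered.txt') == 0 or die "sort failed";

# steps 3 and 4 rolled together: peel the number off each line, do the
# uniq-ish pass; a kept line's number goes to numbers.txt, a dupe's number
# is simply never written (same effect as deleting it afterwards)
open $src, '<', 'sorted.txt'     or die "sorted.txt: $!";
open my $txt, '>', 'kept.txt'    or die "kept.txt: $!";
open my $num, '>', 'numbers.txt' or die "numbers.txt: $!";
my $prev;
while (my $line = <$src>) {
    chomp $line;
    my $number = substr $line, -$W;       # the padded number we appended
    my $text   = substr $line, 0, -$W;    # the original line
    next if defined $prev && $text eq $prev;    # consecutive dupe: drop it
    print $txt "$text\n";
    print $num "$number\n";
    $prev = $text;
}
close $src;
close $txt;
close $num;

# step 5: prepend each surviving number to its line, line for line
open $txt, '<', 'kept.txt'       or die "kept.txt: $!";
open $num, '<', 'numbers.txt'    or die "numbers.txt: $!";
open $out, '>', 'renumbered.txt' or die "renumbered.txt: $!";
while (my $text = <$txt>) {
    my $number = <$num>;
    chomp $number;
    print $out "$number$text";
}
close $txt;
close $num;
close $out;

# step 6: re-sort; the numbers are now the key, restoring original order
system('sort', '-o', 'resorted.txt', 'renumbered.txt') == 0 or die "sort failed";

# step 7: strip the numbers back off
open $src, '<', 'resorted.txt' or die "resorted.txt: $!";
open $out, '>', 'out.txt'      or die "out.txt: $!";
while (my $line = <$src>) {
    print $out substr($line, $W);
}
close $src;
close $out;

one wrinkle the recipe glosses over: with the number glued straight onto the text, a line whose own tail happens to look like a padded number can in theory sort in between two copies of a shorter line and break the consecutive-dupes assumption. putting a separator byte that can't occur in the data (a tab, say) between text and number would shore that up.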