in reply to Removing repeated lines from file
here's a crude recipe for this situation, where memory is at a premium but, it seems, CPU time is less so...
1. step through the document and append a line number to the end of each line. zero-pad it to a known length so it's easy to strip later.
2. sort the file.
3. move the line numbers off the ends of the lines into another file, say numbers.txt, maintaining order (the numbers will be out of sequence, of course, having been shuffled by the sort).
4. step through the sorted file with a uniq-ish algorithm that removes consecutive dupes, *but* track the position (offset from the beginning of the file) of each line being deleted, and delete the corresponding line from numbers.txt. i.e. if the 5th line of text is a dupe and is removed, delete the 5th line of numbers.txt.
5. when you've gone through all the text, take numbers.txt and this time prepend its numbers to the lines in the text file, line for line.
6. re-sort the file. it should now be in its original order.
7. remove the line numbers.
it ain't pretty and it's slow and i/o bound, but it won't use much memory.
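for what it's worth, here is a minimal sketch of the recipe in Python. it's an illustration under assumptions, not a drop-in solution: the function name `dedupe_keep_order` and its `sort` parameter are made up for this example, and steps 3-5 are condensed by keeping each line number attached to its line as a tuple rather than shuffling the numbers through a separate numbers.txt. the default `sorted` works in memory, so it only shows the logic; to actually save memory you would pass in a disk-based sort instead.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_keep_order(lines, sort=sorted):
    """Drop repeated lines, keeping the first copy of each in original order.

    `sort` is a hypothetical hook for the sort step. The default sorts in
    memory; a low-memory version would plug in an external (on-disk) sort.
    """
    # Step 1: tag every line with its line number.
    tagged = ((text, number) for number, text in enumerate(lines))

    # Step 2: sort by text. Ties fall back to the line number, so the
    # earliest copy of each line comes first in its run of duplicates.
    by_text = sort(tagged)

    # Steps 3-5 (condensed): keep only the first entry of each run of
    # identical text; its line number travels with it.
    survivors = (next(group) for _, group in groupby(by_text, key=itemgetter(0)))

    # Step 6: re-sort by line number to restore the original order.
    in_order = sort((number, text) for text, number in survivors)

    # Step 7: strip the line numbers.
    return [text for _, text in in_order]
```

calling `dedupe_keep_order(["b", "a", "b", "c", "a"])` gives `["b", "a", "c"]`. an open file handle works as the `lines` argument too, since the function only iterates over it.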