After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much simpler.
The first one used set theory and recursion. It went like this:
Until dataset is 1 line
    Split dataset into two halves
    Take intersection of sets
    Store intersection in duplicate list
    Split each dataset into two datasets, and repeat
end
Open original dataset file
Until EOD
    read line
    compare to list of known duplicates
    if in that list
        if duplicate flag not marked
            emit line to output
            mark duplicate as emitted
        endif
    else
        emit line on output
    endif
end
I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
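For anyone who wants to try the halving idea, here is a minimal sketch in Python rather than Perl. It assumes the whole dataset fits in memory as a list of lines, and the names find_duplicates and emit_once are made up for illustration. Note that building a set for each half stores more than just the duplicates, so treat it as a demonstration of the idea, not a tuned implementation.

```python
def find_duplicates(lines):
    """Return the set of lines that occur more than once, by recursive halving."""
    if len(lines) <= 1:
        return set()
    mid = len(lines) // 2
    left, right = lines[:mid], lines[mid:]
    # A line present in both halves is a duplicate.
    dups = set(left) & set(right)
    # Duplicates confined to a single half are found by recursing into it.
    return dups | find_duplicates(left) | find_duplicates(right)


def emit_once(lines, dups):
    """Yield lines in original order, emitting each known duplicate only once."""
    emitted = set()
    for line in lines:
        if line in dups:
            if line not in emitted:
                emitted.add(line)
                yield line
        else:
            yield line


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a"]
    for line in emit_once(data, find_duplicates(data)):
        print(line)
```

The second function is the "read the original file again" pass from the pseudocode: lines that are not known duplicates go straight through, and known duplicates go through only the first time they are seen.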
Then I realized it should be much easier:
Sort a copy of the datafile
Open sorted copy
Until EOD
    Read line
    Compare to previous line
    If line == previous line
        if line not in duplicate table
            put line in duplicate table
        endif
    else
        previous line = line
    endif
end
Open original data file
Until EOD
    read line
    if line in duplicate table
        if duplicate not marked
            emit line on output
            mark duplicate line
        endif
    else
        emit line on output
    endif
end
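The sort-based version can be sketched the same way. Again this is Python, it assumes the lines fit in memory (the pseudocode sorts a copy of the file on disk instead), and the function names are hypothetical.

```python
def find_duplicates_sorted(lines):
    """Return the set of lines that appear more than once, using a sorted copy."""
    dups = set()
    previous = None
    for line in sorted(lines):  # sorted() returns a copy; the input is untouched
        if line == previous:
            # Equal neighbours in sorted order mean the line is repeated.
            dups.add(line)
        else:
            previous = line
    return dups


def dedupe_sorted(lines):
    """Return lines in original order with each duplicate kept only once."""
    dups = find_duplicates_sorted(lines)
    marked = set()
    output = []
    for line in lines:
        if line in dups:
            if line not in marked:
                output.append(line)
                marked.add(line)
        else:
            output.append(line)
    return output


if __name__ == "__main__":
    for line in dedupe_sorted(["pear", "apple", "pear", "fig", "apple"]):
        print(line)
```

The set of duplicates plays the role of the duplicate table, and the marked set is the "duplicate marked" flag from the second loop.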
Both of these have the advantage of only needing to store the duplicate lines. Both have the disadvantage of having to read through the input set multiple times.
Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).