Until dataset is 1 line
    Split dataset into two halves
    Take intersection of sets
    Store intersection in duplicate list
    Split each dataset into two datasets, and repeat
end

Open original dataset file
Until EOD
    read line
    compare to list of known duplicates
    if in that list
        if duplicate flag not marked
            emit line to output
            mark duplicate as emitted
        endif
    else
        emit line on output
    endif
end
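
The two passes above could be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the whole dataset fits in memory as a list of lines, and the names `find_duplicates` and `dedupe` are hypothetical, chosen only for this example. The first function builds the duplicate list by recursive halving and set intersection; the second replays the original lines and emits each duplicate only once.

```python
def find_duplicates(lines):
    """Pass 1: collect every line that occurs more than once.

    Any two equal lines end up in opposite halves at some level of
    the recursion, so the intersection taken at that level finds them.
    """
    duplicates = set()

    def recurse(chunk):
        # "Until dataset is 1 line"
        if len(chunk) <= 1:
            return
        mid = len(chunk) // 2
        left, right = chunk[:mid], chunk[mid:]
        # Take the intersection of the two halves and store it.
        duplicates.update(set(left) & set(right))
        # Split each half again and repeat.
        recurse(left)
        recurse(right)

    recurse(lines)
    return duplicates


def dedupe(lines):
    """Pass 2: emit lines in original order, each duplicate only once."""
    duplicates = find_duplicates(lines)
    emitted = set()  # duplicates that have already been written out
    output = []
    for line in lines:
        if line in duplicates:
            if line not in emitted:
                output.append(line)
                emitted.add(line)
        else:
            output.append(line)
    return output


if __name__ == "__main__":
    for line in dedupe(["a", "b", "a", "c", "b", "a"]):
        print(line)
```

For a file on disk, the list passed to `dedupe` could come from reading the file's lines, under the same assumption that the data fits in memory.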