After thinking about this for WAY too long, I came up with two answers: one kind of obscure, the other much simpler.

The first one uses set intersection and recursion. It goes like this:

    Start with the whole dataset; until each piece is 1 line
       Split dataset into two halves
       Take intersection of the two halves (as sets of lines)
       Store intersection in duplicate list
       Repeat on each half
    end
    Open original dataset file
    Until EOD
        read line
        if line is in list of known duplicates
           if duplicate not yet marked as emitted
              emit line on output
              mark duplicate as emitted
           endif
        else
           emit line on output
        endif
    end

I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
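
For what it's worth, here is a minimal in-memory sketch of the halving idea, written in Python only to keep it short. The function names `find_duplicates` and `emit_unique` are made up for illustration, and it assumes the whole dataset fits in a list, which a real solution for large files would not.

```python
def find_duplicates(lines):
    """Return the set of lines that occur more than once (recursive halving)."""
    if len(lines) <= 1:
        return set()
    mid = len(lines) // 2
    left, right = lines[:mid], lines[mid:]
    # A line present in both halves is a duplicate.
    duplicates = set(left) & set(right)
    # Duplicates confined to a single half are found by recursing into it.
    return duplicates | find_duplicates(left) | find_duplicates(right)


def emit_unique(lines, duplicates):
    """Second pass: keep input order, emit each duplicated line only once."""
    emitted = set()  # duplicates already written to the output
    output = []
    for line in lines:
        if line in duplicates:
            if line in emitted:
                continue
            emitted.add(line)
        output.append(line)
    return output
```

Calling `emit_unique(data, find_duplicates(data))` on `["a", "b", "a", "c", "b"]` gives `["a", "b", "c"]`.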

Then I realized it should be much easier:

    Sort a copy of the datafile
    Open sorted copy
    Until EOD
       read line
       compare to previous line
       if line == previous line
          if line not in duplicate table
             put line in duplicate table
          endif
       else
          previous line = line
       endif
    end
    Open original data file
    Until EOD
       read line
       if line in duplicate table
          if duplicate not marked
             emit line on output
             mark duplicate line
          endif
       else
          emit line on output
       endif
    end
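
A rough sketch of the sort-based version, again in Python and again in memory. The pseudocode sorts a copy of the file on disk; here `sorted()` on a list stands in for that step, and the names `find_duplicates_sorted` and `remove_repeats` are hypothetical.

```python
def find_duplicates_sorted(lines):
    """Find repeated lines by comparing neighbours in a sorted copy."""
    duplicates = set()
    previous = None  # no line read yet; never equal to a string
    for line in sorted(lines):
        if line == previous:
            duplicates.add(line)  # a set ignores repeated adds
        else:
            previous = line
    return duplicates


def remove_repeats(lines):
    """Emit lines in original order, keeping only the first copy of each."""
    duplicates = find_duplicates_sorted(lines)
    marked = set()  # duplicates that have already been emitted
    output = []
    for line in lines:
        if line in duplicates:
            if line in marked:
                continue
            marked.add(line)
        output.append(line)
    return output
```

Only the duplicate table and the marks are looked up during the second pass, which is the point of sorting first.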

Both of these have the advantage that the output pass only needs to keep the duplicate lines around, not every line it has seen. Both have the disadvantage of having to read through the input more than once.

Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).

In reply to Re: Removing repeated lines from file by husker
in thread Removing repeated lines from file by matth