PerlMonks
The MD5 hash suggestion got me thinking.
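For what it's worth, here is a minimal sketch of that digest idea in Perl (the `uniq_lines` helper name is mine, not from the thread): remember a fixed-size 16-byte MD5 digest per unique line instead of the line itself, at the cost of an astronomically small collision risk that true lossless compression would not have.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Return the input lines with duplicates removed, keeping first
# occurrences. Memory use is one 16-byte digest per unique line,
# regardless of how long the lines themselves are.
sub uniq_lines {
    my %seen;
    return grep { !$seen{ md5($_) }++ } @_;
}

# Streaming variant for huge files: print unique lines from STDIN
# without ever holding the full lines in memory.
# my %seen;
# while (my $line = <STDIN>) {
#     print $line unless $seen{ md5($line) }++;
# }
```

The trade-off versus compression: digests are fixed-size no matter what, but compression is exact, while a digest match is only overwhelmingly-probably a real duplicate.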
It seems like the two big obstacles are (1) the duplicate lines are not necessarily adjacent, and you cannot sort the file to make them so, and (2) there is too much data to hold in memory. What if we could get around obstacle 2? If we applied some lossless compression to your input, we could reduce its storage requirement. Since the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from its compressed image), compressing two unique lines must yield two unique compressed results. Depending on how much compression you can get, you may very well be able to process your input in memory.

OK, I guess it doesn't really solve the storage problem per se; it just sidesteps it. It's possible that even with compression, your input stream is simply too big.

In reply to Re: Removing repeated lines from file
by husker