PerlMonks
The MD5 hash suggestion got me thinking.
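For what it's worth, here is a minimal sketch of that digest idea in Perl (the `uniq_lines` helper name is mine, not from the thread): remember a fixed-size 16-byte MD5 digest per unique line instead of the line itself, at the cost of an astronomically small collision risk that true lossless compression would not have.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Return the input lines with duplicates removed, keeping first
# occurrences. Memory use is one 16-byte digest per unique line,
# regardless of how long the lines themselves are.
sub uniq_lines {
    my %seen;
    return grep { !$seen{ md5($_) }++ } @_;
}

# Streaming variant for huge files: print unique lines from STDIN
# without ever holding the full lines in memory.
# my %seen;
# while (my $line = <STDIN>) {
#     print $line unless $seen{ md5($line) }++;
# }
```

The trade-off versus compression: digests are fixed-size no matter what, but compression is exact, while a digest match is only overwhelmingly-probably a real duplicate.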
It seems like the two big obstacles are (1) the duplicate lines are not necessarily adjacent, and you cannot sort the file to make them so, and (2) there is too much data to hold in memory. What if we could get around obstacle 2? If we applied some lossless compression to your input, we could reduce its storage requirement. Since the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from its compressed image), compressing two unique lines must yield two unique compressed results. Depending on how much compression you can get, you may very well be able to process your input in memory.

OK, I guess it doesn't really solve the storage problem per se; it just sidesteps it. It's possible that even with compression, your input stream is simply too big.

In reply to Re: Removing repeated lines from file
by husker