Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
here's a ghetto recipe for this situation where memory is at a premium though, it seems, CPU / time is less so...

1. step thru the document and append a line number to the end of each line. zero pad it to known length for later removal.

2. sort the file.

3. remove the line endings to another file, say, numbers.txt ,maintaining order (they'll be out of order of course, having been shuffled with the sort.

4. step through the sorted file with a uniq-ish algorithm that'll remove consecutive dupes, *but* track the line numbers (as in offset from the begining) being deleted. Delete the corresponding line from the file from 3. i.e. if the 5'th line of text is a dupe and is removed, delete the 5th line in your numbers.txt.

5. when you've gone through all the text, take the numbers.txt file and this time prepend it to the lines in the text file, line for line.

6. resort the file. it should now be in it's original order.

7. remove the line numbers.

it ain't pretty and it's slow and i/o bound, but it won't use much memory.

In reply to Re: Removing repeated lines from file by Anonymous Monk
in thread Removing repeated lines from file by matth

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (6)
As of 2024-04-24 10:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found