PerlMonks  

Re: Removing repeated lines from file

by ant9000 (Monk)
on Jun 24, 2003 at 11:47 UTC [id://268461]


in reply to Removing repeated lines from file

If you can keep track of lines already read, it's trivial:
    %read = ();
    while (defined($_ = <FILE>)) {
        if (!defined($read{$_})) {
            print OUTFILE $_;
            $read{$_} = 1;
        }
    }
If that's too big for your memory to hold (is it, really?), you could instead store a unique signature for each line. What about an MD5 digest? That's 16 bytes per distinct line (32 characters if stored as a hex string), so it could be a good starting point. Beware: hashing is not free if you have billions of lines!

Replies are listed 'Best First'.
Re: Re: Removing repeated lines from file
by zby (Vicar) on Jun 24, 2003 at 12:08 UTC
    This is the only solution so far that solves the general problem of repeated lines, not just consecutive repeats ++. One nit: as first posted, the code used two hash names, %read_lines and %read, where a single hash was intended.

    A compressed one-liner (based on broquaint's):

    perl -ni -e 'print unless $seen{$_}; $seen{$_} = 1' your_file

      Here is another variation:
      perl -i.bak -ne 'print unless $h{$_}++' file

      However, since these solutions keep every unique line in memory, they can still consume a lot of memory on very large files.

      --
      John.

      Ooops... I should never write untested code ;-)
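None of the replies mention it, but when even a hash of digests is too big for memory, the usual fallback is an external sort: sort -u de-duplicates via temporary files on disk, at the price of losing the original line order. A minimal sketch, with made-up filenames:

```shell
# write some sample data with duplicates (illustrative only)
printf 'foo\nbar\nfoo\nbaz\nbar\n' > /tmp/big.txt
# sort -u spills to temporary files as needed, so memory use stays
# bounded; the output comes back sorted, not in first-seen order
sort -u /tmp/big.txt > /tmp/uniq.txt
cat /tmp/uniq.txt   # prints bar, baz, foo (sorted)
```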
Re: Removing repeated lines from file
by Abigail-II (Bishop) on Jun 24, 2003 at 12:48 UTC
    Well, it was already given that storing it all in an array took too much memory. Given that a hash has even more overhead than an array, your solution only wins if there are many duplicates. And while an MD5 digest may only be 16 bytes, a single Perl scalar already takes at least 24 bytes, plus a couple of handfuls of bytes for being a hash entry, and the overhead of the hash itself.

    Abigail

Re: Re: Removing repeated lines from file
by husker (Chaplain) on Jun 24, 2003 at 13:57 UTC
    Although it's unlikely, it is possible that two distinct lines could generate the same MD5 hash. Then you'd erroneously toss one of the lines, thinking it was a duplicate of the first.
