PerlMonks
Re: Re: Re: Removing repeated lines from file

by zengargoyle (Deacon)
on Jun 25, 2003 at 22:24 UTC ( [id://269042] )


in reply to Re: Re: Removing repeated lines from file
in thread Removing repeated lines from file

how clustered are your lines? here are a couple of tweakable previous-line recognizers.

#
# remember everything, probably uses too much memory.
#
{
    my %seen;
    sub seen_complete {
        return 1 if exists $seen{$_[0]};
        $seen{$_[0]} = ();
        return 0;
    }
}

#
# remember last N lines only.
#
{
    my %seen;
    my $remember = 200;
    my @memory;
    sub seen_fixed {
        return 1 if exists $seen{$_[0]};
        delete $seen{ shift @memory } if @memory > $remember;
        push @memory, $_[0];
        $seen{$_[0]} = ();
        return 0;
    }
}

#
# remember N buckets of lines with X lines per bucket.
#
{
    my @bucket = ( {} );
    my $numbuckets = 2;
    my $bucketsize = 200;
    sub seen_bucket {
        foreach (@bucket) {
            return 1 if exists $_->{$_[0]};
        }
        if (keys %{ $bucket[-1] } >= $bucketsize) {
            shift @bucket if @bucket >= $numbuckets;
            push @bucket, {};
        }
        $bucket[-1]->{$_[0]} = ();
        return 0;
    }
}

i only tested the last one, and only sorta tested it at that.

while (<>) {
    print unless seen_bucket($_);
}
__END__
Ten sets of 1..400 should get uniq'd to 1..400

$ perl -le 'for(1..10){for(1..400){print}}' | perl dup.pl | wc -l
400

Ten sets of 1..401 should get uniq'd to (1..401) x 10,
because 2 buckets of 200 lines hold up to 400

$ perl -le 'for(1..10){for(1..401){print}}' | perl dup.pl | wc -l
4010
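the fixed-window variant can be sorta-tested the same way. a self-contained sketch (seen_fixed copied from above, window of 200): ten repeats of 1..100 should collapse to a single 1..100, since the window is wider than the repeat distance.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fixed-window variant from above: remember only the last N distinct lines.
{
    my %seen;
    my $remember = 200;
    my @memory;
    sub seen_fixed {
        return 1 if exists $seen{$_[0]};
        delete $seen{ shift @memory } if @memory > $remember;
        push @memory, $_[0];
        $seen{$_[0]} = ();
        return 0;
    }
}

# ten sets of 1..100: every line after the first pass is still inside the
# 200-line window, so only the first 100 survive.
my @out;
for my $set (1 .. 10) {
    for my $n (1 .. 100) {
        push @out, $n unless seen_fixed($n);
    }
}
print scalar(@out), "\n";    # expect 100
```

repeat distances wider than the window (e.g. sets of 1..201) would slip past it, same as the 1..401 case slips past the two 200-line buckets.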
