One reason for not using the sort -u or uniq commands is that you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

The problem with uniqing huge files in Perl is the memory footprint of the hash required to remember all the records seen so far. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because a duplicate in one chunk may have its original in another. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.
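
For contrast, here is a minimal sketch of the usual single-pass approach (the field-splitting mirrors the script further down; adjust the key to your data). The %seen hash must hold every distinct key simultaneously, which is exactly what blows the memory budget on a multi-GB file:

#! perl -w
use strict;

## Naive one-pass uniq: keeps original order, but %seen grows with
## every distinct key in the file, so a huge file exhausts memory.
my %seen;
while( <> ) {
    my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
    print unless $seen{ $key }++;
}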

One way to partition it is to make multiple passes, considering only a subset of the records during each pass. A simple partitioning mechanism is to use the first character (or first n characters) of each record. For example, if all your records start with a digit, 0-9, you can make 10 passes and only consider the records that start with that pass's digit. This reduces the memory requirement for the hash to roughly 1/10th.

If your records start with alpha characters, you get a natural split into 26 passes.

If a single-character partition is still too large, use the first two digits/characters for a split into 100 (digits) or 676 (letters) passes.

If that produces more passes than you need, you can go the other way and group characters: handle lines starting with 'A' or 'B' in the first pass, 'C' or 'D' in the second, and so on.
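
To make the pass counts concrete, here is a small sketch (my own illustration, not part of the script below) that builds the per-pass filter patterns. Two leading characters give 676 passes; pairing letters into classes such as [AB] halves the 26 passes to 13:

#! perl -w
use strict;

## 676 two-character prefixes: AA, AB, ... ZZ
my @two_char = map { my $a = $_; map { qr/^$a$_/ } 'A' .. 'Z' } 'A' .. 'Z';

## 13 paired passes: lines starting with A or B, then C or D, ...
my @paired = map { qr/^[$_]/ } qw( AB CD EF GH IJ KL MN OP QR ST UV WX YZ );

printf "%d two-char passes, %d paired passes\n",
    scalar @two_char, scalar @paired;

## In the per-pass loop you would then test: next unless $_ =~ $paired[ $pass ];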

You record the file offset of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you retain), then sort those offsets numerically and make a final sequential pass over the file: for each line, compare its start offset against the first offset in the discards array and output the line only if it does not match. Once you find a discard, you shift that offset off the discards array and continue.
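
As a toy demonstration of the offset bookkeeping (my own made-up five-line dataset, with the key reduced to just the first field), the following records the start offset of each duplicate with tell, then copies the file while skipping those offsets:

#! perl -w
use strict;

## Toy data: the A and B keys each appear twice
my $file = 'toy.txt';
open my $fh, '>', $file or die $!;
print $fh "A|1\nB|2\nA|9\nC|3\nB|8\n";
close $fh;

open $fh, '<', $file or die $!;
my %seen;
my @discards;
my $offset = 0;
while( <$fh> ) {
    my( $key ) = split /\|/;                  ## key = first field here
    push @discards, $offset if $seen{ $key }++;
    $offset = tell $fh;                       ## start of the next line
}
@discards = sort { $a <=> $b } @discards;     ## here: ( 8, 16 )

seek $fh, 0, 0;
$offset = 0;
while( my $line = <$fh> ) {
    if( @discards and $offset == $discards[ 0 ] ) {
        shift @discards;                      ## skip the duplicate line
    }
    else {
        print $line;                          ## prints A|1, B|2, C|3
    }
    $offset = tell $fh;
}
close $fh;
unlink $file;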

The following code assumes the records can start with any alphanumeric character, and so will make a total of 62 passes. Tailor it to suit your data:

#! perl -sw
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

my @discards;   ## offsets of lines to discard
my %hash;       ## Remembers records seen during each pass

## IMPORTANT: Tailor the following to your data!
## If all your records start with a digit (0-9),
## then as shown this would make 52 redundant passes!

## During each pass, only consider lines that start with this char
for my $firstChar ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
    warn "Pass $firstChar\n";
    my $offset = 0;                 ## offset starts at zero for each pass
    while( <FILE> ) {
        if( m[^$firstChar] ) {      ## If the line starts with the pass's char
            ## Form the key from the first 7 fields
            my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];

            if( not exists $hash{ $key } ) {
                $hash{ $key } = undef;      ## First sighting: remember it
            }
            else {
                push @discards, $offset;    ## Duplicate: record the start
            }                               ## offset of the line just read
        }
        $offset = tell FILE;        ## Start offset of the next line
    }
    undef %hash;                    ## Empty the hash
    seek FILE, 0, 0;                ## And reset to the start after each pass
}

printf STDERR "Sorting %d discards\n", scalar @discards;

## Sort the list of discard offsets ascending
@discards = sort { $a <=> $b } @discards;

## Final pass: copy the file, skipping any line whose start offset
## is the next entry in @discards
my $offset = 0;
while( my $line = <FILE> ) {
    if( @discards and $offset == $discards[ 0 ] ) {
        shift @discards;            ## This line is a duplicate; drop it
    }
    else {
        print $line;                ## Output the good line (for redirection)
    }
    $offset = tell FILE;            ## Start offset of the next line
}

close FILE;

Usage: HugeUniq.pl hugeFile > uniqHugeFile

Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Huge files manipulation by BrowserUk
in thread Huge files manipulation by klashxx
