Re: Huge files manipulation

by BrowserUk (Patriarch)
on Nov 10, 2008 at 14:33 UTC


in reply to Huge files manipulation

One reason for not using sort -u or uniq is that you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

The problem with uniqing huge files in Perl is the memory footprint of the hash required to remember all the records seen. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because a duplicate may appear anywhere in the file, so every record has to be checked against the whole dataset. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.
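For contrast, here is a minimal sketch of the naive single-pass approach (the pipe-delimited, first-7-fields key matches the script further down; the rest is illustrative). The %seen hash needs one entry per unique key, and on a multi-GB file that hash alone is what exhausts memory:

#! perl -w
use strict;

## Naive single-pass uniq: correct, but %seen grows by one entry
## per unique key, so on huge files it exhausts memory.
my %seen;
while( <> ) {
    my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
    print unless $seen{ $key }++;
}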

One way of doing this is to make multiple passes, and only consider a subset of the records during each pass. A simple partitioning mechanism is to use the first character (or first n characters) of each record. For example, if all your records start with a digit, 0-9, then you can make 10 passes and only consider the records that start with that pass's digit. This reduces the memory requirement for the hash to 1/10th.

If your records start with alpha characters, you get a natural split into 26 passes.

If a single-character partition is still too large, use the first two digits/characters for a split into 100/676 passes.

If that gives more passes than needed, you can group them: choose 'AB' for the first pass, 'CD' for the second, and so on.
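As an illustrative sketch (the group boundaries here are examples, not tuned to any particular dataset), the pass loop can iterate over character classes instead of single characters. This version only counts the records each pass would touch, which doubles as a quick check that your chosen partitions are reasonably balanced:

#! perl -w
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

## Grouped passes: 5 character-class passes instead of 26
## single-character ones. Tailor the boundaries to your data.
for my $group ( 'a-e', 'f-j', 'k-o', 'p-t', 'u-z' ) {
    my $count = 0;
    while( <FILE> ) {
        ++$count if m/^[$group]/;    ## e.g. /^[a-e]/ on the first pass
    }
    warn "Pass [$group]: $count records\n";
    seek FILE, 0, 0;    ## rewind for the next pass
}
close FILE;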

You record the file offset of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you are going to retain), then sort those offsets numerically and make a final sequential pass, checking each line's offset against the first offset in the discards array and only outputting the line if it does not match. Once you've found a discard, you shift that offset off the discards array and continue.

The following code assumes the records can start with any alphanumeric character, and so makes a total of 62 passes. Tailor it to suit your data:

#! perl -sw
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

my @discards;   ## offsets of lines to discard
my %hash;       ## remembers the records seen during each pass

## IMPORTANT: Tailor the following to your data!
## If all your records start with a digit (0-9),
## then as shown this would make 52 redundant passes!

## During each pass, only consider lines that start with this char
for my $firstChar ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
    warn "Pass $firstChar\n";
    my $offset = 0;    ## offset starts at zero for each pass
    while( <FILE> ) {
        if( m[^$firstChar] ) {    ## if the line starts with this pass's char
            ## Form the key from the first 7 fields
            my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
            ## If the key doesn't exist yet, remember it
            if( not exists $hash{ $key } ) {
                $hash{ $key } = undef;
            }
            ## If it does, add its offset to the discards array
            else {
                push @discards, $offset;    ## the offset of the last line read
            }
        }
        $offset = tell FILE;    ## remember the offset of the next line
    }
    undef %hash;        ## empty the hash
    seek FILE, 0, 0;    ## and reset to the start after each pass
}

printf STDERR "Sorting %d discards\n", scalar @discards;

## Sort the list of discard offsets ascending
@discards = sort { $a <=> $b } @discards;

## Final pass. (The first line can never be a discard, since it is
## always the first occurrence of its key.)
while( my $line = <FILE> ) {
    ## If reading that line leaves the file positioned at the next
    ## discard offset, the line that follows is a duplicate:
    ## read it into a throwaway variable to skip it.
    while( @discards and tell( FILE ) == $discards[ 0 ] ) {
        my $dup = <FILE>;    ## skip the duplicate line
        shift @discards;     ## and retire that offset
    }
    ## now output the good line (for redirection)
    print $line;
}

close FILE;

Usage: HugeUniq.pl hugeFile > uniqHugeFile

Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Huge files manipulation
by JavaFan (Canon) on Nov 10, 2008 at 15:50 UTC
    One reason for not using sort -u or uniq is that you wish to retain the original ordering (minus the discards).

    As said earlier in the thread, if you want to keep the ordering, just add the line number, sort, uniquify, sort on the line number, and cut. Or, as a one-liner:

    nl -s '|' file_with_dups | sort -k 2,8 -t '|' -u | sort -nb | cut -d '|' -f 2- > file_without_dups
    This is a standard trick.
      This is a standard trick.

      Yes. And it requires 4 passes of the dataset, including two full sorts.

      And it takes 5 or 6 times as long as the script above when I use 5 passes ('a-e', 'f-j', 'k-o', 'p-t', 'u-z') on the same dataset.


        Yes. And it requires 4 passes of the dataset, including two full sorts.
        The number of passes isn't really relevant; the sorts maybe are. But the main reason I posted the one-liner is that you appeared to suggest that the sort solution wouldn't work if you wanted to keep the order.
        I use 5 passes ('a-e', 'f-j', 'k-o', 'p-t', 'u-z') on the same dataset.
        And that shows the weakness of your approach: it requires a priori knowledge of the keys. A bad choice of partition may put almost all the keys in the same pass. You'd need to tune your program for different datasets.
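        A hedged sketch of how that tuning could be automated (this pre-scan appears in neither post; it is purely illustrative): tally the first characters in one cheap pass, then choose the partitions from the actual distribution rather than a priori:

        #! perl -w
        use strict;

        ## Illustrative pre-scan: count first characters so the pass
        ## groups can be chosen from the data instead of guessed.
        open FILE, '<', $ARGV[ 0 ] or die $!;

        my %freq;
        $freq{ substr( $_, 0, 1 ) }++ while <FILE>;
        close FILE;

        ## Print the distribution, busiest first
        printf "%s : %d\n", $_, $freq{ $_ }
            for sort { $freq{ $b } <=> $freq{ $a } } keys %freq;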
