Re: Huge files manipulation

by BrowserUk (Patriarch)
on Nov 10, 2008 at 14:33 UTC


in reply to Huge files manipulation

One reason for not using sort -u or uniq is that you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

The problem with uniqing huge files in Perl is the memory footprint of the hash required to remember all the records seen. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because a duplicate may appear anywhere in the file, so every record has to be checked against the whole dataset. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.
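For contrast, here is a minimal sketch of the naive single-pass approach (the pipe-delimited, first-7-fields key matches the script further down; the rest is illustrative). The %seen hash needs one entry per unique key, and on a multi-GB file that hash alone is what exhausts memory:

#! perl -w
use strict;

## Naive single-pass uniq: correct, but %seen grows by one entry
## per unique key, so on huge files it exhausts memory.
my %seen;
while( <> ) {
    my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
    print unless $seen{ $key }++;
}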

One way of doing this is to make multiple passes, and only consider a subset of the records during each pass. A simple partitioning mechanism is to use the first character (or first n characters) of each record. For example, if all your records start with a digit, 0-9, then you can make 10 passes and only consider the records that start with that pass's digit. This reduces the memory requirement for the hash to 1/10th.

If your records start with alpha characters, you get a natural split into 26 passes.

If a single-character partition is still too large, use the first two digits/characters for a split into 100/676 passes.

If that gives more passes than needed, you can group them: choose 'AB' for the first pass, 'CD' for the second, and so on.
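As an illustrative sketch (the group boundaries here are examples, not tuned to any particular dataset), the pass loop can iterate over character classes instead of single characters. This version only counts the records each pass would touch, which doubles as a quick check that your chosen partitions are reasonably balanced:

#! perl -w
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

## Grouped passes: 5 character-class passes instead of 26
## single-character ones. Tailor the boundaries to your data.
for my $group ( 'a-e', 'f-j', 'k-o', 'p-t', 'u-z' ) {
    my $count = 0;
    while( <FILE> ) {
        ++$count if m/^[$group]/;    ## e.g. /^[a-e]/ on the first pass
    }
    warn "Pass [$group]: $count records\n";
    seek FILE, 0, 0;    ## rewind for the next pass
}
close FILE;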

You record the file offset of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you are going to retain), then sort those offsets numerically and make a final sequential pass, checking each line's offset against the first offset in the discards array and only outputting the line if it does not match. Once you've found a discard, you shift that offset off the discards array and continue.

The following code assumes the records can start with any alphanumeric character, and so makes a total of 62 passes. Tailor it to suit your data:

#! perl -sw
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

my @discards;   ## offsets of lines to discard
my %hash;       ## remembers the records seen during each pass

## IMPORTANT: Tailor the following to your data!
## If all your records start with a digit (0-9),
## then as shown this would make 52 redundant passes!

## During each pass, only consider lines that start with this char
for my $firstChar ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
    warn "Pass $firstChar\n";
    my $offset = 0;    ## offset starts at zero for each pass
    while( <FILE> ) {
        if( m[^$firstChar] ) {    ## if the line starts with this pass's char
            ## Form the key from the first 7 fields
            my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
            ## If the key doesn't exist yet, remember it
            if( not exists $hash{ $key } ) {
                $hash{ $key } = undef;
            }
            ## If it does, add its offset to the discards array
            else {
                push @discards, $offset;    ## the offset of the last line read
            }
        }
        $offset = tell FILE;    ## remember the offset of the next line
    }
    undef %hash;        ## empty the hash
    seek FILE, 0, 0;    ## and reset to the start after each pass
}

printf STDERR "Sorting %d discards\n", scalar @discards;

## Sort the list of discard offsets ascending
@discards = sort { $a <=> $b } @discards;

## Final pass. (The first line can never be a discard, since it is
## always the first occurrence of its key.)
while( my $line = <FILE> ) {
    ## If reading that line leaves the file positioned at the next
    ## discard offset, the line that follows is a duplicate:
    ## read it into a throwaway variable to skip it.
    while( @discards and tell( FILE ) == $discards[ 0 ] ) {
        my $dup = <FILE>;    ## skip the duplicate line
        shift @discards;     ## and retire that offset
    }
    ## now output the good line (for redirection)
    print $line;
}

close FILE;

Usage: HugeUniq.pl hugeFile > uniqHugeFile

Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Huge files manipulation
by JavaFan (Canon) on Nov 10, 2008 at 15:50 UTC
    One reason for not using sort -u or uniq is that you wish to retain the original ordering (minus the discards).

    As said earlier in the thread, if you want to keep the ordering, just add the line number, sort, uniquify, sort on the line number, and cut. Or, as a one-liner:

    nl -s '|' file_with_dups | sort -k 2,8 -t '|' -u | sort -nb | cut -d '|' -f 2- > file_without_dups
    This is a standard trick.
      This is a standard trick.

      Yes. And it requires 4 passes of the dataset, including two full sorts.

      And it takes 5 or 6 times as long as the script above when I use 5 passes ('a-e', 'f-j', 'k-o', 'p-t', 'u-z') on the same dataset.


        Yes. And it requires 4 passes of the dataset, including two full sorts.
        The number of passes isn't really relevant; the sorts maybe are. But the main reason I posted the one-liner is that you appeared to suggest that the sort solution wouldn't work if you wanted to keep the order.
        I use 5 passes ('a-e', 'f-j', 'k-o', 'p-t', 'u-z') on the same dataset.
        And that shows the weakness of your approach: it requires a priori knowledge of the keys. A bad choice of partition may put almost all the keys in the same pass. You'd need to tune your program for different datasets.
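        A hedged sketch of how that tuning could be automated (this pre-scan appears in neither post; it is purely illustrative): tally the first characters in one cheap pass, then choose the partitions from the actual distribution rather than a priori:

        #! perl -w
        use strict;

        ## Illustrative pre-scan: count first characters so the pass
        ## groups can be chosen from the data instead of guessed.
        open FILE, '<', $ARGV[ 0 ] or die $!;

        my %freq;
        $freq{ substr( $_, 0, 1 ) }++ while <FILE>;
        close FILE;

        ## Print the distribution, busiest first
        printf "%s : %d\n", $_, $freq{ $_ }
            for sort { $freq{ $b } <=> $freq{ $a } } keys %freq;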
