One reason for not using the sort -u or uniq commands is that you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

The problem with uniqing huge files in Perl is the memory footprint of the hash required to remember all the records seen so far. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because a duplicate in one chunk may have its original in another. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.
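
For contrast, here is a minimal sketch of the usual single-pass approach (the field-splitting mirrors the script further down; adjust the key to your data). The %seen hash must hold every distinct key simultaneously, which is exactly what blows the memory budget on a multi-GB file:

#! perl -w
use strict;

## Naive one-pass uniq: keeps original order, but %seen grows with
## every distinct key in the file, so a huge file exhausts memory.
my %seen;
while( <> ) {
    my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];
    print unless $seen{ $key }++;
}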

One way to partition it is to make multiple passes, considering only a subset of the records during each pass. A simple partitioning mechanism is to use the first character (or first n characters) of each record. For example, if all your records start with a digit, 0-9, you can make 10 passes and only consider the records that start with that pass's digit. This reduces the memory requirement for the hash to roughly 1/10th.

If your records start with alpha characters, you get a natural split into 26 passes.

If a single-character partition is still too large, use the first two digits/characters for a split into 100 (digits) or 676 (letters) passes.

If that produces more passes than you need, you can go the other way and group characters: handle lines starting with 'A' or 'B' in the first pass, 'C' or 'D' in the second, and so on.
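
To make the pass counts concrete, here is a small sketch (my own illustration, not part of the script below) that builds the per-pass filter patterns. Two leading characters give 676 passes; pairing letters into classes such as [AB] halves the 26 passes to 13:

#! perl -w
use strict;

## 676 two-character prefixes: AA, AB, ... ZZ
my @two_char = map { my $a = $_; map { qr/^$a$_/ } 'A' .. 'Z' } 'A' .. 'Z';

## 13 paired passes: lines starting with A or B, then C or D, ...
my @paired = map { qr/^[$_]/ } qw( AB CD EF GH IJ KL MN OP QR ST UV WX YZ );

printf "%d two-char passes, %d paired passes\n",
    scalar @two_char, scalar @paired;

## In the per-pass loop you would then test: next unless $_ =~ $paired[ $pass ];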

You record the file offset of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you retain), then sort those offsets numerically and make a final sequential pass over the file: for each line, compare its start offset against the first offset in the discards array and output the line only if it does not match. Once you find a discard, you shift that offset off the discards array and continue.
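
As a toy demonstration of the offset bookkeeping (my own made-up five-line dataset, with the key reduced to just the first field), the following records the start offset of each duplicate with tell, then copies the file while skipping those offsets:

#! perl -w
use strict;

## Toy data: the A and B keys each appear twice
my $file = 'toy.txt';
open my $fh, '>', $file or die $!;
print $fh "A|1\nB|2\nA|9\nC|3\nB|8\n";
close $fh;

open $fh, '<', $file or die $!;
my %seen;
my @discards;
my $offset = 0;
while( <$fh> ) {
    my( $key ) = split /\|/;                  ## key = first field here
    push @discards, $offset if $seen{ $key }++;
    $offset = tell $fh;                       ## start of the next line
}
@discards = sort { $a <=> $b } @discards;     ## here: ( 8, 16 )

seek $fh, 0, 0;
$offset = 0;
while( my $line = <$fh> ) {
    if( @discards and $offset == $discards[ 0 ] ) {
        shift @discards;                      ## skip the duplicate line
    }
    else {
        print $line;                          ## prints A|1, B|2, C|3
    }
    $offset = tell $fh;
}
close $fh;
unlink $file;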

The following code assumes the records can start with any alphanumeric character, and so will make a total of 62 passes. Tailor it to suit your data:

#! perl -sw
use strict;

open FILE, '<', $ARGV[ 0 ] or die $!;

my @discards;   ## offsets of lines to discard
my %hash;       ## Remembers records seen during each pass

## IMPORTANT: Tailor the following to your data!
## If all your records start with a digit (0-9),
## then as shown this would make 52 redundant passes!

## During each pass, only consider lines that start with this char
for my $firstChar ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
    warn "Pass $firstChar\n";
    my $offset = 0;                 ## offset starts at zero for each pass
    while( <FILE> ) {
        if( m[^$firstChar] ) {      ## If the line starts with the pass's char
            ## Form the key from the first 7 fields
            my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];

            if( not exists $hash{ $key } ) {
                $hash{ $key } = undef;      ## First sighting: remember it
            }
            else {
                push @discards, $offset;    ## Duplicate: record the start
            }                               ## offset of the line just read
        }
        $offset = tell FILE;        ## Start offset of the next line
    }
    undef %hash;                    ## Empty the hash
    seek FILE, 0, 0;                ## And reset to the start after each pass
}

printf STDERR "Sorting %d discards\n", scalar @discards;

## Sort the list of discard offsets ascending
@discards = sort { $a <=> $b } @discards;

## Final pass: copy the file, skipping any line whose start offset
## is the next entry in @discards
my $offset = 0;
while( my $line = <FILE> ) {
    if( @discards and $offset == $discards[ 0 ] ) {
        shift @discards;            ## This line is a duplicate; drop it
    }
    else {
        print $line;                ## Output the good line (for redirection)
    }
    $offset = tell FILE;            ## Start offset of the next line
}

close FILE;

Usage: HugeUniq.pl hugeFile > uniqHugeFile

Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Huge files manipulation by BrowserUk
in thread Huge files manipulation by klashxx
