Re: Huge files manipulation by BrowserUk (Patriarch)
on Nov 10, 2008 at 14:33 UTC ( [id://722662] )
One reason for not using the sort -u or uniq commands is that you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

The problem with uniqing huge files with Perl is the memory footprint of the hash required to remember all the records. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because you need to process the whole dataset. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.

One way of doing this is to make multiple passes, and only consider some subset of the records during each pass. A simple partitioning mechanism is to use the first character (or n characters) of each record. For example, if all your records start with a digit, 0-9, then you can make 10 passes and only consider those that start with the given digit during each pass. This reduces the memory requirement for the hash to 1/10th. If your records start with alpha characters, then you get a natural split into 26 passes. If a single-character partition is still too large, use the first two digits/characters for a split into 100/676 passes. If that gives more passes than needed, you can choose 'AB' for the first pass, 'CD' for the second, and so on.

You record the file offsets of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you are going to retain), then sort those offsets numerically and make a final sequential pass, checking each line's offset against the first offset in the discards array and only outputting the line if it does not match. Once you've found a discard, you shift that offset off the discards array and continue.

The following code assumes the records can start with any alpha-numeric character, and so will make a total of 63 passes. Tailor it to suit your data:
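(The original HugeUniq.pl listing isn't reproduced here; what follows is only a minimal sketch of the same multi-pass idea. It assumes a partition set of 0-9, A-Z, a-z (62 single-character passes), takes the file name on the command line, and writes the uniq'd output to stdout.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift or die "Usage: $0 hugeFile > uniqHugeFile\n";

    my @discards;    # byte offsets of duplicate lines to drop

    ## One pass per leading character: only lines starting with that
    ## character are remembered, so the hash holds a fraction of the records.
    for my $prefix ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
        my %seen;
        open my $in, '<', $file or die "open '$file': $!";
        while ( 1 ) {
            my $offset = tell $in;
            my $line   = <$in>;
            last unless defined $line;
            next unless substr( $line, 0, 1 ) eq $prefix;
            push @discards, $offset if $seen{ $line }++;
        }
        close $in;
    }

    ## Final sequential pass: output every line whose offset isn't a discard.
    @discards = sort { $a <=> $b } @discards;

    open my $in, '<', $file or die "open '$file': $!";
    while ( 1 ) {
        my $offset = tell $in;
        my $line   = <$in>;
        last unless defined $line;
        if ( @discards and $offset == $discards[0] ) {
            shift @discards;    # duplicate; skip it
            next;
        }
        print $line;
    }
    close $in;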
Usage: HugeUniq.pl hugeFile > uniqHugeFile

Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.