Huge files manipulation

klashxx has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Huge files manipulation by BrowserUk (Patriarch) on Nov 10, 2008 at 14:33 UTC
One reason for not using `sort -u` or `uniq` commands is if you wish to retain the original ordering (minus the discards). If that's the case, this might work for you. The problem with uniqing huge files with Perl, is the memory footprint of the hash required to remember all the records. And you cannot partition the dataset by record number (first N; next N; etc.), unless the records are sorted, because you need to process the whole dataset. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work but without loosing the original ordering. One way of doing this is to make multiple passes, and only consider some subset of the records during each pass. A simple partitioning mechanism is to use the first character (or n characters) of each record. For example, if all your records start with a digit, 0-9, then you can make 10 passes and only consider those that start with each digit during each pass. This reduces the memory requirement for the hash to 1/10th. If your record start with alpha characters, then you get a natural split into 26 passes. If the single character partition is still too large, use the first two digits/characters for a split into 100/676 passes. If the numbers of passes is more than needed, the you can choose 'AB' for the first pass and 'CD' for the second. And so on. You record the file offsets of each line that needs to be discarded, (on the basis that you are likely to be discarding fewer records than you are going to retain), and then sort these offsets numerically, and make a final sequential pass, checking the offset against the first offset in the discards array and only output it if it does not match. Once you found a discard, you shift that offset off the discards array and continue. The following code assumes the records can start with any alpha-numeric character, and so will make a total of 63 passes. Tailor it to suit your data: #! perl -sw use strict; open FILE, '<', $ARGV[ 0 ] or die $!; my @discards; ## offsets of lines to dicard my %hash; ## Remembers records seen during each pass ## IMPORTANT: Tailor the follwing to your data! ## If all your records start with a digit (0-9) ## As shown, this would make 52 redundant passes! ## During each pass, only consider lines that start with this char for my $firstChar ( 0..9, 'A'..'Z', 'a'..'z' ) { warn "Pass $firstChar\n"; my $offset = 0; ##offset starts a zero for each pass while( <FILE> ) { if( m[^$firstChar] ) { ## If the line starts with the passes + char ## Form the key from the first 7 fields my $key = join'\|', ( split '\\|', $_, 8 )[ 0 .. 6 ]; ## If doesn't exist yet, remember it if( not exists $hash{ $key } ) { $hash{ $key } = undef; } ## if it does add it's offset to the discards array else { push @discards, $offset; ## The offset for the last li +ne read } } $offset = tell FILE; ## Remember this offset } undef %hash; ## Empty the hash seek FILE, 0, 0; ## And reset to the start after each pass } printf STDERR "Sorting %d discards\n", scalar @discards; ## Sort the list of discard offsets ascending @discards = sort { $a <=> $b } @discards; ## Final pass while( my $line = <FILE> ) { ## if the line we just read leaves the offset equal ## to the next discard, while( @discards and tell FILE == $discards[ 0] ) { $line = <FILE>; ## discard the line and get the next shift @discards; ## And that offset } ## now output the good line (for redirection) print $line; } close FILE; [download] Usage: `HugeUniq.pl hugeFile > uniqHugeFile` Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: Huge files manipulation by JavaFan (Canon) on Nov 10, 2008 at 15:50 UTC
One reason for not using sort -u or uniq commands is if you wish to retain the original ordering (minus the discards). As said earlier in the thread, if you want to keep ordering, just add the line number, sort, uniquify, sort on the line number and cut. Or, as a one liner: `nl -s '\|' file_with_dups \| sort -k 2,8 -t '\|' -u \| sort -nb \| cut -d ' +\|' -f 2- > file_without_dups` [download] This is a standard trick.	[reply] [d/l]
Re^3: Huge files manipulation by BrowserUk (Patriarch) on Nov 10, 2008 at 16:05 UTC
This is a standard trick. Yes. And it requires 4 passes of the dataset, including two full sorts. And takes 5 or 6 times as long as the script above, if I use 5 passes ('a-e','f-j', 'k-o', 'p-t', 'u-z') on the same dataset. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^4: Huge files manipulation by JavaFan (Canon) on Nov 10, 2008 at 16:48 UTC
Re^5: Huge files manipulation by BrowserUk (Patriarch) on Nov 10, 2008 at 17:36 UTC
Re: Huge files manipulation by johngg (Canon) on Nov 10, 2008 at 14:09 UTC
If you want the output file to contain only the first instance of each key in the order found in the input file you could try processing the input file line by line. The script keeps track of the keys encountered in the `%seen` hash and only prints a record to the output file if it hasn't been seen before. If there are so many unique keys that this hash starts causing resource problems you could tie it to a disk-based DBM such as Berkeley DB or GDBM. Given the input in your OP, this code `use strict; use warnings; my $inFile = q{spw722634.in}; open my $inFH, q{<}, $inFile or die qq{open: < $inFile: $!\n}; my $outFile = q{spw722634.out}; open my $outFH, q{>}, $outFile or die qq{open: > $outFile: $!\n}; my %seen = (); while( <$inFH> ) { my $key = join q{}, ( split m{\\|}, $_, 8 )[ 0 .. 6 ]; print $outFH $_ unless $seen{ $key } ++; } close $inFH or die qq{close: < $inFile: $!\n}; close $outFH or die qq{close: > $outFile: $!\n};` [download] produces an output file with these records `30xx\|000009925000194653\|00000000000000\|20081031\|02510\|00000005445363\|0 +1\|F\|0207\|00\|\|\|+0005655,00\|\|\|+0000000000000,00 30xx\|4150010003502043\|CARDS\|20081031\|MP415001\|00000024265698\|01\|F\|1804 +\|00\|\|\|+0000000000000,00\|\|\|+0000000000000,00` [download] I hope this is the sort of solution you are aiming for and that you find this of use. Cheers, JohnGG	[reply] [d/l] [select]
Re: Huge files manipulation by JavaFan (Canon) on Nov 10, 2008 at 12:18 UTC
I find it very surprising that sort end with an out of memory, as sort dates from the time memory was tiny compared to current times, and back then it was able to sort huge files - sort will use temporary files to avoid having to use more memory than the OS is able to give it. Your problem sounds like a task 'sort -u' was made for.	[reply]
Re^2: Huge files manipulation by graff (Chancellor) on Nov 10, 2008 at 21:10 UTC
Unix and gnu sort rely on having an adequate amount of available disk space on /tmp; if the OP is using a machine with insufficient free space on /tmp, `sort -u` will fail.	[reply] [d/l]
Re^3: Huge files manipulation by JavaFan (Canon) on Nov 10, 2008 at 21:41 UTC
Only if you haven't told sort if can use another directory. (GNU) sort will either use the directory given with the -T option, the directory in the TMPDIR environment variable, or /tmp. In that order. The POSIX standard mentions that -T is undocumented, but existing in some implementations, and it encourages implementors to support the TMPDIR environment variable.	[reply]
Re^2: Huge files manipulation by rowdog (Curate) on Nov 10, 2008 at 22:49 UTC
I'm not surprised at all. HPUX is not GNU/Linux. The GNU tools are much more flexible and built to overcome the limitations found in many standard UNIX tools. As an example, I remember using `ls` in a directory with many files (>64k, oops) and HP's ls just barfed. Of course GNU's ls has no such problem. If it's optional in the standard, GNU has it but HPUX probably doesn't. Disclaimer: I haven't run any "big iron" in over 5 years so things may have changed.	[reply] [d/l]
Re: Huge files manipulation by educated_foo (Vicar) on Nov 10, 2008 at 12:09 UTC
You need to do some sort of divide-and-conquer approach. For example, you could split up the data into a bunch of temporary files according to the first 2-3 fields, uniquify each of those files, then cat them back together into one big file.	[reply]
Re: Huge files manipulation by mirod (Canon) on Nov 10, 2008 at 12:33 UTC
I am curious to know why is `sort` (the unix tool) not working? If the file is unsorted (on the composite key) and you care about the order of the lines, why not add the line number, sort -u on the key, sort on the line number and then `cut` the line number? If the problem is getting the key using the admittedly obscure `sort` options, you can have a perl (or awk, or whatever) script gather the complete key and add it as an extra field on each line, so that sort can find it easily. Incidentally, this is very similar to a Schwartzian transform.	[reply]
Re: Huge files manipulation by gone2015 (Deacon) on Nov 10, 2008 at 20:35 UTC
The simple Perl approach is to read the file and put all the 'keys' into a hash, which will tell you which are duplicated. But hashes are memory hungry, and it's not surprising that this won't work for your large files. If you expect a relatively small number of duplicate keys, then the divide and conquer approach would be to eliminate as many not-duplicates as possible -- to get down to a set of keys that that can be held in a hash. The following is what I have in mind: read the original file, identify each key and run it through a digest function (the module String::CRC32 does a good job of returning a 32 bit digest). Use a bitmap to spot repeat digest values. This requires 512MB of memory. See sketch for how to do that, below. Record digest values that appear more than once. discard the bitmap and load the recorded digest values into a hash. read the original file again, identify each key and again run it through the digest function. Any key whose digest is not in the hash of digests, you can ignore. The remainder are potential duplicates, which you put into another hash in order to identify true duplicates. now you have a hash from which you can extract the actual duplicate keys. noting that how well this will work depends largely on how many duplicates there are. By way of an estimate: the small sample you gave suggesting a line length of more than 100 bytes, so a file of ~2G contains ~2E7 keys. If the keys were all different, and using CRC32 as the digest, I would expect ~1E5 candidate digests to be collected by Step 1 -- that is 1/200 of the original keys -- which looks tractable. If only a small number of keys are duplicated, then this is a reasonable estimate of the memory requirement. If 5% of the keys are repeated once each, then the number of real duplicates dominates. This would give ~1E6 keys, which I would expect to still be tractable -- even assuming 200-300 bytes per key, this is less than the 512M required for the bitmap. However, if 20% of the keys are repeated once each, then that's ~4E6 keys in the final hash, which may still be a challenge memory-wise along side the hash containing the ~4E6 key digests. Read more... See the code (8 kB)	[reply] [d/l]
Re: Huge files manipulation by blazar (Canon) on Nov 11, 2008 at 12:20 UTC
I personally believe that have been nice of you to mention that you asked the very same question elsewhere. (Also available for webby guys!) Did they help you there? How did they fail to do so? `--` If you can't understand the incipit, then please check the IPB Campaign.	[reply] [d/l]


laziness, impatience, and hubris
	PerlMonks