http://qs321.pair.com?node_id=722634

klashxx has asked for the wisdom of the Perl Monks concerning the following question:


Hi, I need a fast way to delete duplicate entries from very huge files (>2 GB); these files are in plain text.
To clarify, this is the structure of the file:
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|1804|00|||+0000000000000,00|||+0000000000000,00

Using a key formed by the first 7 fields, I want to print or delete only the duplicates (the delimiter is the pipe).
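In other words, the filtering logic I'm after looks roughly like this (a rough, untested sketch; fields 0..6 are the 7 key fields from the sample above, and it prints only the second and later occurrences):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Print only the duplicate records, keyed on the first 7
    # pipe-separated fields.
    # NOTE: this keeps every distinct key in memory, which is exactly
    # what blows up on the real (>2 GB) files.
    my %seen;
    while ( my $line = <> ) {
        my $key = join '|', ( split /\|/, $line )[ 0 .. 6 ];
        print $line if $seen{$key}++;    # 2nd and later occurrences only
    }

(I would run it as: perl script.pl hugefile > dups.out)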

I tried all the usual methods (awk / sort / uniq / sed / grep ...), but it always ended with the same result: out of memory!

I'm using large HP-UX servers.

I'm very new to Perl, but I read somewhere that the Tie::File module can handle very large files. I tried it but cannot get the right code...
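Would something along these lines be the right direction? This is only a rough, untested sketch; it uses DB_File instead of Tie::File so that the table of seen keys lives on disk rather than in RAM, and seen.db is just a scratch file name I made up:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Tie the "seen" hash to an on-disk Berkeley DB file so the key
    # table does not have to fit in memory.
    my %seen;
    tie %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    while ( my $line = <> ) {
        my $key = join '|', ( split /\|/, $line )[ 0 .. 6 ];
        if ( exists $seen{$key} ) {
            print $line;          # duplicate record
        }
        else {
            $seen{$key} = 1;      # first time this key is seen
        }
    }

    untie %seen;
    unlink 'seen.db';             # throw away the scratch file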

Any advice will be very welcome.

Thank you in advance.

Regards

PS: I do not want to split the files.