Re: selecting columns from a tab-separated-values file

by topher (Scribe)
on Jan 24, 2013 at 20:33 UTC


in reply to selecting columns from a tab-separated-values file

Here are a few suggestions based on my experience processing lots of very large log files (not a perfect match for your situation, but much of it applies here).

  • First suggestion, which I think you've now tested, is to try splitting the logic into separate programs. If your processing is CPU-bound, splitting the work into multiple piped programs lets each stage run on its own CPU, and this will frequently outperform a single-program solution (there's a rough sketch of this, combined with the next point, after the list).

  • I also saw mention that it was a large file, 80GB or something like that? If it doesn't get updated frequently, you might want to try compressing the file. With a fast compression algorithm (gzip, lzo, etc.) you can often read and decompress the data faster than you could read the uncompressed data from disk, and with multiple CPUs this can be a net performance win. This is worth testing, of course, as it will depend heavily on your specific circumstances (disk speed, CPU speed, RAM and disk caching, etc.).

  • Another possible suggestion, depending on many hardware/system factors, would be to split the work up along the lines of Map-Reduce: split your file (physically or logically), process the chunks with separate programs, then combine the results. A naive example might be a program that gets the file size, breaks it into 10GB chunks, then forks a corresponding number of child processes, each of which works on its chunk of the file and returns results to be aggregated at the end (see the fork sketch below).

  • If you don't need to return a result set for every single record in the file every time, then the suggestion to try a real database is an excellent one. I love SQLite, and it can handle quite a bit (although 80GB might be pushing it), but if you only want results for a smaller matching subset of the data, you're almost certainly going to win big with a database (a minimal DBI/SQLite sketch follows the list).

  • If you really wanted to squeeze every last bit of performance out of this and optimize it to the extreme, you'd do well to read about the Wide Finder project/experiments kicked off by Tim Bray.

    It's worth googling for "Wide Finder" and checking out some of the other write-ups discussing it, too.
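
To make the first two points concrete, here's a minimal sketch of a Perl filter that reads from a zcat pipe and writes through a gzip pipe, so decompression, the Perl work, and recompression each run as separate processes (the file names and the column numbers 3, 1, 6 are just placeholders):

#!/usr/bin/perl
use strict;
use warnings;

# zcat, this filter, and gzip each run as their own process, so on a
# multi-CPU box the decompress/process/compress stages overlap.
open my $in,  '-|', 'zcat', 'datafile.gz'  or die "zcat: $!";
open my $out, '|-', 'gzip -c > output.gz'  or die "gzip: $!";

while (my $line = <$in>) {
    chomp $line;
    my @f = split /\t/, $line, -1;                # -1 keeps trailing empty fields
    print {$out} join("\t", @f[2, 0, 5]), "\n";   # columns 3, 1, 6
}

close $in;
close $out or die "gzip pipe failed (status $?)";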
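
Along the same lines, here's a very rough sketch of the fork-per-chunk idea, assuming an uncompressed copy of the file; the file name, the four-way split, and the column numbers are all made up for illustration:

#!/usr/bin/perl
use strict;
use warnings;

my $file   = 'datafile.tsv';    # hypothetical uncompressed input
my $nprocs = 4;

my $size  = -s $file or die "can't stat $file";
my $chunk = int($size / $nprocs) + 1;

my @kids;
for my $i (0 .. $nprocs - 1) {
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid) { push @kids, $pid; next; }

    # Child: work on one byte range of the file; the seek/skip below
    # ensures each line is handled by exactly one child.
    my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
    open my $in,  '<', $file     or die "$file: $!";
    open my $out, '>', "part.$i" or die "part.$i: $!";

    seek $in, $start, 0;
    <$in> if $start > 0;          # partial first line belongs to the previous chunk

    while (tell($in) <= $end) {
        defined(my $line = <$in>) or last;
        chomp $line;
        print {$out} join("\t", (split /\t/, $line)[2, 0, 5]), "\n";
    }
    exit 0;
}

waitpid $_, 0 for @kids;
# "Reduce" step: concatenate part.0 .. part.3 in order to get the final output.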
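
And for the database route, a minimal DBI/DBD::SQLite sketch; the six generic columns, the file names, and the example query are assumptions, not your actual schema:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# One-time load of the TSV into a table.
$dbh->do('CREATE TABLE IF NOT EXISTS records
          (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, c6 TEXT)');
my $ins = $dbh->prepare('INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)');

open my $in, '<', 'datafile.tsv' or die $!;
while (my $line = <$in>) {
    chomp $line;
    $ins->execute( (split /\t/, $line, -1)[0 .. 5] );
    # (for a file this size you'd want to commit in batches)
}
$dbh->commit;
$dbh->do('CREATE INDEX IF NOT EXISTS idx_c3 ON records (c3)');
$dbh->commit;

# After that, queries only touch the matching subset instead of all 80GB.
my $rows = $dbh->selectall_arrayref(
    'SELECT c3, c1, c6 FROM records WHERE c3 = ?', undef, 'some_key');
print join("\t", @$_), "\n" for @$rows;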

Much as I love Perl, I probably would have done something like this as a first shot for the described processing:

$ zcat datafile.gz | awk -F'\t' -v OFS='\t' '{print $3,$1,$6}' | gzip -c > output.gz
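
And if you'd rather keep it in Perl, a roughly equivalent one-liner (same caveat: not tested against your data) would be:

$ zcat datafile.gz | perl -F'\t' -lane 'print join "\t", @F[2,0,5]' | gzip -c > output.gz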
