PerlMonks  

Need to process a tab delimited file *FAST*

by devnul (Monk)
on Mar 03, 2004 at 04:33 UTC ( [id://333450] )

devnul has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a segment of code similar to the following:
chomp($line);
my ($tstamp, @split) = split("\t", $line, 16);
my %hash;
foreach (@split) {
    my ($key, $val) = split('=', $_, 2);
    $hash{$key} = $val;
}
I'm wondering what the fastest way to accomplish this might be. The data lines are essentially tab-delimited key=value pairs. Thanks for any thoughts you may have on this!
Regards,

Greg


Replies are listed 'Best First'.
Re: Need to process a tab delimited file *FAST*
by kvale (Monsignor) on Mar 03, 2004 at 04:59 UTC
    The module Text::CSV_XS parses records in XS code, which may be faster than a pure Perl solution. It can handle tab-delimited files as well:
    use Text::CSV_XS;
    my $csv = Text::CSV_XS->new({ sep_char => "\t" });
    my $columns_ref = $csv->getline($io);
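    A minimal sketch of how that might be applied to the key=value lines described above (the file name and the per-field key/value split are my own additions here, not part of Text::CSV_XS itself):

    use strict;
    use Text::CSV_XS;

    # hypothetical log file of tab-delimited key=value records
    open my $io, '<', 'records.log' or die "open: $!";

    my $csv = Text::CSV_XS->new({ sep_char => "\t", quote_char => undef });

    while (my $columns = $csv->getline($io)) {
        my ($tstamp, @pairs) = @$columns;
        my %hash = map { split /=/, $_, 2 } @pairs;
        # ... aggregate %hash here ...
    }
    close $io;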

    -Mark

Re: Need to process a tab delimited file *FAST*
by devnul (Monk) on Mar 03, 2004 at 09:04 UTC
    Thanks for your help everyone...
    Here's where I am at now:
    #1. Text::CSV_XS is extremely slow for some reason.
    #2. SQLite, etc. are not feasible here...
    #3. I would like to see a 10-fold performance increase over the figures I describe below....
    #4. So far Storable seems like the best option, but this leaves me with another problem... Please allow me to explain further:

    This is essentially a "log file" containing something which might be represented this way:
    hash->{KEY1} = VALUE1
    hash->{KEY2} = VALUE2
    hash->{KEY3}->{SUBKEY1} = VALUE3
    hash->{KEY3}->{SUBKEY2} = VALUE4
    hash->{KEY4} = VALUE5


    So each "record" currently contains 16 keys (this is expected to grow).

    That being said, here's the 'rest of the story': I need to compute TOTAL, MAXIMUM, and LAST values for each element of the hash (including the "SUBKEY"s) over several periods of time. So for a time period of, say, an hour, I would be looking for the total, maximum, and last value of each of those values. The way I did this before was to build a second hash, modeled after the first, with the keys changed to "TOT::(key)", "MAX::(key)", and "LAST::(key)".
    Benchmarks have shown me that doing the "split/map" method and computing these values takes somewhere around .0008 seconds (CPU time) per record. Using the Storable method, I am able to read the original hash in very quickly, but when iterating through it I am still seeing benchmarks of .0007 seconds. There are enough records that doing all of this at .0008 seconds each ends up taking more than 60 seconds of machine time. While this may not seem enormous, it pushes me beyond the window I have to work in.
    Here is the snippet I am using to iterate through the hash (note that I iterate through the records in reverse order, newest to oldest). I am also typing this from memory, so it may not be exactly correct:
    my @keys = keys(%{$hash});
    my $junk;
    foreach my $key (@keys) {
        if (ref($hash->{$key})) {
            my @keys2 = keys(%{$hash->{$key}});
            foreach my $key2 (@keys2) {
                $junk->{'TOT::'.$key}->{$key2} += $hash->{$key}->{$key2};
                if ($hash->{$key}->{$key2} > $junk->{'MAX::'.$key}->{$key2}) {
                    $junk->{'MAX::'.$key}->{$key2} = $hash->{$key}->{$key2};
                }
                if (!defined($junk->{'LAST::'.$key}->{$key2})) {
                    $junk->{'LAST::'.$key}->{$key2} = $hash->{$key}->{$key2};
                }
            }
        }
        else {
            $junk->{'TOT::'.$key} += $hash->{$key};
            if ($hash->{$key} > $junk->{'MAX::'.$key}) {
                $junk->{'MAX::'.$key} = $hash->{$key};
            }
            if (!defined($junk->{'LAST::'.$key})) {
                $junk->{'LAST::'.$key} = $hash->{$key};
            }
        }
    }

    Thanks, again, for all the helpful replies!

    You folks are great!

    - Greg
      An obvious performance problem is that you keep on accessing $hash->{$key}. Perl has to do the work of a hash lookup every time. Insert my $value = $hash->{$key}; and access $value repeatedly instead.

      Another small speedup: don't use keys to dump the key list into an array that only exists for you to run through it. Just loop over the keys directly.
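      Here is a small self-contained sketch of both points (the data is invented, just to show the shape of the loop):

      use strict;

      # toy record standing in for one parsed line
      my $hash = { hits => 3, bytes => 1200 };
      my $junk = {};

      for my $key (keys %$hash) {        # loop over the key list directly, no @keys copy
          my $value = $hash->{$key};     # one hash lookup, reused below
          $junk->{'TOT::' . $key} += $value;
          $junk->{'MAX::' . $key} = $value
              if !defined $junk->{'MAX::' . $key}
              || $value > $junk->{'MAX::' . $key};
          $junk->{'LAST::' . $key} = $value
              unless defined $junk->{'LAST::' . $key};
      }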

      If you can figure out how to push both the data and this work to a decent relational database, you will see bigger speedups still.

      Moving to a faster machine always helps. If your code is I/O bound and you can parallelize it, then that can speed things up. The same happens if you are CPU bound and have multiple CPUs or computers to take advantage of.

      But at 0.0008 seconds per record, if it takes over 60 seconds, then you have over 75,000 records. At some point you have to understand that computers are not magic. Doing work takes time. You aren't going to hit a particular performance level just because the powers that be said that you must.

        Specifically, tilly's advice can be implemented with judicious use of each. Doing this transforms your example into:
        my $junk;
        while (my ($outerkey, $outerval) = each %$hash) {
            if (ref($outerval)) {
                while (my ($innerkey, $innerval) = each %$outerval) {
                    $junk->{'TOT::'.$outerkey}->{$innerkey} += $innerval;
                    if ($innerval > $junk->{'MAX::'.$outerkey}->{$innerkey}) {
                        $junk->{'MAX::'.$outerkey}->{$innerkey} = $innerval;
                    }
                    if (!defined($junk->{'LAST::'.$outerkey}->{$innerkey})) {
                        $junk->{'LAST::'.$outerkey}->{$innerkey} = $innerval;
                    }
                }
            }
            else {
                $junk->{'TOT::'.$outerkey} += $outerval;
                if ($outerval > $junk->{'MAX::'.$outerkey}) {
                    $junk->{'MAX::'.$outerkey} = $outerval;
                }
                if (!defined($junk->{'LAST::'.$outerkey})) {
                    $junk->{'LAST::'.$outerkey} = $outerval;
                }
            }
        }
        Now, you've still got recalculations of $junk->{'MAX::'.$outerkey} and $junk->{'LAST::'.$outerkey} (each involving a string concatenation and a hash lookup) on every pass, whereas it might be possible to have only one, but I'll let you look at this first, to see if it gives acceptable speed improvements (not likely, given that this is squeezing out the last few extra percent, and you want 90% of the time gone).
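        For what it's worth, those repeated concatenations and lookups could be hoisted by grabbing a reference to each summary sub-hash once per outer key. A sketch with invented toy data:

        use strict;

        # toy nested record, just to show the shape of the data
        my $hash = { hits => 3, pages => { home => 2, about => 1 } };
        my $junk = {};

        while (my ($outerkey, $outerval) = each %$hash) {
            if (ref $outerval) {
                # build the 'TOT::'/'MAX::'/'LAST::' keys once per outer key
                my $tot  = $junk->{'TOT::'  . $outerkey} ||= {};
                my $max  = $junk->{'MAX::'  . $outerkey} ||= {};
                my $last = $junk->{'LAST::' . $outerkey} ||= {};
                while (my ($innerkey, $innerval) = each %$outerval) {
                    $tot->{$innerkey} += $innerval;
                    $max->{$innerkey} = $innerval
                        if !defined $max->{$innerkey} || $innerval > $max->{$innerkey};
                    $last->{$innerkey} = $innerval
                        unless defined $last->{$innerkey};
                }
            }
            else {
                $junk->{'TOT::'  . $outerkey} += $outerval;
                $junk->{'MAX::'  . $outerkey} = $outerval
                    if !defined $junk->{'MAX::' . $outerkey}
                    || $outerval > $junk->{'MAX::' . $outerkey};
                $junk->{'LAST::' . $outerkey} = $outerval
                    unless defined $junk->{'LAST::' . $outerkey};
            }
        }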

        One test I'd do if I were you, to make certain that the kind of speed you want is even vaguely possible, is to time the unix wc command against the input file. I often use wc as an absolute lower bound when looking at file processing speed issues, and figure that I've done as well as I can if I can get the time down to within twice the time that the wc executable takes.

        Also, on your initial text processing question, is this a faster way to split up the file?

        my %hash;
        while ($line = <INPUTFILE>) {
            chomp($line);
            while ($line =~ /\t([^=]*)=([^\t]*)/g) {
                $hash{$1} = $2;
            }
        }
        This assumes that you're discarding $tstamp, which you seem to be.
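        If you want to compare the two parsing approaches on your own data, Benchmark's cmpthese makes that easy. A sketch (the sample line below is invented):

        use strict;
        use Benchmark qw(cmpthese);

        # invented sample line in the format described above
        my $line = join "\t", '20040303043300', map { "key$_=value$_" } 1 .. 15;

        cmpthese(-2, {
            'split' => sub {
                my ($tstamp, @pairs) = split /\t/, $line, 16;
                my %hash = map { split /=/, $_, 2 } @pairs;
            },
            'regex' => sub {
                my %hash;
                while ($line =~ /\t([^=]*)=([^\t]*)/g) {
                    $hash{$1} = $2;
                }
            },
        });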
      If the I/O isn't your problem then consider the following:

      * Don't use Perl. Re-write this in C, keep your keys in a tree of some kind (determined by the distribution of the keys) and the values (tot, max, last) in the tree. It's always faster to use a specifically designed tool (a single-purpose, well written C program) than a general purpose multitool (like Perl).

      * If you have any control over the input stream: instead of tab-delimited data, use space/column-delimited data. It's (probably) faster to extract two substrings of known width than to split. Leave the whitespace on the keys when storing, remove it later when redisplaying (if necessary).

      Hi,

      I just wonder if splitting the workload between two or more instances of your script would help, whether that means updating a shared memory structure at the end of the processing or something similar (please forgive me if this doesn't make a lot of sense - I have the flu and am at work).

      For example, script #1 would process the first 10,000 lines, script #2 the second 10,000 lines, and so on; each would then update a common structure at the end of its processing, or pipe its results in parallel to a third script that merges the data.
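      Something along those lines could be sketched with fork and Storable (the file names are invented; each child takes every Nth line and dumps its partial totals for a later merge step):

      use strict;
      use Storable qw(store);

      my $workers = 4;
      my @pids;

      for my $w (0 .. $workers - 1) {
          my $pid = fork;
          die "fork failed: $!" unless defined $pid;
          if ($pid) { push @pids, $pid; next; }

          # child: handle every $workers-th line, starting at offset $w
          open my $in, '<', 'records.log' or die "open: $!";
          my %tot;
          while (my $line = <$in>) {
              next unless ($. - 1) % $workers == $w;
              chomp $line;
              my ($tstamp, @pairs) = split /\t/, $line, 16;
              for (@pairs) {
                  my ($key, $val) = split /=/, $_, 2;
                  $tot{$key} += $val;
              }
          }
          close $in;
          store \%tot, "partial.$w";   # a third script (or the parent) merges these
          exit 0;
      }

      waitpid $_, 0 for @pids;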

      jason

Re: Need to process a tab delimited file *FAST*
by Roger (Parson) on Mar 03, 2004 at 04:42 UTC
    Not significantly faster, but certainly less code...
    chomp(my @rec = split /\t/, $line, 16);
    my %hash = map { split /=/ } @rec[1 .. $#rec];

      Yeah, that's less code, but as you pointed out, not much faster... :-(
      Is this basically "as fast as it gets" when trying to parse this type of file? I have the option of creating some other type of datastore (as long as it can be stored in a file). Would tying a DBM file possibly be faster? (I plan on testing this, but I must admit to not being overly optimistic.)
      - Greg
        There are a couple of problems with your current "flat file" solution (and all flat-file alternative solutions). First, you are reading it in line by line. Sure, you could slurp, but even then there's not a huge speed difference, and you run into scalability problems.

        Second, you're splitting lines on various things. split uses a regexp-like thingy, and that means revving up the regexp engine, which while well optimized isn't as fast as non-regexp alternatives. The problem is that your current file format doesn't lend itself well to non-regexp (and non-split) alternatives.

        If you have control over the datasource, there are a few things you can do for better speed.

        One suggestion is, instead of using "\n"-delimited records and "\t"-delimited key/value pairs, to go with a more "regular" format. One possibility would be fixed-width fields. With that sort of solution, at least you can unpack each record. That's going to be faster than splitting on a RE. If each "line" (or record) is of equal byte-length, and each key/value within each record is of fixed width, you can use seek and tell to jump around in the file, and unpack to grab keys/values from each record. It's pretty hard to beat that for speed, within Perl.
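        For instance, a fixed-width layout might be handled like this (the field widths here are invented for illustration):

        use strict;

        # say each record is a 14-byte timestamp plus 16 space-padded
        # 20-byte values, followed by a newline (widths made up)
        my $template   = 'A14' . ('A20' x 16);
        my $record_len = 14 + 20 * 16 + 1;

        open my $fh, '<', 'records.fixed' or die "open: $!";

        # random access is cheap: record $n starts at byte $n * $record_len
        # seek $fh, $n * $record_len, 0;

        while (read $fh, my $record, $record_len) {
            my ($tstamp, @values) = unpack $template, $record;
            # ... aggregate @values here ...
        }
        close $fh;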

        Another possibility is to abandon the flat file, and go with a database. You mentioned that you wanted to maintain a single-file for your data though. Ok, no problem. Use DBD::SQLite. It is a pretty fast database implementation that stores all of the data in one single file. There is database overhead to consider, but scalability is good, and you don't need to be as careful about maintaining equal-byte-length records with fixed-width fields.
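        With DBD::SQLite, the TOTAL/MAXIMUM work can even be pushed into SQL entirely. A sketch (the table and column names are made up):

        use strict;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=stats.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS samples (
                tstamp INTEGER,
                name   TEXT,
                value  REAL
            )
        });

        # load the parsed key/value pairs in one transaction ...
        my $ins = $dbh->prepare(
            'INSERT INTO samples (tstamp, name, value) VALUES (?, ?, ?)');
        # $ins->execute($tstamp, $key, $val) for each pair parsed from the log
        $dbh->commit;

        # ... then let SQL compute the per-key totals and maximums for a window
        my $agg = $dbh->prepare(q{
            SELECT name, SUM(value), MAX(value)
              FROM samples
             WHERE tstamp BETWEEN ? AND ?
             GROUP BY name
        });
        $agg->execute(0, time);    # placeholder time window
        while (my ($name, $total, $max) = $agg->fetchrow_array) {
            print "$name: total=$total max=$max\n";
        }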

        And yet another possibility is to use the Storable module to freeze and thaw your datastructures. The module is written in XS (if I'm not mistaken) and optimized for speed already. It's not as scalable a solution, but speed is pretty good.
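        Storable's store and retrieve make that about as simple as it gets (the file name here is invented):

        use strict;
        use Storable qw(store retrieve);

        # freeze a batch of parsed records to disk once ...
        my $records = [ { hits => 3, pages => { home => 2 } } ];   # toy data
        store $records, 'records.sto';

        # ... and thaw them back later, much faster than re-parsing the text
        my $again = retrieve 'records.sto';
        print scalar(@$again), " record(s) read back\n";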


        Dave

        Your database (whether RDBMS or a tied hash or whatever) will be the most expensive thing you do regardless.
Re: Need to process a tab delimited file *FAST*
by TomDLux (Vicar) on Mar 03, 2004 at 06:59 UTC

    To what degree is the current code too slow? Is it a factor of a million? a thousand? two? 10 percent?

    How massive is the data you are processing? Benchmark the current code on a thousand lines, ten thousand lines, a million lines of your data. What is the projected time for processing the entire input?

    Does there already exist a program to process the files? Are there better algorithms? Is there a way to avoid processing the file at all?

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Need to process a tab delimited file *FAST*
by flounder99 (Friar) on Mar 03, 2004 at 04:55 UTC
    If you are sure that your line is in the form tstamp\tkey1=val1\tkey2=val2... then
    my ($tstamp, %hash) = split /[\t=]/, $line;
    will work

    --

    flounder

      Alternatively,
      my ($tstamp, %hash) = $line =~ /([^\t=]+)/g;
Re: Need to process a tab delimited file *FAST*
by toma (Vicar) on Mar 05, 2004 at 05:47 UTC
    If you are doing math on hash keys, then you have tripped into a Perl problem: Perl treats the number as a string when it is used as a hash key and as a number while doing math. Converting between these two representations is relatively slow.

    If you can use mathematical algorithms for doing your statistics that do not involve number-as-hash-key then you might easily achieve a 10X improvement. Especially if the math is done in a module that is implemented with C code.
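    For instance, if the values for each key can be collected into plain arrays, List::Util (which is implemented in C) can produce the totals and maximums without any number-as-hash-key conversions. A sketch with invented data:

    use strict;
    use List::Util qw(sum max);

    # the values observed for one key over a time period
    my @values = (12, 7, 42, 3);

    my $total = sum @values;     # C-level loop, no stringified hash keys
    my $max   = max @values;
    my $last  = $values[-1];

    print "total=$total max=$max last=$last\n";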

    It should work perfectly the first time! - toma
