http://qs321.pair.com?node_id=408706

Lexicon has asked for the wisdom of the Perl Monks concerning the following question:

I don't think there's a good answer to this other than "Do it in C", but I thought I'd ask anyway. I'm reading in several hundred 20k-line standard data text files with space-delimited numbers, like so (though with up to 8 data columns):
# time      data1     data2
0.000000    99.537    54.54
1.000000    100.273   121.54
2.000000    98.169    121.58
3.000000    105.835   99.66
4.000000    93.013    1.85
The time spent splitting the lines is about 10% of my program's run time, so I was wondering if there was an easy way to speed it up. Just reading the files from disk seems to account for about half my runtime, so no dramatic improvements are possible. But look at the code; maybe a restructuring would be faster? Here's what I'm doing now:
# Each data point is stored in a separate file, so they have
# to be joined first. I figure the shell is faster at that
# than perl. That and I wasn't excited about managing
# n file handles at a time, though it wouldn't be too bad.
my @lines = `join file1 file2 file3 file4`;

foreach my $line (@lines) {
    my ($time, @data) = split /\s+/, $line;
    foreach my $datum (@data) {
        # unlock the secrets of the universe
    }
}

Replies are listed 'Best First'.
Re: Speed of Split
by Corion (Patriarch) on Nov 18, 2004 at 07:44 UTC

    These fields are "space delimited", but they also seem to be in a fixed column format, so take a look at pack/unpack and their various parameters. This would save you the trip to the regex engine, at the price of some more parsing time spent in the number parser, which now has to skip the spaces. You can maybe save more time by making @lines and @data live in a larger scope, and you can save time by not shelling out to the join program:

    local @ARGV = ($file1, $file2, $file3, $file4);
    my (@lines) = <>;
    my (@data);
    for my $line (@lines) {
        @data = split /\s+/, $line;
        ...
    };
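    A minimal sketch of the unpack route, reusing @lines from above and assuming three fixed-width 10-character columns (the widths are invented here and would have to be measured from the real files):

    for my $line (@lines) {
        # 'A10' grabs 10 characters and strips trailing spaces,
        # so the fields come out clean
        my ($time, @data) = unpack 'A10 A10 A10', $line;
        ...
    }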

      I think you're confusing join with cat.

      $ cat a
      1 a
      2 A
      $ cat b
      1 b
      2 B
      $ join a b
      1 a b
      2 A B

      Your code would return:

      1 a
      2 A
      1 b
      2 B
Re: Speed of Split
by ysth (Canon) on Nov 18, 2004 at 07:46 UTC
    I hesitate to mention this, but...

    There is currently a minor optimization in split when assigning to a global array. You might give that a try:

    local @data;
    foreach my $line (@lines) {
        @data = split /\s+/, $line;
        $time = shift @data;
        foreach my $datum (@data) {
        }
    }
    untested, untimed.
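    One quick way to time the difference, though (a sketch only; the 9-column test line is invented to resemble the poster's data):

    use Benchmark ();

    our @global;     # package array: eligible for the split optimization
    my  @lexical;    # lexical array: plain list assignment
    my  $line = join ' ', '0.000000', map { sprintf '%.3f', rand 1000 } 1 .. 8;

    Benchmark::cmpthese(0, {
        global  => sub { @global  = split /\s+/, $line },
        lexical => sub { @lexical = split /\s+/, $line },
    });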
      Odds are I'll go another route, but out of curiosity, does this speedup hold when assigning to something like @NAMESPACE::data?
        Yes, it does.
        $ perl -MO=Concise,-exec -e'@NAMESPACE::data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  </> pushre(/" "/ => @data) s/64
        4  <#> gvsv[*x] s
        5  <$> const[IV 0] s
        6  <@> split[t4] vK
        7  <@> leave[1 ref] vKP/REFC
        -e syntax OK
        $ perl -MO=Concise,-exec -e'@data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  </> pushre(/" "/ => @data) s/64
        4  <#> gvsv[*x] s
        5  <$> const[IV 0] s
        6  <@> split[t4] vK
        7  <@> leave[1 ref] vKP/REFC
        -e syntax OK
        $ perl -MO=Concise,-exec -e'my @data; @data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  <0> padav[@data:1,2] vM/LVINTRO
        4  <;> nextstate(main 2 -e:1) v
        5  <0> pushmark s
        6  </> pushre(/" "/) s/64
        7  <#> gvsv[*x] s
        8  <$> const[IV 0] s
        9  <@> split[t3] lK
        a  <0> pushmark s
        b  <0> padav[@data:1,2] lRM*
        c  <2> aassign[t4] vKS
        d  <@> leave[1 ref] vKP/REFC
        -e syntax OK

        See above: in the first two cases the array assignment is completely bypassed; instead the array is attached in a funny kind of way to the regex, which is then passed to split, and split places the results directly into the array instead of returning them.

      I also noticed that split in void context, although deprecated, is a little bit faster (or it was the last time I checked).

      split; # splits into @_
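      A tiny demo of that deprecated behaviour (it only works on perls of that era; the implicit split to @_ was eventually removed in later versions):

      $_ = '0.000000 99.537 54.54';
      split;              # void context: fields land in @_ (with a warning)
      print "$_[1]\n";    # prints 99.537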
Re: Speed of Split
by ikegami (Patriarch) on Nov 18, 2004 at 08:10 UTC

    This may not be representative, but a simple test shows that a regexp could be much, much faster here:

    use Benchmark ();

    our @data;
    my $line = '1.000000 '
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58';

    Benchmark::cmpthese(0, {
        split        => sub { @data = split(/\s+/, $line) },
        fixed_length => sub { @data = $line =~ /^.{8} {6}(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$/ },
        var_length   => sub { @data = $line =~ /^.{8}\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$/ },
    });
    __END__
                      Rate        split   var_length fixed_length
    split          63116/s           --         -30%         -87%
    var_length     90310/s          43%           --         -81%
    fixed_length  482454/s         664%         434%           --

    Of course, fixed_length would assume that you do your own joining, since join would not preserve field widths.

      You missed a few alternatives. I've added them myself, and ran the benchmark again.

      I must say that they perform rather poorly.

      Benchmark::cmpthese(0, {
          split        => sub { @data = split(/\s+/, $line) },
          fixed_length => sub { @data = $line =~ /^.{8} {6}(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$/ },
          var_length   => sub { @data = $line =~ /^.{8}\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$/ },
          g            => sub { @data = $line =~ /\S+/g; },
          unpack       => sub { @data = unpack 'A8x6A10A10A10A10A10A10A10A10A10A10A10A10', $line },
      });

      Result:

                        Rate      g unpack  split var_length fixed_length
      g             16954/s     --   -54%   -70%       -76%         -96%
      unpack        36961/s   118%     --   -35%       -47%         -91%
      split         56965/s   236%    54%     --       -19%         -86%
      var_length    70373/s   315%    90%    24%         --         -83%
      fixed_length 408377/s  2309%  1005%   617%       480%           --

      You ignore the first field, I include it... but that shouldn't matter much.

      Here's what your benchmarks result in on one of my machines:

                        Rate  split fixed_length var_length
      split          26023/s     --         -64%       -93%
      fixed_length   71906/s   176%           --       -81%
      var_length    378591/s  1355%         427%         --

      This is with perl v5.6.1, built for i386-linux, on a 333MHz Celeron.


Re: Speed of Split
by Random_Walk (Prior) on Nov 18, 2004 at 08:53 UTC

    If each data point is in a separate file, why join them only to split them again? If your files are nicely arranged so that for each time interval you have eight data files, then just read those 8 directly into an AoA, as sketched below. Can you give some examples of what the pre-joined source files look like? You stand to gain by not shelling out of Perl to join, and by not splitting. If the files are of fixed record length, there may be even more optimisation possible.
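    Something like this, say (a rough sketch, assuming each file has one '#' header line and an identical, ordered time column; the file names are placeholders):

    my (@time, @aoa);    # @aoa: one array of data values per file
    for my $i (0 .. 7) {
        open my $fh, '<', "data$i.txt" or die "data$i.txt: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^\s*#/;      # skip the '# time data' header
            my ($t, $d) = split ' ', $line;
            push @time, $t if $i == 0;     # the time column is shared
            push @{ $aoa[$i] }, $d;
        }
    }
    # $aoa[$file][$row] is now the datum from file $file at time $time[$row]

    Reading the files one after another like this also sidesteps managing n file handles at once.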

    Cheers,
    R.

      A fine question. I'm uncertain what assumptions I can make about the data files, as I don't control the code which generates them. Each individual data file looks like:
      # time      data
      0.000000    99.537
      1.000000    100.273
      2.000000    98.169
      3.000000    105.835
      4.000000    93.013
      5.000000    96.145
      6.000000    87.040
      7.000000    97.764
      8.000000    97.811
      I have to join the data files based on the time point. I can probably assume that the time points will be ordered and the same in each file, and also that each file has a fixed column width that, worst case, I can calculate per file. I cannot assume the time points will always be integers. I was being conservative when I wrote this, but now it seems to be my bottleneck. I am sending an email to the other developer asking what guarantees we can work out about it.
      Making some assumptions and writing some 20 lines of custom import code has made the whole program roughly 3x faster (about 5 minutes per set of data files on a 900MHz Athlon).
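      A sketch of what such import code might look like (not the actual code; it assumes whitespace-separated columns and verifies the shared, ordered time column as it merges):

      sub import_set {
          my @files = @_;
          my @fh = map { open my $h, '<', $_ or die "$_: $!"; $h } @files;
          readline $_ for @fh;    # skip the one-line '# time data' headers

          my @rows;
          while (defined(my $line = readline $fh[0])) {
              my ($time, @data) = split ' ', $line;
              for my $h (@fh[1 .. $#fh]) {
                  my $next = readline $h;
                  my ($t, $d) = split ' ', $next;
                  die "time mismatch: $t != $time" unless $t == $time;
                  push @data, $d;
              }
              push @rows, [ $time, @data ];
          }
          return \@rows;    # one [ $time, @data ] row per time point
      }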