http://qs321.pair.com?node_id=408706

Lexicon has asked for the wisdom of the Perl Monks concerning the following question:

I don't think there's a good answer to this other than "Do it in C", but I thought I'd ask anyway. I'm reading in several hundred 20k-line standard data text files with space-delimited numbers, like so (though with up to 8 data columns):
# time      data1     data2
0.000000    99.537    54.54
1.000000    100.273   121.54
2.000000    98.169    121.58
3.000000    105.835   99.66
4.000000    93.013    1.85
The time spent splitting the lines is about 10% of my program's run time, so I was wondering if there was an easy way to speed it up. Just reading the files from disk seems to account for about half my runtime, so no dramatic improvements are possible. But look at the code; maybe a restructuring would be faster? Here's what I'm doing now:
# Each data point is stored in a separate file, so they have
# to be joined first. I figure the shell is faster at that
# than perl. That and I wasn't excited about managing
# n file handles at a time, though it wouldn't be too bad.
my @lines = `join file1 file2 file3 file4`;

foreach my $line (@lines) {
    my ($time, @data) = split /\s+/, $line;
    foreach my $datum (@data) {
        # unlock the secrets of the universe
    }
}

Replies are listed 'Best First'.
Re: Speed of Split
by Corion (Patriarch) on Nov 18, 2004 at 07:44 UTC

    These fields are "space delimited", but they also seem to be in a fixed column format, so take a look at pack/unpack and their various parameters. This would save you the trip to the regex engine, at the price of some more parsing time spent in the number parser, which now has to skip the spaces. You can maybe save more time by making @lines and @data live in a larger scope, and you can save time by not shelling out to the join program:

    local @ARGV = ($file1, $file2, $file3, $file4);
    my (@lines) = <>;
    my (@data);
    for my $line (@lines) {
        @data = split /\s+/, $line;
        ...
    };
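    A minimal sketch of the unpack route, reusing @lines from above and assuming three fixed-width 10-character columns (the widths are invented here and would have to be measured from the real files):

    for my $line (@lines) {
        # 'A10' grabs 10 characters and strips trailing spaces,
        # so the fields come out clean
        my ($time, @data) = unpack 'A10 A10 A10', $line;
        ...
    }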

      I think you're confusing join with cat.

      $ cat a
      1 a
      2 A
      $ cat b
      1 b
      2 B
      $ join a b
      1 a b
      2 A B

      Your code would return:

      1 a
      2 A
      1 b
      2 B
Re: Speed of Split
by ysth (Canon) on Nov 18, 2004 at 07:46 UTC
    I hesitate to mention this, but...

    There is currently a minor optimization in split when assigning to a global array. You might give that a try:

    local @data;
    foreach my $line (@lines) {
        @data = split /\s+/, $line;
        $time = shift @data;
        foreach my $datum (@data) {
        }
    }
    untested, untimed.
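    One quick way to time the difference, though (a sketch only; the 9-column test line is invented to resemble the poster's data):

    use Benchmark ();

    our @global;     # package array: eligible for the split optimization
    my  @lexical;    # lexical array: plain list assignment
    my  $line = join ' ', '0.000000', map { sprintf '%.3f', rand 1000 } 1 .. 8;

    Benchmark::cmpthese(0, {
        global  => sub { @global  = split /\s+/, $line },
        lexical => sub { @lexical = split /\s+/, $line },
    });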
      Odds are I'll go another route, but out of curiosity, does this speedup hold when assigning to something like @NAMESPACE::data?
        Yes, it does.
        $ perl -MO=Concise,-exec -e'@NAMESPACE::data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  </> pushre(/" "/ => @data) s/64
        4  <#> gvsv[*x] s
        5  <$> const[IV 0] s
        6  <@> split[t4] vK
        7  <@> leave[1 ref] vKP/REFC
        -e syntax OK
        $ perl -MO=Concise,-exec -e'@data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  </> pushre(/" "/ => @data) s/64
        4  <#> gvsv[*x] s
        5  <$> const[IV 0] s
        6  <@> split[t4] vK
        7  <@> leave[1 ref] vKP/REFC
        -e syntax OK
        $ perl -MO=Concise,-exec -e'my @data; @data = split / /, $x'
        1  <0> enter
        2  <;> nextstate(main 1 -e:1) v
        3  <0> padav[@data:1,2] vM/LVINTRO
        4  <;> nextstate(main 2 -e:1) v
        5  <0> pushmark s
        6  </> pushre(/" "/) s/64
        7  <#> gvsv[*x] s
        8  <$> const[IV 0] s
        9  <@> split[t3] lK
        a  <0> pushmark s
        b  <0> padav[@data:1,2] lRM*
        c  <2> aassign[t4] vKS
        d  <@> leave[1 ref] vKP/REFC
        -e syntax OK

        See above: in the first two cases the array assignment is completely bypassed; instead the array is attached in a funny kind of way to the regex, which is then passed to split, and split places the results directly into the array instead of returning them.

      I also noticed that split in void context, although deprecated, is a little bit faster (or it was the last time I checked).

      split; # splits into @_
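      A tiny demo of that deprecated behaviour (it only works on perls of that era; the implicit split to @_ was eventually removed in later versions):

      $_ = '0.000000 99.537 54.54';
      split;              # void context: fields land in @_ (with a warning)
      print "$_[1]\n";    # prints 99.537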
Re: Speed of Split
by ikegami (Patriarch) on Nov 18, 2004 at 08:10 UTC

    This may not be representative, but a simple test shows that a regexp could be much, much faster here:

    use Benchmark ();

    our @data;
    my $line = '1.000000 '
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58'
             . ' 100.273 121.54 98.169 121.58';

    Benchmark::cmpthese(0, {
        split        => sub { @data = split(/\s+/, $line) },
        fixed_length => sub { @data = $line =~ /^.{8} {6}(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$/ },
        var_length   => sub { @data = $line =~ /^.{8}\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$/ },
    });
    __END__
                      Rate        split   var_length fixed_length
    split          63116/s           --         -30%         -87%
    var_length     90310/s          43%           --         -81%
    fixed_length  482454/s         664%         434%           --

    Of course, fixed_length would assume that you do your own joining, since join would not preserve field widths.

      You missed a few alternatives. I've added them myself, and ran the benchmark again.

      I must say that they perform rather poorly.

      Benchmark::cmpthese(0, {
          split        => sub { @data = split(/\s+/, $line) },
          fixed_length => sub { @data = $line =~ /^.{8} {6}(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$/ },
          var_length   => sub { @data = $line =~ /^.{8}\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$/ },
          g            => sub { @data = $line =~ /\S+/g; },
          unpack       => sub { @data = unpack 'A8x6A10A10A10A10A10A10A10A10A10A10A10A10', $line },
      });

      Result:

                        Rate      g unpack  split var_length fixed_length
      g             16954/s     --   -54%   -70%       -76%         -96%
      unpack        36961/s   118%     --   -35%       -47%         -91%
      split         56965/s   236%    54%     --       -19%         -86%
      var_length    70373/s   315%    90%    24%         --         -83%
      fixed_length 408377/s  2309%  1005%   617%       480%           --

      You ignore the first field, I include it... but that shouldn't matter much.

      Here's what your benchmarks result in on one of my machines:

                        Rate  split fixed_length var_length
      split          26023/s     --         -64%       -93%
      fixed_length   71906/s   176%           --       -81%
      var_length    378591/s  1355%         427%         --

      This is with perl v5.6.1, built for i386-linux, on a 333MHz Celeron.


Re: Speed of Split
by Random_Walk (Prior) on Nov 18, 2004 at 08:53 UTC

    If each data point is in a separate file, why join them only to split them again? If your files are nicely arranged so that for each time interval you have eight data files, then just read those 8 directly into an AoA, as sketched below. Can you give some examples of what the pre-joined source files look like? You stand to gain by not shelling out of Perl to join, and by not splitting. If the files are of fixed record length, there may be even more optimisation possible.
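    Something like this, say (a rough sketch, assuming each file has one '#' header line and an identical, ordered time column; the file names are placeholders):

    my (@time, @aoa);    # @aoa: one array of data values per file
    for my $i (0 .. 7) {
        open my $fh, '<', "data$i.txt" or die "data$i.txt: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^\s*#/;      # skip the '# time data' header
            my ($t, $d) = split ' ', $line;
            push @time, $t if $i == 0;     # the time column is shared
            push @{ $aoa[$i] }, $d;
        }
    }
    # $aoa[$file][$row] is now the datum from file $file at time $time[$row]

    Reading the files one after another like this also sidesteps managing n file handles at once.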

    Cheers,
    R.

      A fine question. I'm uncertain what assumptions I can make about the data files, as I don't control the code which generates them. Each individual data file looks like:
      # time      data
      0.000000    99.537
      1.000000    100.273
      2.000000    98.169
      3.000000    105.835
      4.000000    93.013
      5.000000    96.145
      6.000000    87.040
      7.000000    97.764
      8.000000    97.811
      I have to join the data files based on the time point. I can probably assume that the time points will be ordered and the same in each file, and also that each file has a fixed column width that, worst case, I can calculate per file. I cannot assume the time points will always be integers. I was being conservative when I wrote this, but now it seems to be my bottleneck. I am sending an email to the other developer asking what guarantees we can work out about it.
      Making some assumptions and writing some 20 lines of custom import code has made the whole program roughly 3x faster (about 5 minutes per set of data files on a 900MHz Athlon).
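      A sketch of what such import code might look like (not the actual code; it assumes whitespace-separated columns and verifies the shared, ordered time column as it merges):

      sub import_set {
          my @files = @_;
          my @fh = map { open my $h, '<', $_ or die "$_: $!"; $h } @files;
          readline $_ for @fh;    # skip the one-line '# time data' headers

          my @rows;
          while (defined(my $line = readline $fh[0])) {
              my ($time, @data) = split ' ', $line;
              for my $h (@fh[1 .. $#fh]) {
                  my $next = readline $h;
                  my ($t, $d) = split ' ', $next;
                  die "time mismatch: $t != $time" unless $t == $time;
                  push @data, $d;
              }
              push @rows, [ $time, @data ];
          }
          return \@rows;    # one [ $time, @data ] row per time point
      }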