PerlMonks  

Re: Search and replace the word in Column 16

by imp (Priest)
on Jul 25, 2006 at 13:06 UTC ([id://563511])


in reply to Search and replace the word in Column 16

The most flexible solution, and the one least likely to confuse coworkers, would be to split the string, test column 16, replace it, and build a new string with join, e.g.:

    sub split_join {
        my $line = shift;
        my @tokens = split /[|]/, $line;
        if ($tokens[15] eq 'STOCK') {
            $tokens[15] = 'BOXXE';
            return join('|', @tokens);
        }
        else {
            return $line;
        }
    }

But the regex approach will run faster (by 77% according to my tests).

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = <DATA>;
    # print, not printf: the data line must not be treated as a format string
    print  "Original: $line";
    printf "   split: %s", split_join($line);
    printf "  simple: %s", simple_regex($line);

    cmpthese(5000, {
        splitjoin    => sub { split_join($line) },
        simple_regex => sub { simple_regex($line) },
    });

    sub split_join {
        my $line = shift;
        my @tokens = split /[|]/, $line;
        if ($tokens[15] eq 'STOCK') {
            $tokens[15] = 'BOXXE';
            return join('|', @tokens);
        }
        else {
            return $line;
        }
    }

    sub simple_regex {
        my $line = shift;
        # Same substitution without /x:
        #$line =~ s/^((?:[^|]*\|){15})STOCK/${1}BOXXE/;
        $line =~ s{^ ( (?: [^|]* \| ) {15} ) STOCK }
                  {${1}BOXXE}x;
        return $line;
    }

    __DATA__
    AT0000937503|20060530|||142.708534||GROUP AG|30618720||||OPEN|ISIN|4943402|VSE|STOCK|39600000|0.77320|STOCK|test

Results:

    Original: AT0000937503|20060530|||142.708534||GROUP AG|30618720||||OPEN|ISIN|4943402|VSE|STOCK|39600000|0.77320|STOCK|test
       split: AT0000937503|20060530|||142.708534||GROUP AG|30618720||||OPEN|ISIN|4943402|VSE|BOXXE|39600000|0.77320|STOCK|test
      simple: AT0000937503|20060530|||142.708534||GROUP AG|30618720||||OPEN|ISIN|4943402|VSE|BOXXE|39600000|0.77320|STOCK|test

                   Rate    splitjoin simple_regex
    splitjoin    4274/s           --         -44%
    simple_regex 7576/s          77%           --
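
One behavioural difference between the two approaches may be worth keeping in mind: split_join compares the whole of column 16 with eq, while the regex only requires the column to start with STOCK. The following is a minimal sketch of that difference, not part of the original post. The name exact_regex and the lookahead are assumptions added for illustration, as is the synthetic test line.

```perl
use strict;
use warnings;

# Regex from the post (non-/x form): rewrites any column-16 value that
# *starts with* STOCK.
sub simple_regex {
    my $line = shift;
    $line =~ s/^((?:[^|]*\|){15})STOCK/${1}BOXXE/;
    return $line;
}

# Hypothetical stricter variant: the lookahead only succeeds when STOCK is
# followed by a delimiter, a newline, or the end of the string.
sub exact_regex {
    my $line = shift;
    $line =~ s/^((?:[^|]*\|){15})STOCK(?=[|\n]|\z)/${1}BOXXE/;
    return $line;
}

# 15 filler columns, then a column-16 value that merely begins with STOCK.
my $line = join('|', ('x') x 15, 'STOCKS', 'tail') . "\n";

print simple_regex($line);   # column 16 becomes BOXXES
print exact_regex($line);    # line is printed unchanged
```

Whether the prefix match matters depends on the data; if column 16 can only ever hold whole words such as STOCK, the simpler pattern from the post behaves the same as the equality test.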

Replies are listed 'Best First'.
Re^2: Search and replace the word in Column 16
by davidrw (Prior) on Jul 25, 2006 at 14:35 UTC
    This will run faster, and IMHO improves upon split_join() a little ..

    sub index_split_join {
        # Do a fast check to see if the line needs to be looked at.
        # The parentheses matter: without them 'STOCK' >= 0 is evaluated first.
        return $_[0] unless index($_[0], 'STOCK') >= 0;
        my @tokens = split /\|/, $_[0];   # split into columns
        $tokens[15] =~ s/STOCK/BOXXE/;    # do replacement in col 16
        return join('|', @tokens);        # glue back together for final result
    }

    For your test of 1 data line, I get:

                     Rate    splitjoin idxsplitjoin simple_regex
    splitjoin     50000/s           --         -15%         -60%
    idxsplitjoin  58824/s          18%           --         -53%
    simple_regex 125000/s         150%         112%           --

    But that test isn't representative. Presumably there are many lines to process, and only a small percentage of them contain the word 'STOCK' (which is where the index short circuit will excel). Here is a modified benchmark (the DATA is ~1000 lines, all with the same number of columns, but only a handful have STOCK in them):

    my @lines = <DATA>;
    cmpthese(10000, {
        idxsplitjoin => sub { index_split_join($_) for @lines },
        splitjoin    => sub { split_join($_) for @lines },
        simple_regex => sub { simple_regex($_) for @lines },
    });

    # RESULTS:
    Benchmark: timing 10000 iterations of idxsplitjoin, simple_regex, splitjoin...
    idxsplitjoin:   9 wallclock secs (  9.16 usr +  0.00 sys =   9.16 CPU) @ 1091.70/s (n=10000)
    simple_regex:  11 wallclock secs ( 10.77 usr +  0.00 sys =  10.77 CPU) @ 928.51/s (n=10000)
       splitjoin: 158 wallclock secs (158.15 usr +  0.00 sys = 158.15 CPU) @ 63.23/s (n=10000)

                   Rate    splitjoin simple_regex idxsplitjoin
    splitjoin    63.2/s           --         -93%         -94%
    simple_regex  929/s        1368%           --         -15%
    idxsplitjoin 1092/s        1627%          18%           --

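
A note on the index short circuit used above: index is a named operator, and a comparison such as >= binds more tightly than the comma separating its arguments. Written without parentheses, `index $str, 'STOCK' >= 0` is parsed as `index($str, ('STOCK' >= 0))`, which searches for the string "1". The sketch below illustrates the difference; the sample line and variable names are made up for the demonstration.

```perl
use strict;
use warnings;

# Sample line that contains a "1" but not the word STOCK.
my $line = "AT0000937503|142.708534|no match here\n";

# Without parentheses: 'STOCK' >= 0 is evaluated first and yields 1, so this
# is index($line, 1), i.e. the position of the first "1" in the line.
my $unparenthesized = do {
    no warnings 'numeric';    # silence the "isn't numeric" warning for 'STOCK'
    index $line, 'STOCK' >= 0;
};

# With parentheses: index() runs first, then its result is compared with 0.
my $parenthesized = index($line, 'STOCK') >= 0;

printf "unparenthesized: %d\n", $unparenthesized;                       # 13
printf "parenthesized:   %s\n", $parenthesized ? 'found' : 'not found'; # not found
```

Under `use warnings` the unparenthesized form also emits an "isn't numeric" warning for 'STOCK', which is usually the first hint that the expression is not doing what was intended.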
      Ah the perils of posting before your first cup of coffee in the morning - my original intent is exactly what you provided.

      Good catch.
