PerlMonks  

Re^2: Remove double bracket and singe quotes

by marioroy (Prior)
on May 02, 2016 at 19:23 UTC


in reply to Re: Remove double bracket and singe quotes
in thread Remove double bracket and singe quotes

Update: Added more tests. Thank you, wee.

Indeed, tr is fast. I compared the 3 regex statements to tr against a 724 MB string. Testing was done on a 2.6 GHz Core i7 machine with Perl v5.16.2.

use strict;
use warnings;
use Time::HiRes 'time';

my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";

$doc .= $doc for 1 .. 22;   ## expand string to 724 MB

print "length   : ", length($doc), "\n";   # 759169024

my $start = time;

# $doc =~ s/\[\[//g;             ##  8.626 secs.
# $doc =~ s/\]\]//g;
# $doc =~ s/\'//g;

# $doc =~ s/\[//g;               ## 10.493 secs.
# $doc =~ s/\]//g;
# $doc =~ s/\'//g;

# $doc =~ s/\[+//g;              ##  7.050 secs.
# $doc =~ s/\]+//g;
# $doc =~ s/\'+//g;

# $doc =~ s/(?:\[|\]|\')//g;     ## 19.559 secs.
# $doc =~ s/(?:\[|\]|\')+//g;    ## 56.150 secs.  <- did not expect this

# $doc =~ s/[\[\]\']//g;         ##  9.072 secs.
# $doc =~ s/[\[\]\']+//g;        ##  6.915 secs.

$doc =~ tr/[]'//d;               ##  1.908 secs.

printf "duration : %7.03f secs.\n", time - $start;
print "length   : ", length($doc), "\n";   # 708837376

It's unfortunate that Perl doesn't know to optimize the following automatically :(

$doc =~ s/(?:\[|\]|\')//g   -->  $doc =~ s/[\[\]\']//g
$doc =~ s/(?:\[|\]|\')+//g  -->  $doc =~ s/[\[\]\']+//g
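
Until Perl does that rewrite itself, it can be done by hand. For single-character alternatives the alternation, the character class, and tr should all delete the same characters. Here is a minimal sketch that checks this on a short sample. The helper names strip_alt, strip_class and strip_tr are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns a copy of its argument
# with every [, ] and ' removed.
sub strip_alt   { my ($s) = @_; $s =~ s/(?:\[|\]|\')//g; return $s }
sub strip_class { my ($s) = @_; $s =~ s/[\[\]\']//g;     return $s }
sub strip_tr    { my ($s) = @_; $s =~ tr/[]'//d;         return $s }

my $sample = "'C-3PO' is from the [[Star Wars]] universe.\n";

print strip_alt($sample);
print "alternation and class agree\n"
    if strip_alt($sample) eq strip_class($sample);
print "class and tr agree\n"
    if strip_class($sample) eq strip_tr($sample);
```

A check like this only shows that the outputs match on the inputs you feed it; it says nothing about speed.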

Replies are listed 'Best First'.
Re^3: Remove double bracket and singe quotes
by Marshall (Canon) on May 02, 2016 at 20:23 UTC
    Very cool on the benchmarks!

    As we see, fewer lines in Perl doesn't always mean faster execution speed. Sometimes 3 lines can beat 1 line, as demonstrated by your code. That is important and this is often missed here.

    I figure Benchmark #2 is slower than Benchmark #1 because more thinking has to go on for each encounter with a bracket. In this case, specifying an action on a double bracket is faster than an action on each single bracket.

    The fastest is tr, which I expected. This thing is "dumb", but fast.

Re^3: Remove double bracket and singe quotes
by mr_mischief (Monsignor) on May 02, 2016 at 21:49 UTC

    That's a good start, but I wouldn't really count on such a limited benchmark too heavily. Take a look at Re: Faster way to do this? for an example of how trustworthy a quick benchmark with one data set really isn't.

    The first question should be whether 8 seconds is too long in the first place. If someone's only running this once, you've already spent far more time saving 6 seconds than it's worth.

      The first question should be whether 8 seconds is too long in the first place.
      That is indeed the question, because those 8 seconds are for test data of 724 MB, that is, a processing speed of about 100 MB/second. Since the text would appear to be marked-up text from a wiki or similar, I doubt the strings would ever be longer than 30 kB. At that size, the task is done in a fraction of a millisecond.
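
The back-of-the-envelope numbers above can be checked with a few lines of Perl. This is only a sketch: it takes the string length and the 8.626-second timing from the benchmark post, assumes a 30 KiB document, and assumes run time scales linearly with input size, which ignores any fixed start-up cost.

```perl
use strict;
use warnings;

my $bytes   = 759_169_024;   # length of the benchmark string
my $seconds = 8.626;         # reported time for the first set of three regex statements

my $rate  = $bytes / $seconds;   # bytes per second
my $small = 30 * 1024;           # assumed size of a realistic document (30 KiB)

printf "throughput      : %.1f MiB/s\n", $rate / (1024 * 1024);
printf "time for 30 KiB : %.3f ms\n",    1000 * $small / $rate;
```

The measured rate comes out somewhat under the round figure of 100 MB/second, and the estimate for a 30 KiB input is still well below a millisecond.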

      Nice article. Thank you for sharing.
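
One way to act on the caution about single-data-set benchmarks is to run the same candidates over more than one kind of input with the core Benchmark module. The sketch below is an illustration, not the code anyone in the thread ran: the data sets, their sizes, and the names in %strip and %data are all assumptions. Each candidate works on a copy of the input so that every iteration sees unmodified text; the copy adds the same cost to all three, which dampens the relative differences.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

# Candidates under test (names are hypothetical). Each takes a string
# and returns a stripped copy, leaving the caller's data untouched.
my %strip = (
    three_regex => sub { my $s = shift; $s =~ s/\[\[//g; $s =~ s/\]\]//g; $s =~ s/\'//g; $s },
    char_class  => sub { my $s = shift; $s =~ s/[\[\]\']+//g; $s },
    tr_delete   => sub { my $s = shift; $s =~ tr/[]'//d; $s },
);

# Assumed data sets: one with wiki-style markup, one with nothing to remove.
my %data = (
    wiki_like => "'C-3PO' is from the [[Star Wars]] universe.\n" x 2_000,
    no_markup => "plain text without any brackets or quotes\n"   x 2_000,
);

for my $name (sort keys %data) {
    my $input = $data{$name};
    my %cases = map { my $f = $strip{$_}; ($_ => sub { $f->($input) }) } keys %strip;

    print "\n== $name (", length($input), " bytes) ==\n";
    # A count of -1 asks Benchmark to run each case for at least 1 CPU second.
    cmpthese(-1, \%cases);
}
```

If the ranking changes between data sets, that is a sign the single 724 MB test does not tell the whole story.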

Re^3: Remove double bracket and singe quotes
by SimonPratt (Friar) on May 03, 2016 at 09:54 UTC

    It's unfortunate that Perl doesn't know to optimize the following automatically :(
    $doc =~ s/(?:\[|\]|\')//g   -->  $doc =~ s/[\[\]\']//g
    $doc =~ s/(?:\[|\]|\')+//g  -->  $doc =~ s/[\[\]\']+//g

    Maybe unfortunate in your specific example, but really not surprising at all. The statements (whilst the same in this very specific instance) are not actually the same at all. Consider:

    use 5.16.2;
    use warnings;

    my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";
    $doc =~ s/(?:\[\[|\]\]|'')//g;   # Only replace doubles, to make sequence longer than a single character
    say $doc;

    # Reset to the original text before the second test
    $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";
    $doc =~ s/[\[\]']//g;            # set based replacement
    say $doc;

    In English, the first match could be described as "match sequence x or sequence y or sequence z, and do not capture the group". The second match could be described as "match any character in set a".

    Sometimes there is a trade-off in just how much work the optimiser will do. Obvious conversions due to simple style differences are easy and cheap. Less obvious conversions like this, where in some edge cases it is faster to optimise, I suspect you'll find the decision is to leave the optimisation up to the developer more often than not.
