Dot star okay, or not?

by Cirollo (Friar)
on Jul 05, 2001 at 19:59 UTC ( [id://94141]=perlquestion )

Cirollo has asked for the wisdom of the Perl Monks concerning the following question:

I've been using this to strip leading and trailing whitespace from a string:
$string =~ s/^\s*(.*?)\s*$/$1/;
But, nodes like Death to Dot Star make me a little wary of using the (.*?)

Is this solution correct? Or is there a better way? (of course... there's always a better way :)

Replies are listed 'Best First'.
(Ovid) Re: Dot star okay, or not?
by Ovid (Cardinal) on Jul 05, 2001 at 20:52 UTC

    You rang? ;)

    The problem with the dot star in your regex is in how it's used. Since you are using minimal matching, it should be quicker than a greedy expression with all of its backtracking, but you've chosen to match to the end of the string, so you have to backtrack to find out where the spaces start, thus making this regex inefficient.

    A couple of monks advocated a solution similar to the following:

    $data =~ s/^\s*//; $data =~ s/\s*$//;

    That solution works and it's faster than what you have listed, but since it matches zero or more spaces, it will always do a substitution, even if there is nothing to substitute. Try changing the asterisk to a plus and it will run much faster. The proof is in the Benchmark:

    use Benchmark;

    sub dotstar {
        my $data = $testdata;
        $data =~ s/^\s*(.*?)\s*$/$1/;
        return $data;
    }

    sub first_n_last {
        my $data = $testdata;
        $data =~ s/^\s*//;
        $data =~ s/\s*$//;
        return $data;
    }

    sub first_n_last_must_match {
        my $data = $testdata;
        $data =~ s/^\s+//;
        $data =~ s/\s+$//;
        return $data;
    }

    $testdata = ' ' x 200 . "abcd" x 20 . " " x 200;

    timethese( 100000, {
        dotstar        => '&dotstar',
        first_n_last_1 => '&first_n_last',
        first_n_last_2 => '&first_n_last_must_match'
    } );

    That produces the following results:

    Benchmark: timing 100000 iterations of dotstar, first_n_last_1, first_n_last_2...
           dotstar:  7 wallclock secs ( 6.91 usr +  0.02 sys =  6.93 CPU) @ 14430.01/s (n=100000)
    first_n_last_1:  4 wallclock secs ( 4.21 usr +  0.00 sys =  4.21 CPU) @ 23775.56/s (n=100000)
    first_n_last_2:  2 wallclock secs ( 1.30 usr +  0.00 sys =  1.30 CPU) @ 76804.92/s (n=100000)

    Usual disclaimer: Don't forget that a general rule is not an inflexible one. The mileage you get out of various solutions may vary. Your regex is fine if you're only testing a couple of lines and aren't worried about performance. It's easy to read and I wouldn't sweat it. If, however, you're working with large data sets, you probably want the faster solutions.
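The "always does a substitution" point can be checked directly, since s/// returns the number of substitutions it made. The following is only a small sketch; the helper names count_star and count_plus are made up for illustration and are not part of the benchmark above.

```perl
use strict;
use warnings;

# Hypothetical helper: how many substitutions does s/^\s*// perform?
sub count_star {
    my ($s) = @_;
    my $n = ( $s =~ s/^\s*// );    # \s* can match the empty string
    return $n || 0;
}

# Hypothetical helper: how many substitutions does s/^\s+// perform?
sub count_plus {
    my ($s) = @_;
    my $n = ( $s =~ s/^\s+// );    # \s+ needs at least one whitespace character
    return $n || 0;
}

for my $input ( 'abc', '  abc' ) {
    printf "%-8s star=%d plus=%d\n", "'$input'", count_star($input), count_plus($input);
}
# 'abc'    star=1 plus=0
# '  abc'  star=1 plus=1
```

With no leading whitespace, the \s* form still reports one substitution (of an empty match), while the \s+ form fails quickly and leaves the string alone.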

    Cheers,
    Ovid

    Vote for paco!

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      Can you tell me why physi's code is slower than the rest? He suggested this:

      $data =~ s/(^\s*|\s*$)//g;

      I added these two subs to your benchmark:

      sub both_at_once {
          my $data = $testdata;
          $data =~ s/(^\s+|\s+$)//g;
          return $data;
      }

      sub both_at_once2 {
          my $data = $testdata;
          $data =~ s/(^\s*|\s*$)//g;
          return $data;
      }

      And this was the result:

      Benchmark: timing 100000 iterations of both_at_once, both_at_once2, dotstar, first_n_last_1, first_n_last_2...
        both_at_once: 10 wallclock secs ( 9.04 usr +  0.00 sys =  9.04 CPU) @ 11061.95/s (n=100000)
       both_at_once2: 11 wallclock secs (10.40 usr +  0.00 sys = 10.40 CPU) @  9615.38/s (n=100000)
             dotstar:  9 wallclock secs ( 8.30 usr +  0.00 sys =  8.30 CPU) @ 12048.19/s (n=100000)
      first_n_last_1:  6 wallclock secs ( 5.77 usr +  0.00 sys =  5.77 CPU) @ 17331.02/s (n=100000)
      first_n_last_2:  2 wallclock secs ( 2.31 usr +  0.00 sys =  2.31 CPU) @ 43290.04/s (n=100000)

      Unless I'm mistaken, the pattern alternation (^\s+|\s+$) will try to match both patterns on every character. But, does the engine not know to disregard the ^\s+ except at the beginning of the string, and likewise for \s+$, only trying to match at the end? Just curious as to why this is so slow.

        If you really want to get a good handle on how regular expressions work, try reading "Mastering Regular Expressions" by Jeffrey Friedl. Further, you can try the re pragma to see the regex engine at work:

        use strict;
        use re 'debug';

        my $string = 'abcdC';
        print "Matched: $1\n" if $string =~ /((?<!b)[cC])/;

        Try various strings and regexes and you'll begin to understand that output. The nice thing is that this will also show you some of the optimizations that the regex engine performs.
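The same pragma can be pointed at the alternation being asked about. This is a rough sketch rather than anything from the thread: the sub name trim_alternation is invented here, and the trace (which goes to STDERR) varies in format between Perl versions, so no sample trace is shown.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the single-regex alternation from the question.
sub trim_alternation {
    my ($data) = @_;
    {
        # 'use re "debug"' is lexically scoped, so only the regex
        # inside this block is traced on STDERR.
        use re 'debug';
        $data =~ s/(^\s+|\s+$)//g;
    }
    return $data;
}

print "[", trim_alternation("  abcd  "), "]\n";    # prints [abcd]
```

Running it on a longer string and comparing the trace against the two separate substitutions is one way to see where the engine spends its attempts.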

        Cheers,
        Ovid


        To answer you, no, Perl doesn't optimize your regex to look only at the beginning and end of the string. Sorry.

        japhy -- Perl and Regex Hacker
Re: Dot star okay, or not?
by Beatnik (Parson) on Jul 05, 2001 at 20:07 UTC
    I'd suggest $scalar =~ s/^\s*//; $scalar =~ s/\s*$//; for that stuff... which'll get you off the hook for using dot-star. Death to Dot Star! is there for a reason, it's worth a moment of your meditation time :)

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
      There is no need to waste time removing empty strings from the beginning and the end. Just do what the FAQ says:
      $scalar =~ s/^\s+//; $scalar =~ s/\s+$//;

      -- Abigail

Re: Dot star okay, or not?
by srawls (Friar) on Jul 05, 2001 at 20:11 UTC
    Try:
    s/^\s*//; s/\s*$//;
    That is more efficient, because you only match the whitespace, not the whole string. And as to your question, of course dot star is okay, you just have to know when it is okay, that is the tricky part for people just learning regexes.

    The 15 year old, freshman programmer,
    Stephen Rawls

Re: Dot star okay, or not?
by andreychek (Parson) on Jul 05, 2001 at 20:11 UTC
    Correct? Well, probably in most cases. However, after doing a search for "strip white space", I ran across a similar post a couple years back, that proposed this solution:
    # This code originally posted by faq_monk
    for ($string) {
        s/^\s+//;
        s/\s+$//;
    }
    His thoughts were that using .* was slow, destructive, and may fail with embedded newlines. Again, his words.
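The for ($string) form works because for aliases $_ to each item in its list, so the two substitutions edit the original variable in place. A small sketch of the same idea applied to several variables at once follows; trim_in_place is a made-up name, not something from the FAQ.

```perl
use strict;
use warnings;

# Hypothetical helper: trims each argument in place.
# Elements of @_ alias the caller's variables, and for aliases $_ to each one,
# so s/// modifies the caller's strings directly.
sub trim_in_place {
    for (@_) {
        s/^\s+//;
        s/\s+$//;
    }
    return;
}

my ( $first, $second ) = ( "   leading and trailing   ", "\tjust a tab\n" );
trim_in_place( $first, $second );
print "[$first] [$second]\n";    # prints [leading and trailing] [just a tab]
```

Because the arguments are modified in place, passing a string literal instead of a variable would die with a "Modification of a read-only value" error.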

    I would definitely recommend digging through the archives for information on this, as a lot of people have posted on this over the years, and there are bound to be a lot of insights and clever solutions like the one faq_monk posted.
    -Eric
Re: Dot star okay, or not?
by lhoward (Vicar) on Jul 05, 2001 at 20:18 UTC
    My preferred way is to do it in 2 lines:
    $string=~s/^\s+//; $string=~s/\s+$//;
Re: Dot star okay, or not?
by japhy (Canon) on Jul 05, 2001 at 21:56 UTC
    I dislike your approach, for a few reasons:
    • the "match X and replace it with itself" approach
    • slow creeping of .*?
    • breaks on embedded newlines
    For these reasons, it is much better (and faster) to take the two-regex approach shown to you several times already.
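The embedded-newline problem is easy to reproduce. The sketch below uses two invented helper names (trim_dotstar and trim_two_step) to compare the original regex with the two-regex approach on a string that has a newline in the middle.

```perl
use strict;
use warnings;

# The single-regex version from the original question.
sub trim_dotstar {
    my ($s) = @_;
    $s =~ s/^\s*(.*?)\s*$/$1/;    # . does not match "\n" without /s
    return $s;
}

# The two-regex version recommended in the replies.
sub trim_two_step {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

my $input = "  foo\nbar  ";
printf "dotstar changed the string: %s\n",
    trim_dotstar($input) eq $input ? "no" : "yes";        # prints no
printf "two-step result is trimmed: %s\n",
    trim_two_step($input) eq "foo\nbar" ? "yes" : "no";   # prints yes
```

With the embedded newline, (.*?) cannot reach the end of the string, so the whole match fails and the string comes back untouched.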

    japhy -- Perl and Regex Hacker
Re: Dot star okay, or not?
by scain (Curate) on Jul 05, 2001 at 20:09 UTC
    Cirollo,

    I would be more careful to delimit what you want to keep, i.e., $string =~ s/^\s*(\S.*?\S)\s*$/$1/;

    Scott

    Update: OK, I agree with several other posters indicating that $scalar =~ s/^\s*//; $scalar =~ s/\s*$//; is better and certainly faster.

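      This just makes things unnecessarily more complicated, scain. For any string with at least two non-whitespace characters the original version is equivalent to yours, and it is shorter to write and thus easier to understand (with fewer than two, yours does not match at all): $string =~ s/^\s*(.*?)\s*$/$1/;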

      Some further explanation: The starting \s* eats up all leading whitespace (because it's greedy). Then (.*?) starts capturing and the first character must be \S (or the end of the string for something matching /^\s*$/). Due to its non-greediness the (.*?) advances slowly one character at a time, always trying to match the rest of the pattern (\s*$) afterwards, and backtracks if not successful. So all trailing whitespace is sure to be eaten up by the greedy \s* at the end of the pattern, leaving a \S as the last character in the capturing brackets.
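One edge case is worth checking when comparing the two single-regex forms: (\S.*?\S) needs two non-whitespace characters to match, while (.*?) is happy with one or none. A minimal sketch, with the helper names trim_delimited and trim_original invented for the comparison:

```perl
use strict;
use warnings;

# scain's variant: the capture must start and end with a non-whitespace character.
sub trim_delimited {
    my ($s) = @_;
    $s =~ s/^\s*(\S.*?\S)\s*$/$1/;
    return $s;
}

# The regex from the original question.
sub trim_original {
    my ($s) = @_;
    $s =~ s/^\s*(.*?)\s*$/$1/;
    return $s;
}

for my $input ( "  ab  ", "  a  " ) {
    printf "input=[%s] delimited=[%s] original=[%s]\n",
        $input, trim_delimited($input), trim_original($input);
}
# input=[  ab  ] delimited=[ab] original=[ab]
# input=[  a  ] delimited=[  a  ] original=[a]
```

On the single-character input the delimited form fails to match, so the string is returned with its whitespace intact.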

      The solution with two replaces given by many other monks is preferable as it

      • is quicker
      • doesn't get caught on embedded newlines (as . matches by default everything but a newline) - if you only want to remove space at the beginning and end of the string
      • is easily adaptable to remove all leading and trailing whitespaces on a slurped file:
        $file =~ s/^\s+//mg; $file =~ s/\s+$//mg;
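A small sketch of the /mg form on a multi-line string, assuming the file has already been slurped into a scalar (trim_lines is a made-up name). One thing it shows is that \s also matches newlines, so blank lines and the final newline disappear along with the spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace on every line.
sub trim_lines {
    my ($file) = @_;
    $file =~ s/^\s+//mg;    # /m lets ^ match at the start of each line
    $file =~ s/\s+$//mg;    # /m lets $ match at the end of each line
    return $file;
}

my $text = "  a  \n\n  b  \n";

# Make the newlines visible before printing.
( my $shown = trim_lines($text) ) =~ s/\n/\\n/g;
print "$shown\n";    # prints a\nb
```

If blank lines or the trailing newline need to survive, a character class without the newline, such as [ \t]+, may be a better fit than \s+.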

      -- Hofmator

Re: Dot star okay, or not?
by physi (Friar) on Jul 05, 2001 at 20:11 UTC
    Well I think .* is ok in your case.
    Maybe you can do it by:
    $text=~s/(^\s*|\s*$)//g;
    Then you do not need to store the 'middle thing'. This might be quicker, but I don't know this exactly. Anyway it works :)
    ----------------------------------- --the good, the bad and the physi-- -----------------------------------
