(Ovid) Re: Dot star okay, or not?
by Ovid (Cardinal) on Jul 05, 2001 at 20:52 UTC
|
You rang? ;)
The problem with the dot star in your regex is in how it's used. Since you are using minimal matching, it should be quicker than a greedy expression with all of its backtracking, but you've chosen to match to the end of the string, so you have to backtrack to find out where the spaces start, thus making this regex inefficient.
A couple of monks advocated a solution similar to the following:
$data =~ s/^\s*//;
$data =~ s/\s*$//;
That solution works and it's faster than what you have listed, but since it matches zero or more spaces, it will always do a substitution, even if there is nothing to substitute. Try changing the asterisk to a plus and it will run much faster. The proof is in the Benchmark:
use Benchmark;

sub dotstar {
    my $data = $testdata;
    $data =~ s/^\s*(.*?)\s*$/$1/;
    return $data;
}

sub first_n_last {
    my $data = $testdata;
    $data =~ s/^\s*//;
    $data =~ s/\s*$//;
    return $data;
}

sub first_n_last_must_match {
    my $data = $testdata;
    $data =~ s/^\s+//;
    $data =~ s/\s+$//;
    return $data;
}

$testdata = ' ' x 200 . "abcd" x 20 . " " x 200;

timethese( 100000,
    {
        dotstar        => '&dotstar',
        first_n_last_1 => '&first_n_last',
        first_n_last_2 => '&first_n_last_must_match'
    }
);
That produces the following results:
Benchmark: timing 100000 iterations of dotstar, first_n_last_1, first_n_last_2...
dotstar: 7 wallclock secs ( 6.91 usr + 0.02 sys = 6.93 CPU) @ 14430.01/s (n=100000)
first_n_last_1: 4 wallclock secs ( 4.21 usr + 0.00 sys = 4.21 CPU) @ 23775.56/s (n=100000)
first_n_last_2: 2 wallclock secs ( 1.30 usr + 0.00 sys = 1.30 CPU) @ 76804.92/s (n=100000)
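To make the "it will always do a substitution" point concrete, here is a minimal sketch that checks the return value of s/// on a string with nothing to trim. The sample string and the variable names ($clean, $star_hit, $plus_hit) are made up for illustration; they are not from the benchmark above.

```perl
use strict;
use warnings;

# Hypothetical sample: a string with no leading whitespace at all.
my $clean = 'abcd';
my $star  = $clean;
my $plus  = $clean;

# s/// returns true when a substitution happened.
# \s* can match zero characters, so the star version "succeeds"
# and replaces an empty string with an empty string.
my $star_hit = ( $star =~ s/^\s*// ) ? 1 : 0;

# \s+ needs at least one whitespace character, so this one fails
# and no substitution work is done.
my $plus_hit = ( $plus =~ s/^\s+// ) ? 1 : 0;

print "star substituted: $star_hit, plus substituted: $plus_hit\n";
```

Both copies still hold 'abcd' afterwards; the only difference is whether a substitution was carried out.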
Usual disclaimer: Don't forget that a general rule is not an inflexible one. The mileage you get out of various solutions may vary. Your regex is fine if you're only testing a couple of lines and aren't worried about performance. It's easy to read and I wouldn't sweat it. If, however, you're working with large data sets, you probably want the faster solutions.
Cheers,
Ovid
Vote for paco!
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
|
sub both_at_once {
    my $data = $testdata;
    $data =~ s/(^\s+|\s+$)//g;
    return $data;
}

sub both_at_once2 {
    my $data = $testdata;
    $data =~ s/(^\s*|\s*$)//g;
    return $data;
}
And this was the result:
Benchmark: timing 100000 iterations of both_at_once, both_at_once2, dotstar, first_n_last_1, first_n_last_2...
both_at_once: 10 wallclock secs ( 9.04 usr + 0.00 sys = 9.04 CPU) @ 11061.95/s (n=100000)
both_at_once2: 11 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 9615.38/s (n=100000)
dotstar: 9 wallclock secs ( 8.30 usr + 0.00 sys = 8.30 CPU) @ 12048.19/s (n=100000)
first_n_last_1: 6 wallclock secs ( 5.77 usr + 0.00 sys = 5.77 CPU) @ 17331.02/s (n=100000)
first_n_last_2: 2 wallclock secs ( 2.31 usr + 0.00 sys = 2.31 CPU) @ 43290.04/s (n=100000)
Unless I'm mistaken, the pattern alternation (^\s+|\s+$) will try to match both patterns on every character. But, does the engine not know to disregard the ^\s+ except at the beginning of the string, and likewise for \s+$, only trying to match at the end? Just curious as to why this is so slow.
|
If you really want to get a good handle on how regular expressions work, try reading "Mastering Regular Expressions" by Jeffrey Friedl. Further, you can try the re pragma to see the regex engine at work:
use strict;
use re 'debug';
my $string = 'abcdC';
print "Matched: $1\n" if $string =~ /((?<!b)[cC])/;
Try various strings and regexes and you'll begin to understand that output. The nice thing is that this will also show you some of the optimizations that the regex engine performs.
Cheers,
Ovid
|
To answer you, no, Perl doesn't optimize your regex to look only at the beginning and end of the string. Sorry.
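For anyone following along, here is a small sketch showing that the single alternation and the two separate substitutions produce the same string, so the difference discussed above is about speed only. The sample string and the names $text, $alt and $two are my own. I have not timed this on a current perl, and newer versions may optimize differently from what was measured in this thread.

```perl
use strict;
use warnings;

# Hypothetical sample string with whitespace on both ends.
my $text = "  abc  ";

# One pass: with /g the engine looks for the alternation again and
# again as it moves through the string.
( my $alt = $text ) =~ s/^\s+|\s+$//g;

# Two passes: each pattern is anchored to one end of the string.
( my $two = $text ) =~ s/^\s+//;
$two =~ s/\s+$//;

print "alternation: [$alt]\n";
print "two passes:  [$two]\n";
```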
japhy --
Perl and Regex Hacker
Re: Dot star okay, or not?
by Beatnik (Parson) on Jul 05, 2001 at 20:07 UTC
|
I'd suggest $scalar =~ s/^\s*//; $scalar =~ s/\s*$//; for that stuff... which'll get you off the hook for using dot-star. Death to Dot Star! is there for a reason, it's worth a moment of your meditation time :)
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
|
There is no need to waste time removing empty strings from
the beginning and the end. Just do what the FAQ says:
$scalar =~ s/^\s+//;
$scalar =~ s/\s+$//;
-- Abigail
Re: Dot star okay, or not?
by srawls (Friar) on Jul 05, 2001 at 20:11 UTC
|
s/^\s*//;
s/\s*$//;
That is more efficient, because you only match the whitespace, not the whole string. As to your question: of course dot star is okay. You just have to know when it is okay, and that is the tricky part for people just learning regexes.
The 15 year old, freshman programmer,
Stephen Rawls
Re: Dot star okay, or not?
by andreychek (Parson) on Jul 05, 2001 at 20:11 UTC
|
Correct? Well, probably in most cases. However, after doing a search for "strip white space", I ran across a similar post a couple years back, that proposed this solution:
# This code originally posted by faq_monk
for ($string) {
    s/^\s+//;
    s/\s+$//;
}
His thoughts were that using .* was slow, destructive, and may fail with embedded newlines. Again, his words.
I would definitely recommend digging through the archives for information on this, as a lot of people have posted on this over the years, and there are bound to be a lot of insights and clever solutions like the one faq_monk posted.
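In case the for loop in faq_monk's snippet looks odd: it is not really looping, it aliases $_ to the variable so the bare s/// calls edit it in place. A minimal sketch of that idiom follows; the sample strings and the names $first and $second are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample with leading spaces and a trailing space plus newline.
my $string = "  hello world \n";

# Inside the block $_ is an alias for $string, so each s/// modifies
# $string itself rather than a copy.
for ($string) {
    s/^\s+//;
    s/\s+$//;
}
print "[$string]\n";

# The same block can trim several variables in one go.
my ( $first, $second ) = ( "  a ", "\tb\t" );
for ( $first, $second ) {
    s/^\s+//;
    s/\s+$//;
}
print "[$first] [$second]\n";
```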
-Eric
Re: Dot star okay, or not?
by lhoward (Vicar) on Jul 05, 2001 at 20:18 UTC
|
My preferred way is to do it in 2 lines:
$string=~s/^\s+//;
$string=~s/\s+$//;
Re: Dot star okay, or not?
by japhy (Canon) on Jul 05, 2001 at 21:56 UTC
|
I dislike your approach, for a few reasons:
- the "match X and replace it with itself" approach
- slow creeping of .*?
- breaks on embedded newlines
For these reasons, it is much better (and faster) to take the two-regex approach shown to you several times already.
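A quick sketch of the embedded-newline problem from the list above, using a made-up two-line string. The variable names are mine, and the /s variant at the end is my own addition to show why the match fails; it is not something proposed in this thread.

```perl
use strict;
use warnings;

# Hypothetical sample: whitespace on both ends and a newline in the middle.
my $text = "  foo\nbar  ";

# Without /s the dot cannot cross the newline, so the whole pattern
# fails to match and the string is left untrimmed.
( my $one_pass = $text ) =~ s/^\s*(.*?)\s*$/$1/;

# The two-regex approach never uses the dot, so the newline is harmless.
( my $two_pass = $text ) =~ s/^\s+//;
$two_pass =~ s/\s+$//;

# Assumption for illustration: adding /s lets the dot match "\n",
# which makes the one-pass version work on this input.
( my $one_pass_s = $text ) =~ s/^\s*(.*?)\s*$/$1/s;

printf "one pass:      [%s]\n", $one_pass;
printf "two passes:    [%s]\n", $two_pass;
printf "one pass, /s:  [%s]\n", $one_pass_s;
```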
japhy --
Perl and Regex Hacker
Re: Dot star okay, or not?
by scain (Curate) on Jul 05, 2001 at 20:09 UTC
|
Cirollo,
I would be more careful to delimit what you want to keep, i.e.,
$string =~ s/^\s*(\S.*?\S)\s*$/$1/;
Scott
Update: OK, I agree with several other posters indicating that
$scalar =~ s/^\s*//; $scalar =~ s/\s*$//; is better and
certainly faster.
|
This just makes things unnecessarily more complicated,
scain. The original version gives the same result as
yours whenever yours matches (yours needs at least two
non-whitespace characters to match at all) and is shorter
to write - thus easier to understand.
$string =~ s/^\s*(.*?)\s*$/$1/;
Some further explanation: The starting \s* eats up all
leading whitespace (because it's greedy). Then (.*?) starts capturing,
and the first character must be \S (or the end of the string
for something matching /^\s*$/). Due to its non-greediness,
the (.*?) advances slowly one character at a time, always trying
to match the rest of the pattern (\s*$) afterwards and backtracking if not
successful. So all trailing whitespace is for sure eaten up
by the greedy \s* at the end of the pattern, leaving a \S as the last character
in the capturing brackets.
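A small sketch to check the explanation above against a few inputs, and to compare the original pattern with scain's delimited one. The helper names trim_original and trim_delimited and the sample strings are made up for this illustration. One thing it shows: the delimited pattern needs two non-whitespace characters, so it leaves very short strings untouched.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the original pattern from the question.
sub trim_original {
    my $s = shift;
    $s =~ s/^\s*(.*?)\s*$/$1/;
    return $s;
}

# Hypothetical helper wrapping scain's pattern, which insists on a
# non-whitespace character at each end of the capture.
sub trim_delimited {
    my $s = shift;
    $s =~ s/^\s*(\S.*?\S)\s*$/$1/;
    return $s;
}

# Inputs with two, one, and zero non-whitespace characters.
for my $input ( "  ab  ", " a ", "   " ) {
    printf "input [%s] original [%s] delimited [%s]\n",
        $input, trim_original($input), trim_delimited($input);
}
```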
The solution with two replaces given by many other monks
is preferable as it
-- Hofmator
Re: Dot star okay, or not?
by physi (Friar) on Jul 05, 2001 at 20:11 UTC
|
Well I think .* is ok in your case.
Maybe you can do it by:
$text=~s/(^\s*|\s*$)//g;
Then you do not need to store the 'middle thing'. This might be quicker, but I don't know this exactly. Anyway it works :)
-----------------------------------
--the good, the bad and the physi--
-----------------------------------