Dot star okay, or not?

Cirollo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(Ovid) Re: Dot star okay, or not? by Ovid (Cardinal) on Jul 05, 2001 at 20:52 UTC
You rang? ;) The problem with the dot star in your regex lies in how it's used. Since you are using minimal matching, it should be quicker than a greedy expression with all of its backtracking, but you've chosen to match to the end of the string, so the engine has to backtrack to find out where the trailing spaces start, which makes this regex inefficient. A couple of monks advocated a solution similar to the following: `$data =~ s/^\s*//; $data =~ s/\s*$//;` That solution works and it's faster than what you have listed, but since it matches zero or more spaces, it will always do a substitution, even if there is nothing to substitute. Try changing the asterisk to a plus and it will run much faster. The proof is in the Benchmark: `use Benchmark; sub dotstar { my $data = $testdata; $data =~ s/^\s*(.*?)\s*$/$1/; return $data; } sub first_n_last { my $data = $testdata; $data =~ s/^\s*//; $data =~ s/\s*$//; return $data; } sub first_n_last_must_match { my $data = $testdata; $data =~ s/^\s+//; $data =~ s/\s+$//; return $data; } $testdata = ' ' x 200 . "abcd" x 20 . " " x 200; timethese( 100000, { dotstar => '&dotstar', first_n_last_1 => '&first_n_last', first_n_last_2 => '&first_n_last_must_match' } );` That produces the following results: `Benchmark: timing 100000 iterations of dotstar, first_n_last_1, first_n_last_2... dotstar: 7 wallclock secs ( 6.91 usr + 0.02 sys = 6.93 CPU) @ 14430.01/s (n=100000) first_n_last_1: 4 wallclock secs ( 4.21 usr + 0.00 sys = 4.21 CPU) @ 23775.56/s (n=100000) first_n_last_2: 2 wallclock secs ( 1.30 usr + 0.00 sys = 1.30 CPU) @ 76804.92/s (n=100000)` Usual disclaimer: Don't forget that a general rule is not an inflexible one. The mileage you get out of various solutions may vary. Your regex is fine if you're only testing a couple of lines and aren't worried about performance. It's easy to read and I wouldn't sweat it. If, however, you're working with large data sets, you probably want the faster solutions.
Cheers, Ovid
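A minimal sketch of the "always substitutes" point, relying only on the fact that `s///` returns the number of substitutions it made. The variable names are illustrative and not part of the benchmark above.

```perl
use strict;
use warnings;

my $clean = 'abcd';    # hypothetical input with nothing to trim

# With \s*, the pattern can match the empty string, so each s/// "succeeds".
my $star_hits = 0;
{
    my $copy = $clean;
    $star_hits++ if $copy =~ s/^\s*//;
    $star_hits++ if $copy =~ s/\s*$//;
}

# With \s+, at least one whitespace character is required, so nothing matches.
my $plus_hits = 0;
{
    my $copy = $clean;
    $plus_hits++ if $copy =~ s/^\s+//;
    $plus_hits++ if $copy =~ s/\s+$//;
}

print "star: $star_hits, plus: $plus_hits\n";    # star: 2, plus: 0
```

On already-clean data the `\s*` pair still performs two (empty) substitutions, while the `\s+` pair performs none, which is consistent with the speed difference reported in the benchmark.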
Re(2): Dot star okay, or not? by Cirollo (Friar) on Jul 05, 2001 at 21:45 UTC
Can you tell me why physi's code is slower than the rest? He suggested this: `$data =~ s/(^\s*|\s*$)//g;` I added these two subs to your benchmark: `sub both_at_once { my $data = $testdata; $data =~ s/(^\s+|\s+$)//g; return $data; } sub both_at_once2 { my $data = $testdata; $data =~ s/(^\s*|\s*$)//g; return $data; }` And this was the result: `Benchmark: timing 100000 iterations of both_at_once, both_at_once2, dotstar, first_n_last_1, first_n_last_2... both_at_once: 10 wallclock secs ( 9.04 usr + 0.00 sys = 9.04 CPU) @ 11061.95/s (n=100000) both_at_once2: 11 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 9615.38/s (n=100000) dotstar: 9 wallclock secs ( 8.30 usr + 0.00 sys = 8.30 CPU) @ 12048.19/s (n=100000) first_n_last_1: 6 wallclock secs ( 5.77 usr + 0.00 sys = 5.77 CPU) @ 17331.02/s (n=100000) first_n_last_2: 2 wallclock secs ( 2.31 usr + 0.00 sys = 2.31 CPU) @ 43290.04/s (n=100000)` Unless I'm mistaken, the pattern alternation `(^\s+|\s+$)` will try to match both patterns at every character. But does the engine not know to disregard the `^\s+` except at the beginning of the string, and likewise to try `\s+$` only at the end? Just curious as to why this is so slow.
(Ovid) Re(3): Dot star okay, or not? by Ovid (Cardinal) on Jul 05, 2001 at 22:16 UTC
If you really want to get a good handle on how regular expressions work, try reading "Mastering Regular Expressions" by Jeffrey Friedl. Further, you can try the re pragma to see the regex engine at work: `use strict; use re 'debug'; my $string = 'abcdC'; print "Matched: $1\n" if $string =~ /((?<!b)[cC])/;` Try various strings and regexes and you'll begin to understand that output. The nice thing is that this will also show you some of the optimizations that the regex engine performs. Cheers, Ovid
Re: Re(2): Dot star okay, or not? by japhy (Canon) on Jul 05, 2001 at 22:00 UTC
To answer you, no, Perl doesn't optimize your regex to look only at the beginning and end of the string. Sorry. `japhy` -- Perl and Regex Hacker
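For reference, a small sketch showing that the single alternation and the two anchored substitutions produce the same trimmed string, so the difference discussed in this sub-thread is about speed, not correctness. The test string and variable names are made up for illustration; timing is left to Benchmark.

```perl
use strict;
use warnings;

my $padded = ' ' x 3 . 'abcd' . ' ' x 3;    # hypothetical sample input

# One pass with alternation; with /g, s/// returns the number of replacements.
my $alt       = $padded;
my $alt_count = ( $alt =~ s/(^\s+|\s+$)//g );

# Two separate anchored substitutions.
my $two = $padded;
$two =~ s/^\s+//;
$two =~ s/\s+$//;

print "same result\n" if $alt eq $two;
print "substitutions: $alt_count\n";    # substitutions: 2
```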
Re: Dot star okay, or not? by Beatnik (Parson) on Jul 05, 2001 at 20:07 UTC
I'd suggest `$scalar =~ s/^\s*//; $scalar =~ s/\s*$//;` for that stuff... which'll get you off the hook for using dot-star. Death to Dot Star! is there for a reason; it's worth a moment of your meditation time :) Greetz Beatnik ... Quidquid perl dictum sit, altum viditur.
Re: Dot star okay, or not? by Abigail (Deacon) on Jul 06, 2001 at 03:02 UTC
There is no need to waste time removing empty strings from the beginning and the end. Just do what the FAQ says: `$scalar =~ s/^\s+//; $scalar =~ s/\s+$//;` -- Abigail
Re: Dot star okay, or not? by srawls (Friar) on Jul 05, 2001 at 20:11 UTC
Try: `s/^\s*//; s/\s*$//;` That is more efficient, because you only match the whitespace, not the whole string. And as to your question, of course dot star is okay; you just have to know when it is okay, and that is the tricky part for people just learning regexes. The 15 year old, freshman programmer, Stephen Rawls
Re: Dot star okay, or not? by andreychek (Parson) on Jul 05, 2001 at 20:11 UTC
Correct? Well, probably in most cases. However, after doing a search for "strip white space", I ran across a similar post from a couple of years back that proposed this solution: `# This code originally posted by faq_monk` `for ($string) { s/^\s+//; s/\s+$//; }` His thoughts were that using `.*` was slow, destructive, and may fail with embedded newlines. Again, his words. I would definitely recommend digging through the archives for information on this, as a lot of people have posted on this over the years, and there are bound to be a lot of insights and clever solutions like the one faq_monk posted. -Eric
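The `for ($string) { ... }` idiom works because `for` aliases `$_` to each item in its list, so the bare `s///` calls edit the original variable in place. A short sketch of the same idiom applied to several variables at once (the variable names and values are placeholders):

```perl
use strict;
use warnings;

my ( $first, $second ) = ( "  first \t", "\n second  " );

# $_ is an alias for each variable in turn, so both are trimmed in place.
for ( $first, $second ) {
    s/^\s+//;
    s/\s+$//;
}

print "[$first] [$second]\n";    # [first] [second]
```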
Re: Dot star okay, or not? by lhoward (Vicar) on Jul 05, 2001 at 20:18 UTC
My preferred way is to do it in 2 lines: `$string=~s/^\s+//; $string=~s/\s+$//;`
Re: Dot star okay, or not? by japhy (Canon) on Jul 05, 2001 at 21:56 UTC
I dislike your approach, for a few reasons: the "match X and replace it with itself" approach; the slow creeping of `.*?`; and the fact that it breaks on embedded newlines. For these reasons, it is much better (and faster) to take the two-regex approach shown to you several times already. `japhy` -- Perl and Regex Hacker
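The embedded-newline problem can be reproduced with a short sketch. The sample string is invented for illustration; the `/s` variant at the end is an assumption about one possible fix and was not proposed in this thread.

```perl
use strict;
use warnings;

my $text = "  foo\nbar  ";    # hypothetical input with a newline in the middle

# Single-regex approach: "." does not match "\n" by default, so (.*?) cannot
# cross the newline, the pattern never reaches the end, and nothing changes.
my $one_regex = $text;
$one_regex =~ s/^\s*(.*?)\s*$/$1/;

# Two-regex approach: only the leading and trailing whitespace is examined.
my $two_regex = $text;
$two_regex =~ s/^\s+//;
$two_regex =~ s/\s+$//;

# Assumed alternative: /s lets "." match newlines as well.
my $with_s = $text;
$with_s =~ s/^\s*(.*?)\s*$/$1/s;

print "single regex left the string unchanged\n" if $one_regex eq $text;
print "two regexes trimmed it\n"                 if $two_regex eq "foo\nbar";
print "/s variant trimmed it\n"                  if $with_s eq "foo\nbar";
```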
Re: Dot star okay, or not? by scain (Curate) on Jul 05, 2001 at 20:09 UTC
Cirollo, I would be more careful to delimit what you want to keep, i.e., `$string =~ s/^\s*(\S.*?\S)\s*$/$1/;` Scott Update: OK, I agree with several other posters indicating that `$scalar =~ s/^\s*//; $scalar =~ s/\s*$//;` is better and certainly faster.
Re: Re: Dot star okay, or not? by Hofmator (Curate) on Jul 05, 2001 at 20:40 UTC
This just makes things unnecessarily more complicated, scain. The original version is absolutely equivalent to yours and shorter to write, and thus easier to understand. `$string =~ s/^\s*(.*?)\s*$/$1/;` Some further explanation: The starting `\s*` eats up all whitespace (because it's greedy). Then `(.*?)` starts capturing and the first character must be `\S` (or the end of the string for something matching `/^\s*$/`). Due to its non-greediness the `(.*?)` advances slowly, one character at a time, always trying to match the rest of the pattern (`\s*$`) afterwards and backtracking if not successful. So all trailing whitespace is for sure eaten up by the greedy `\s*` at the end of the pattern, leaving a `\S` as the last character in the capturing brackets. The solution with two replaces given by many other monks is preferable as it: is quicker; doesn't get caught on embedded newlines (as `.` by default matches everything but a newline), if you only want to remove space at the beginning and end of the string; and is easily adaptable to remove all leading and trailing whitespace on a slurped file: `$file =~ s/^\s+//mg; $file =~ s/\s+$//mg;` -- Hofmator
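A small sketch of the slurped-file variant, using an in-memory string in place of a real file (the contents are made up). One caveat worth noting: because `\s` also matches newlines, these two substitutions can swallow blank lines and the final newline as well as per-line padding.

```perl
use strict;
use warnings;

# Stand-in for a slurped file; two padded lines.
my $file = "  alpha  \n\tbeta\t\n";

# /m makes ^ and $ match at every line boundary, /g repeats for each line.
$file =~ s/^\s+//mg;
$file =~ s/\s+$//mg;

my @lines = split /\n/, $file;
print join( '|', @lines ), "\n";    # alpha|beta
```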
Re: Dot star okay, or not? by physi (Friar) on Jul 05, 2001 at 20:11 UTC
Well I think `.*` is ok in your case. Maybe you can do it with: `$text=~s/(^\s*|\s*$)//g;` Then you do not need to store the 'middle thing'. This might be quicker, but I don't know for certain. Anyway it works :) `----------------------------------- --the good, the bad and the physi-- -----------------------------------`

