simpler regex

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Re: simpler regex by Corion (Patriarch) on May 10, 2007 at 07:59 UTC
That's because a regular expression will only ever match once for every character position in a string, and no character that has been part of a previous match will be part of the next match. Let's use a different example to make talking easier:

    "X Y Jones"

The space between X and Y does double duty, once as "end marker" of X and once as "start marker" of Y, but as it has already been used up as end marker, it won't be looked at again for the next start marker. I see two possible ways forward: either use lookahead to check for a space without consuming it, like `/(?=\s|$)/`, or use the `\b` word boundary marker, which introduces other problems though:

    s/\b([A-Z])\b/$1./g

will also do replacements for `O'Reilly`, `A-J` or other stuff. So, depending on your input data, that may be unwanted.
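To make the point above concrete, here is a minimal sketch. The original poster's pattern is not quoted in this reply, so the first substitution is only a guess at a pattern that consumes the trailing space; all three sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical "consuming" version: the trailing (\s|$) eats the space,
# so the next match cannot use it as its leading separator.
sub dots_consuming {
    my ($name) = @_;
    $name =~ s/(^|\s)([A-Z])(\s|$)/$1$2.$3/g;
    return $name;
}

# Lookahead version: (?=\s|$) only checks for the space and leaves it
# available as the leading separator of the next match.
sub dots_lookahead {
    my ($name) = @_;
    $name =~ s/(^|\s)([A-Z])(?=\s|$)/$1$2./g;
    return $name;
}

# Word-boundary version, as suggested in the reply above.
sub dots_boundary {
    my ($name) = @_;
    $name =~ s/\b([A-Z])\b/$1./g;
    return $name;
}

print dots_consuming('X Y Jones'), "\n";   # X. Y Jones
print dots_lookahead('X Y Jones'), "\n";   # X. Y. Jones
print dots_boundary("O'Reilly"), "\n";     # O.'Reilly
print dots_boundary('A-J'), "\n";          # A.-J.
```

The last two lines show the `\b` side effect mentioned above: an apostrophe or hyphen counts as a word boundary, so the single letter next to it gets a dot as well.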
Re: simpler regex by BrowserUk (Patriarch) on May 10, 2007 at 08:05 UTC
Not well tested, but

    $_ = 'A A Jones'; s[(?<=[A-Z])(?=\s)][.]g; print;;
    A. A. Jones

    $_ = 'Bob J Smith'; s[(?<=[A-Z])(?=\s)][.]g; print;;
    Bob J. Smith

    $_ = 'Dr P J van Houten'; s[(?<=[A-Z])(?=\s)][.]g; print;;
    Dr P. J. van Houten

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: simpler regex by johngg (Canon) on May 10, 2007 at 09:01 UTC
I think that's going to start doing the wrong thing if Dr van Houten starts putting letters after his name; the first set is OK, but as you start adding more you get unwanted dots.

    $ perl -le '$_ = q{Dr P J van Houten MD};
    > s[(?<=[A-Z])(?=\s)][.]g;
    > print;'
    Dr P. J. van Houten MD
    $ perl -le '$_ = q{Dr P J van Houten MD FRCS};
    > s[(?<=[A-Z])(?=\s)][.]g;
    > print;'
    Dr P. J. van Houten MD. FRCS
    $

A possible solution is to use an alternation of two look-behinds.

    $ perl -le '$_ = q{Dr P J van Houten MD FRCS};
    > s{(?:(?<=\A[A-Z])|(?<=\s[A-Z]))(?=\s)}{.}g;
    > print;'
    Dr P. J. van Houten MD FRCS
    $

Cheers, JohnGG
Re: simpler regex by borisz (Canon) on May 10, 2007 at 07:59 UTC
    1 while $name =~ s/(^| )(.)( |$)/$1$2\.$3/;

Boris
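A short sketch of why the `1 while` loop seems to be there: the substitution has no `/g`, so each pass starts matching from the beginning of the string again, and a space consumed by one pass is available to the next. The helper name and the trace output are illustrative only.

```perl
use strict;
use warnings;

# Apply the substitution one pass at a time and record every intermediate
# result. Without /g each pass restarts from the start of the string.
sub dot_passes {
    my ($name) = @_;
    my @steps = ($name);
    while ( $name =~ s/(^| )(.)( |$)/$1$2.$3/ ) {
        push @steps, $name;
    }
    return @steps;
}

print "$_\n" for dot_passes('X Y Jones');
# X Y Jones
# X. Y Jones
# X. Y. Jones
```

Note that `(.)` accepts any single character, so a lone `&` or digit between spaces would get a dot too.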
Re: simpler regex by scorpio17 (Canon) on May 10, 2007 at 13:42 UTC
Since there's always more than one way to do it, you could consider operating on each part of the name before doing the join:

    sub fullname {
        my (@parts) = @_;
        for my $p (@parts) {
            $p = length($p) > 1 ? $p :          # not a single - don't change
                 $p =~ /[a-z]/i ? $p . '.' :    # single letter - add a dot
                 $p;                            # not a letter - don't change
        }
        my $name = join(" ", @parts);
        return $name;
    }

Breaking it up like this might be preferable if the problem were more complex, because really long regexes can be difficult to maintain. For example, if you decided to add a dollar sign in front of single digits, you could just add this line (after the add a dot line):

    $p =~ /\d/ ? '$' . $p : # single digit - add a $

This seems easier (to me) than rewriting the regex, and less likely to introduce subtle bugs due to differences in regex behavior. Unless you're a regex master, I think it's best to keep them as simple as possible. And if you ARE a regex master, but have to work on a team where other people are NOT, then it's STILL best to keep them as simple as possible!
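For reference, a sketch of how the sub might look with the digit line slotted in after the add-a-dot line. The name `fullname_with_digits` and the sample arguments are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of fullname() with the extra digit rule added
# after the "add a dot" line.
sub fullname_with_digits {
    my (@parts) = @_;
    for my $p (@parts) {
        $p = length($p) > 1 ? $p :          # not a single - don't change
             $p =~ /[a-z]/i ? $p . '.' :    # single letter - add a dot
             $p =~ /\d/     ? '$' . $p :    # single digit - add a $
             $p;                            # anything else - don't change
    }
    return join(" ", @parts);
}

print fullname_with_digits('P', 'J', '5', 'Jones'), "\n";   # P. J. $5 Jones
```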
Re^2: simpler regex by chrism01 (Friar) on May 11, 2007 at 01:04 UTC
Following on, the CURRENT team may all be masters, but people move on. You can guarantee (99.99% of the time) that sooner or later a non-master will join ... and they'll curse your name ;-)
Re: simpler regex by RL (Monk) on May 10, 2007 at 09:20 UTC
    $name =~ s/\b([a-z])\b/$1\./gi;

Hope this helps
RL

update: Sorry, missed Corion's answer
Re: simpler regex by graff (Chancellor) on May 11, 2007 at 07:12 UTC
You didn't happen to mention... is it the case that the list of input parameters for your "fullname" function happens to be the space-separated tokens that make up the full name? If that's what is being passed to the function, then it would be much simpler to add periods as needed before joining the parts together:

    sub fullname {
        my @parts = @_;
        for ( @parts ) {
            $_ .= '.' if ( /^[A-Z]$/ );
        }
        return join " ", @parts;
    }
Re^2: simpler regex by rsiedl (Friar) on May 16, 2007 at 03:36 UTC
That would be ideal, but I can't be guaranteed the user will input the data correctly, i.e. they may put "P J" in the middle name section...
Re^3: simpler regex by graff (Chancellor) on May 16, 2007 at 05:16 UTC
That's easy enough to accommodate:

    sub fullname {
        my @parts = @_;
        for ( @parts ) {
            s/(?<!\S)([A-Z])(?!\S)/$1./g;
        }
        return join " ", @parts;
    }

Of course, given that sort of regex, I guess it doesn't matter whether you join the parts before or after the substitution. (It's using negative look-behind and negative look-ahead to check that a single upper-case letter is neither preceded nor followed by a non-whitespace character, and in that case, put a period after the letter.)
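A quick sketch that runs this substitution against the inputs raised earlier in the thread. The wrapper name `dots_negative_look` is made up here; only the regex comes from the reply above.

```perl
use strict;
use warnings;

# The substitution from the reply above, wrapped in a helper so it can be
# tried on whole strings (the helper name is hypothetical).
sub dots_negative_look {
    my ($name) = @_;
    $name =~ s/(?<!\S)([A-Z])(?!\S)/$1./g;
    return $name;
}

print dots_negative_look('P J'), "\n";                         # P. J.
print dots_negative_look('Dr P J van Houten MD FRCS'), "\n";   # Dr P. J. van Houten MD FRCS
print dots_negative_look('Houten P'), "\n";                    # Houten P.
```

Unlike a `(?=\s)` check, `(?!\S)` also succeeds at the end of the string, so a trailing initial gets its dot, while `MD` and `FRCS` are left alone because each of their letters touches another non-whitespace character.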

