One regex construct to handle multiple string types

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: One regex construct to handle multiple string types by CountZero (Bishop) on Nov 29, 2008 at 08:03 UTC
One (of many) solutions would be to anchor your regex to the end of the string: `/(.{2})$/` CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^2: One regex construct to handle multiple string types by JadeNB (Chaplain) on Nov 30, 2008 at 18:34 UTC
I think the poster wanted not so much the final two characters (which could be got by `substr($_, -2)` anyway) as the non-space characters after a period. Of course, your regex could easily be adapted to `/\.(\S+)$/`; but I think the sexeger `/^(\S+)\./` (the reversed regex, applied to the reversed string) should be quicker on some inputs, in the sense that it doesn't have to backtrack on pathological strings like `'............... .a'`. UPDATE: My sexeger can behave badly on strings with multiple periods in them. Two natural changes are to make the `\S+` match non-greedy, or to change `\S` to `[^.]`. These have different matching behaviour, especially on strings with characters that are neither spaces nor 'word' characters, and on strings with multiple periods; but one of them might do what the poster wants. Also, I changed the sample string to one that actually matches. UPDATE 2: Oops, on re-reading, the poster explicitly wants to allow strings without any periods at all. Never mind.
Re: One regex construct to handle multiple string types by parv (Parson) on Nov 29, 2008 at 07:32 UTC
In the case of the input "2L", `\w*` eats the "2". Since the input string does not contain the optional dot, you are left with "L", as required by `\S+`, which is then printed. Given the example strings, make the preceding word characters AND the dot a single optional group: `m/ (?: \w+[.] )? (\S+) /x`.
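A quick way to try the suggested pattern on the two example strings is sketched below. The helper name `after_optional_prefix` is made up for illustration and is not from the thread; only the pattern itself is taken from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): apply the pattern with the
# optional "word characters plus dot" prefix and return what \S+ captured.
sub after_optional_prefix {
    my ($s) = @_;
    return $s =~ m/ (?: \w+[.] )? (\S+) /x ? $1 : undef;
}

for my $s ('2L', 'bar.2L') {
    # Both example strings should yield the part after the optional prefix.
    printf "%-8s => %s\n", $s, after_optional_prefix($s) // '(no match)';
}
```

With a string that has several dots, such as "bar.ber.bir.2L", this pattern only skips the first "word plus dot" prefix, so it may or may not be what the original poster needs.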
Re^2: One regex construct to handle multiple string types by pobocks (Chaplain) on Nov 29, 2008 at 10:22 UTC
Out of curiosity, can you point me toward the precise definition of `\w`? I'm not clear as to why it eats 2 instead of 2L. `for(split(" ","tsuJ rehtonA lreP rekcaH")){print reverse . " "}print "\b.\n";`
Re^3: One regex construct to handle multiple string types by Krambambuli (Curate) on Nov 29, 2008 at 10:41 UTC
It's not about `\w`, but about backtracking. `\w*` initially 'eats' 2L, but then is forced to ... well... put the 'L' back on the table to let `\S` have it. Hmm... maybe 'eating' is not the best image for what's going on with backtracking regexps...? :) Krambambuli
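The give-back can be made visible by capturing both parts. This is only a sketch: the pattern `/^(\w*)(\S+)/` and the helper `split_greedy` are assumptions chosen for illustration, since the original poster's exact regex is not quoted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: capture what \w* keeps and what \S+ ends up with.
sub split_greedy {
    my ($s) = @_;
    return $s =~ /^(\w*)(\S+)/ ? ($1, $2) : ();
}

# \w* first grabs "2L", then backtracks one character so \S+ can match.
my ($word, $rest) = split_greedy('2L');
print "\\w* kept '$word', \\S+ got '$rest'\n";   # \w* kept '2', \S+ got 'L'
```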
Re^4: One regex construct to handle multiple string types by pobocks (Chaplain) on Nov 29, 2008 at 10:46 UTC
Re^3: One regex construct to handle multiple string types by ww (Archbishop) on Nov 29, 2008 at 12:34 UTC
The precise definition depends on the language. Mastering Regular Expressions, 2nd Ed., by Jeffrey E. F. Friedl, published by O'Reilly, characterizes `\w` in its "Common Metacharacters..." chapter this way: Part-of-word character. Often the same as `[a-zA-Z0-9_]`, although some tools omit the underscore, while others include all the extra alphanumeric characters in the locale. If Unicode is supported, `\w` usually refers to all alphanumerics (notable exception: Sun's Java regex package, whose `\w` is exactly `[a-zA-Z0-9_]`). Regular Expressions Pocket Reference (also from O'Reilly) defines `\w` as `\p{IsWord}` for Perl and as `[A-Za-z0-9_]` for Java. Regrettably, the definition of `\p{IsWord}` -- `[_\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]` -- is, for me, almost impenetrable, but Friedl's characterization may be as good as you'll get without deep study of perlretut and friends.
Re^3: One regex construct to handle multiple string types by JadeNB (Chaplain) on Nov 30, 2008 at 18:38 UTC
Of course, Re^3: One regex construct to handle multiple string types has already answered why you get the indicated match, but …. While it doesn't seem to be in the documentation at perldoc.perl.org, the Perl 5.10 documentation for perlre has a section called "Character Classes and other Special Escapes" that says: `\w` Match a "word" character (alphanumeric plus "_"). UPDATE: Ah, found it, at perlre.
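For plain ASCII input, the quoted definition is easy to check by hand. A minimal sketch follows; `is_word_char` is a hypothetical helper written for this illustration, not something provided by perlre.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the given single character matches \w.
sub is_word_char {
    my ($c) = @_;
    return $c =~ /^\w$/ ? 1 : 0;
}

# Digits, letters and the underscore count as word characters;
# the dot and the space do not.
for my $c ('2', 'L', '_', '.', ' ') {
    printf "'%s' => %s\n", $c, is_word_char($c) ? 'word' : 'not word';
}
```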
Re: One regex construct to handle multiple string types by Krambambuli (Curate) on Nov 29, 2008 at 10:19 UTC
`while(<DATA>) { /(.*\.)?(\S+)/; print "$2\n"; } __DATA__ 2L bar.2L bar.ber.bir. 2L bar.ber.bir.2L` works for me. The above reads your request as "get the first substring not containing whitespace that follows the last dot character (if there is one) in the input string; otherwise, return the first substring of the input string that doesn't contain whitespace". Hope that helps. Krambambuli
Re^2: One regex construct to handle multiple string types by ikegami (Patriarch) on Nov 29, 2008 at 11:35 UTC
`/(.*\.)?(\S+)/` has a needless capture: `/(?:.*\.)?(\S+)/`. And the `.*` is useless (the OP didn't specify that he wanted the last possible match): `/\.?(\S+)/`
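Whether the three variants are interchangeable depends on the input, so it may be worth running them side by side before picking one. The sketch below assumes the patterns `/(.*\.)?(\S+)/`, `/(?:.*\.)?(\S+)/` and `/\.?(\S+)/`; the labels and the `capture` helper are hypothetical names used only for this comparison.

```perl
use strict;
use warnings;

# Each entry: label, pattern, and the capture group holding the result.
my @patterns = (
    [ 'two captures',  qr/(.*\.)?(\S+)/,   2 ],
    [ 'non-capturing', qr/(?:.*\.)?(\S+)/, 1 ],
    [ 'no .*',         qr/\.?(\S+)/,       1 ],
);

# Hypothetical helper: return the requested capture group, or nothing
# if the pattern does not match at all.
sub capture {
    my ($re, $group, $s) = @_;
    my @caps = $s =~ $re or return;
    return $caps[$group - 1];
}

for my $s ('2L', 'bar.2L', 'bar.ber.bir.2L') {
    for my $p (@patterns) {
        my ($label, $re, $group) = @$p;
        printf "%-15s %-14s => %s\n", $s, $label,
            capture($re, $group, $s) // '(no match)';
    }
}
```

On these inputs the first two variants agree, while the variant without `.*` returns the whole token for the dotted strings, because the optional `\.` can match nothing at the start of the string and `\S+` then takes everything up to the first whitespace.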
Re: One regex construct to handle multiple string types by johngg (Canon) on Nov 29, 2008 at 12:08 UTC
If you want to find a single digit followed by a single upper-case letter anywhere on the line, so that "33G" or "5AB" do not match, you could use look-around assertions. Look-behind assertions can't be variable width, so here I use an alternation of two: one for the beginning of the string (it's an anchor, so it has zero width) and one for a non-digit (width of one). `use strict; use warnings; while( <DATA> ) { chomp; printf q{%16s : }, $_; print m{(?x) (?: (?<=\A) | (?<=\D) ) (\d[A-Z]) (?![A-Z])} ? qq{Found $1\n} : qq{No match\n}; } __DATA__ 2L bar.2L bar.ber.bir. 2L pob33J.slob bar.ber.bir.2L foo.3Hbar jar.8GH 6Ytootle par.4T.spootle` The output: `2L : Found 2L bar.2L : Found 2L bar.ber.bir. 2L : Found 2L pob33J.slob : No match bar.ber.bir.2L : Found 2L foo.3Hbar : Found 3H jar.8GH : No match 6Ytootle : Found 6Y par.4T.spootle : Found 4T` I hope this is of interest. Cheers, JohnGG
Re: One regex construct to handle multiple string types by oko1 (Deacon) on Nov 29, 2008 at 14:23 UTC
If you're looking for a single digit followed by a capital letter (this is a wild guess, since you might just be looking for a literal '2L' anywhere in the string - it's rather hard to tell), then this simple regex will do: `while (<DATA>){ chomp; print "$1\n" if /(\d[A-Z])/; }` -- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
Re: One regex construct to handle multiple string types by grinder (Bishop) on Nov 29, 2008 at 18:18 UTC
`use strict; use warnings; use Regexp::Assemble; my $r = Regexp::Assemble->new; $r->add('2L', 'bar.2L'); my $pattern = qr/($r)/;` Oops, no, I didn't see it was just the '2L' part you were interested in. Oh well, rather than removing the code I may as well leave it in case it solves someone else's problem :) • another intruder with the mooring in the heart of the Perl
Re: One regex construct to handle multiple string types by JavaFan (Canon) on Dec 01, 2008 at 12:54 UTC
The problem with asking a question about how to match strings, and providing just two examples of the problem, is that no one knows what you mean. Some simple "solutions" that match all your examples include: `/(2L)/ /(..)$/ /(\p{Nd}\p{LU})/ /(?:bar.|^)(..)/ (split /\./)[0] /(?:.*\.)?(.*)/ /([^.]+)$/`

