PerlMonks  

One regex construct to handle multiple string types

by neversaint (Deacon)
on Nov 29, 2008 at 07:21 UTC ( [id://726738]=perlquestion )

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
Given these strings:
my $str1 = "2L";     # want to capture 2L
my $str2 = "bar.2L"; # want to capture 2L
I want to create one regex construct that can capture the desired string as stated above. However my regex below doesn't seem to do the job.
while(<DATA>) {
    /\w*\.*(\S+)/;
    print "$1\n";
}
__DATA__
2L
bar.2L
How can I get the correct one?

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: One regex construct to handle multiple string types
by CountZero (Bishop) on Nov 29, 2008 at 08:03 UTC
    One (of many) solutions would be to anchor your regex to the end of the string:
    /(.{2})$/

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I think the poster didn't want so much the final two characters (which could be got by substr($_, -2) anyway) as non-space characters after a period. Of course, your regex could easily be adapted to /\.(\S+)$/; but I think the sexeger /^(\S+)\./ should be quicker on some inputs (in the sense that it doesn't have to back-track on pathological strings like '............... .a').

      UPDATE: My sexeger can behave badly on strings with multiple periods in them. Two natural changes are to make the \S+ match non-greedy, or to change \S to [^.]. These have different matching behaviour, especially on strings with characters that are neither spaces nor 'word' characters, and on strings with multiple periods; but one of them might do what the poster wants.
      Also, I changed the sample string to one that actually matches.

      UPDATE 2: Oops, on re-reading, the poster explicitly wants to allow strings without any periods at all. Never mind.
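      For anyone comparing the end-anchored ideas in this sub-thread, here is a minimal sketch. The sub names are made up for illustration; only the two sample strings from the question are exercised. It shows that /(.{2})$/ and substr($_, -2) agree, while /\.(\S+)$/ needs a period and so finds nothing in "2L".

```perl
use strict;
use warnings;

# Hypothetical helper: last two characters via an end-anchored regex.
sub last_two_regex {
    my ($s) = @_;
    return $s =~ /(.{2})$/ ? $1 : undef;
}

# Same idea without a regex.
sub last_two_substr {
    my ($s) = @_;
    return substr($s, -2);
}

# Requires a literal period, so a string without one yields undef.
sub after_period {
    my ($s) = @_;
    return $s =~ /\.(\S+)$/ ? $1 : undef;
}

for my $s ('2L', 'bar.2L') {
    my $dotted = after_period($s);
    printf "%-8s regex=%s substr=%s after-period=%s\n",
        $s, last_two_regex($s), last_two_substr($s),
        defined $dotted ? $dotted : '(no match)';
}
```

      Whether "the last two characters" is really what the poster wants is a guess; the sketch only compares the patterns.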

Re: One regex construct to handle multiple string types
by parv (Parson) on Nov 29, 2008 at 07:32 UTC

    With the input "2L", \w* ends up eating only the "2". The input has no dot, so the optional \.* matches nothing, and the "L" is left over to satisfy \S+; that "L" is what gets printed.

    Given the example strings, make preceding word letters AND the dot a single combination which is optional: m/ (?: \w+[.] )? (\S+) /x.
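    A small sketch of the difference, using only the two strings from the question. The sub names are hypothetical; the patterns are the original one and the one suggested above.

```perl
use strict;
use warnings;

# Original pattern from the question.
sub capture_original {
    my ($s) = @_;
    return $s =~ /\w*\.*(\S+)/ ? $1 : undef;
}

# "Word characters plus dot" treated as one optional prefix.
sub capture_optional_prefix {
    my ($s) = @_;
    return $s =~ m/ (?: \w+[.] )? (\S+) /x ? $1 : undef;
}

for my $s ('2L', 'bar.2L') {
    printf "%-8s original=%s optional-prefix=%s\n",
        $s, capture_original($s), capture_optional_prefix($s);
}
```

    On "2L" the original captures just "L", while the optional-prefix version captures "2L"; both capture "2L" from "bar.2L". Strings with more than one dot are not covered here.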

      Out of curiosity, can you point me toward the precise definition of '\w'? I'm not clear as to why it eats 2 instead of 2L.
      for(split(" ","tsuJ rehtonA lreP rekcaH")){print reverse . " "}print "\b.\n";
        It's not about \w, but about backtracking.

        \w* initially 'eats' 2L, but then is forced to ... well... put the 'L' back on the table to let \S have it.

        Hmm... maybe 'eating' is not the best image for what's going on with backtracking regexps...?

        :)

        Krambambuli
        ---
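        One way to watch the hand-back happen is to put capture groups around each piece of the original pattern. This is only an illustrative sketch (the sub name is invented); the extra parentheses do not change what the pattern matches.

```perl
use strict;
use warnings;

# Return what each part of /\w*\.*(\S+)/ ended up matching.
sub split_match {
    my ($s) = @_;
    return $s =~ /(\w*)(\.*)(\S+)/ ? ($1, $2, $3) : ();
}

for my $s ('2L', 'bar.2L') {
    my ($word, $dots, $rest) = split_match($s);
    print "$s: \\w* got '$word', \\.* got '$dots', \\S+ got '$rest'\n";
}
```

        For "2L", \w* first takes both characters, \S+ then has nothing to match, so \w* gives back the "L" and keeps only the "2".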

        Precise definition depends on the language.

        Mastering Regular Expressions, 2nd Ed., Jeffrey E. F. Friedl, published by O'Reilly, characterizes \w in its "Common Metacharacters..." chapter this way:

        Part-of-word character   Often the same as [a-zA-Z0-9_], although some tools omit the underscore, while others include all the extra alphanumeric characters in the locale. If Unicode is supported, \w usually refers to all alphanumerics (notable exception: Sun's Java regex package, whose \w is exactly [a-zA-Z0-9_]).
        Regular Expressions Pocket Reference (also from O'Reilly) defines \w as:
        • \p{IsWord} for Perl
        • and as [A-Za-z0-9_] for Java.

        Regrettably, the definition of \p{IsWord} -- [_\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}] -- is, for me, almost impenetrable, but Friedl's characterization may be as good as you'll get without deep study of perlretut and friends.
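        A quick way to poke at what \w accepts in your own Perl, as a rough sketch. The sub names are hypothetical, and the /a modifier (restrict \w to ASCII) assumes Perl 5.14 or later. Code points are printed as numbers to avoid wide-character warnings.

```perl
use 5.014;
use warnings;

# True if the single character matches \w under Perl's default rules.
sub is_word_char {
    my ($ch) = @_;
    return $ch =~ /\A\w\z/ ? 1 : 0;
}

# True only for [A-Za-z0-9_], thanks to the /a modifier.
sub is_ascii_word_char {
    my ($ch) = @_;
    return $ch =~ /\A\w\z/a ? 1 : 0;
}

# 0x3B1 is GREEK SMALL LETTER ALPHA, a letter outside ASCII.
for my $cp (ord('2'), ord('L'), ord('_'), ord('.'), 0x3B1) {
    printf "U+%04X  \\w=%d  \\w/a=%d\n",
        $cp, is_word_char(chr $cp), is_ascii_word_char(chr $cp);
}
```

        Both "2" and "L" are word characters, which is why the answer to the original question lies in backtracking rather than in the definition of \w.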

Re: One regex construct to handle multiple string types
by Krambambuli (Curate) on Nov 29, 2008 at 10:19 UTC
    while(<DATA>) {
        /(.*\.)?(\S+)/;
        print "$2\n";
    }
    __DATA__
    2L
    bar.2L
    bar.ber.bir. 2L
    bar.ber.bir.2L
    works for me.

    The above reads your request as "get the first substring part not containing whitespaces that follows the last dot char (if there is one) in the input string; otherwise, return the first substring part of the input string that doesn't contain whitespace".

    Hope that helps.

    Krambambuli
    ---
      /(.*\.)?(\S+)/

      has a needless capture

      /(?:.*\.)?(\S+)/

      And the .* is useless (OP didn't specify he wanted last possible match)

      /\.?(\S+)/
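      The three patterns in this exchange are easy to compare side by side. A minimal sketch, with invented sub names and only the two strings from the original question; it suggests the last simplification is not equivalent, because an unanchored /\.?(\S+)/ can match from the very start of "bar.2L".

```perl
use strict;
use warnings;

# Krambambuli's pattern; the wanted text is in $2.
sub with_capture {
    my ($s) = @_;
    return $s =~ /(.*\.)?(\S+)/ ? $2 : undef;
}

# Same pattern with a non-capturing prefix; the wanted text is in $1.
sub without_capture {
    my ($s) = @_;
    return $s =~ /(?:.*\.)?(\S+)/ ? $1 : undef;
}

# Prefix reduced to a single optional dot.
sub optional_dot_only {
    my ($s) = @_;
    return $s =~ /\.?(\S+)/ ? $1 : undef;
}

for my $s ('2L', 'bar.2L') {
    printf "%-8s %-8s %-8s %s\n",
        $s, with_capture($s), without_capture($s), optional_dot_only($s);
}
```

      On "bar.2L" the first two return "2L", while the third returns "bar.2L", since \.? is happy to match nothing at position 0.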
Re: One regex construct to handle multiple string types
by johngg (Canon) on Nov 29, 2008 at 12:08 UTC

    If you want to find a single digit followed by a single upper-case letter anywhere on the line so that "33G" or "5AB" do not match you could use look-around assertions. Look behind assertions can't be variable width so here I use an alternation of two, one for beginning of string (it's an anchor so has zero width) and one for a non-digit (width of one).

    use strict;
    use warnings;

    while( <DATA> ) {
        chomp;
        printf q{%16s : }, $_;
        print m{(?x) (?: (?<=\A) | (?<=\D) ) (\d[A-Z]) (?![A-Z])}
            ? qq{Found $1\n}
            : qq{No match\n};
    }
    __DATA__
    2L
    bar.2L
    bar.ber.bir. 2L
    pob33J.slob
    bar.ber.bir.2L
    foo.3Hbar
    jar.8GH
    6Ytootle
    par.4T.spootle

    The output.

                  2L : Found 2L
              bar.2L : Found 2L
     bar.ber.bir. 2L : Found 2L
         pob33J.slob : No match
      bar.ber.bir.2L : Found 2L
           foo.3Hbar : Found 3H
             jar.8GH : No match
            6Ytootle : Found 6Y
      par.4T.spootle : Found 4T
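    As a possible variation (not what the code above uses), the two-way look-behind alternation can be swapped for a single negative look-behind, (?<!\d), which succeeds both at the start of the string and after any non-digit. A sketch with a hypothetical sub name:

```perl
use strict;
use warnings;

# Single digit + single upper-case letter, not preceded by a digit
# and not followed by another upper-case letter.
sub find_digit_letter {
    my ($s) = @_;
    return $s =~ /(?<!\d)(\d[A-Z])(?![A-Z])/ ? $1 : undef;
}

for my $s ('2L', 'bar.2L', 'bar.ber.bir. 2L', 'pob33J.slob',
           'bar.ber.bir.2L', 'foo.3Hbar', 'jar.8GH', '6Ytootle',
           'par.4T.spootle') {
    my $hit = find_digit_letter($s);
    printf "%16s : %s\n", $s, defined $hit ? "Found $hit" : 'No match';
}
```

    On the sample data this should agree line for line with the output shown above.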

    I hope this is of interest.

    Cheers,

    JohnGG

Re: One regex construct to handle multiple string types
by oko1 (Deacon) on Nov 29, 2008 at 14:23 UTC

    If you're looking for a single digit followed by a capital letter (this is a wild guess, since you might just be looking for a literal '2L' anywhere in the string - it's rather hard to tell), then this simple regex will do:

    while (<DATA>){
        chomp;
        print "$1\n" if /(\d[A-Z])/;
    }

    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
Re: One regex construct to handle multiple string types
by grinder (Bishop) on Nov 29, 2008 at 18:18 UTC
    use strict;
    use warnings;
    use Regexp::Assemble;

    my $r = Regexp::Assemble->new;
    $r->add('2L', 'bar.2L');
    my $pattern = qr/($r)/;

    Oops, no, I didn't see it was just the '2L' part you were interested in. Oh well, rather than removing the code I may as well leave it in case it solves someone else's problem :)

    • another intruder with the mooring in the heart of the Perl

Re: One regex construct to handle multiple string types
by JavaFan (Canon) on Dec 01, 2008 at 12:54 UTC
    The problem with asking a question about how to match strings, and providing just two examples of the problem, is that no one knows what you mean. Some simple "solutions" that match all your examples include:
    /(2L)/
    /(..)$/
    /(\p{Nd}\p{LU})/
    /(?:bar.|^)(..)/
    (split /\./)[-1]
    /(?:.*\.)?(.*)/
    /([^.]+)$/
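    To make the point concrete, here is a small harness sketch that runs the regex candidates from the list against the two examples. The names @candidates and passes_examples are invented for illustration, the split-based entry is left out because it is not a pattern match, and \p{Lu} is written in its usual spelling.

```perl
use strict;
use warnings;

# Regex candidates; each keeps the wanted text in $1.
my @candidates = (
    qr/(2L)/,
    qr/(..)$/,
    qr/(\p{Nd}\p{Lu})/,
    qr/(?:bar.|^)(..)/,
    qr/(?:.*\.)?(.*)/,
    qr/([^.]+)$/,
);

# True if the pattern captures "2L" from both example strings.
sub passes_examples {
    my ($re) = @_;
    for my $s ('2L', 'bar.2L') {
        return 0 unless $s =~ $re && $1 eq '2L';
    }
    return 1;
}

for my $re (@candidates) {
    printf "%-28s %s\n", "$re", passes_examples($re) ? 'ok' : 'FAIL';
}
```

    All six pass on the two examples, yet they would disagree on almost any third string, which is exactly why the question is underspecified.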
