Regexp: How to match in middle but not the ends?

by cormanaz (Deacon)
on Jul 28, 2006 at 20:29 UTC ( [id://564451]=perlquestion )

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Could a regexp Monk help me out?

I have long strings built of the character class [\-LCHS], and they are mostly \-. I want to loop through these strings and find maximal sets of two or more of the capital letters that do not begin or end with L. So for example

my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
while ($string =~ /[CSH][CSHL]+/g) {
    print "$&, ";
}

The output I want from this is: CSH, CSH, CSLH, CCHLSHC. But currently it messes up the 2nd and 4th strings by including an L at the end.

I have been scratching my head over this for quite a while. I can exclude an L at the beginning or at the end. But since the L can be legal in the middle, I can't figure out how to exclude it on the other end also.

Thanks....

Steve

Replies are listed 'Best First'.
Re: Regexp: How to match in middle but not the ends?
by ikegami (Patriarch) on Jul 28, 2006 at 20:46 UTC
    Don't use $&! It slows down all the regexps in your program (including those in modules) that don't have captures. Use captures instead.
    while ($string =~ /([CSH][CSHL]*[CSH])/g) { print "$1, "; }

    Use join to avoid the trailing comma.

    print join ', ', $string =~ /([CSH][CSHL]*[CSH])/g;

    An alternative approach would be to strip out the offending L characters.

    for ($string) {
        s/-L+|L+-/-/g;
        s/^L+//;
        s/L+$//;
        print join ', ', /([CSHL]{2,})/g;
    }

    Update: Added s/^L// and s/L$//.
    Update: Changed L to L+.
    Update: Accidentally changed too many things to "+"s. Fixed.
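
A minimal sketch of the capture-based approach described above, wrapped in a helper so it can be reused. The name `find_runs` is hypothetical and not from the thread, and the sample string is written with its trailing dashes joined (`----`), on the assumption that the `- +---` in the posted code is a line-wrap marker.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of [CSHL] characters that is at
# least two characters long and both starts and ends with C, S or H.
sub find_runs {
    my ($string) = @_;
    # In list context, a /g match with one capture group returns all captures.
    return $string =~ /([CSH][CSHL]*[CSH])/g;
}

my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
print join(', ', find_runs($string)), "\n";   # CSH, CSH, CSLH, CCHLSHC
```

The greedy `[CSHL]*` first grabs the whole run, then backtracks until the final `[CSH]` can match, which is what trims a trailing L.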

      If you strip the L, don't forget (like I did first) to strip an L at the start and end of the string. You should also use "+" to strip any sequence of L. I think this will do: s/(?:^L+)|(?:-L+)|(?:L+-)|(?:L+$)/-/g (I used (?:) because I'm currently too busy (lazy) to test...)

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        No, he shouldn't use "+", since he doesn't want to strip any sequence of "L". For example, your solution fails to match "LL" in "---LLLL---".

        My solution was lacking since I wasn't checking for an L at the start or end of the string. Fixes:

        $string =~ s/-L|L-/-/g; $string =~ s/^L//; $string =~ s/L$//;
        or
        $string =~ s/^L|(?<=-)L|L(?=-)|L$//g;
        or
        # Does a bit more than stripping, but in an inconsequential fashion.
        $string =~ s/^L|-L|L-|L$/-/g;
      print join ', ', $string =~ /([CSH][CSHL]+[CSH])/g;
      matches at least 3 characters and thus should be print join ', ', $string =~ /([CSH][CSHL]*[CSH])/g; (as you mention somewhere else in this thread).

      -- Hofmator

Re: Regexp: How to match in middle but not the ends?
by Hue-Bond (Priest) on Jul 28, 2006 at 20:36 UTC

    Not being a regex expert, I've come up with this:

    my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
    while ($string =~ /([CHS][LCHS]+[CHS])/xg) {
        print "$1\n";
    }
    __END__
    CSH
    CSH
    CSLH
    CCHLSHC

    --
    David Serrano

Re: Regexp: How to match in middle but not the ends?
by Fletch (Bishop) on Jul 28, 2006 at 20:32 UTC

    Perhaps you should read perlre and see if a non-greedy modifier /[CSH][CSHL]+?/ helps? Or explicitly require that the last character isn't an L, /[CSH][CSHL]*[CSH]/g.

      /[CSH][CSHL]+?/ doesn't work. It would incorrectly match "CL" in "---CL---", and it will never match more than two characters.

      /[CSH][CSHL]*[CSH]/ works.

      If there wasn't a two character minimum, we'd have to use zero-width lookaheads and/or lookbehinds.
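
A small sketch contrasting the two patterns discussed above on the "---CL---" example. The helper name `first_match` and the second test string are illustrative assumptions, not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $re in $string, or undef.
sub first_match {
    my ($re, $string) = @_;
    my ($match) = $string =~ $re;
    return $match;
}

my $lazy  = qr/([CSH][CSHL]+?)/;        # non-greedy: stops after two characters
my $fixed = qr/([CSH][CSHL]*[CSH])/;    # last character must be C, S or H

for my $s ('---CL---', '---CSLH---') {
    my $l = first_match($lazy,  $s);
    my $f = first_match($fixed, $s);
    printf "%-12s lazy=%-6s fixed=%s\n", $s,
        defined $l ? $l : '(none)',
        defined $f ? $f : '(none)';
}
```

On "---CL---" the non-greedy pattern captures "CL" while the second pattern finds nothing; on "---CSLH---" the non-greedy pattern stops at "CS" while the second captures "CSLH".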

Re: Regexp: How to match in middle but not the ends?
by explorer (Chaplain) on Jul 28, 2006 at 20:36 UTC

    You said the solution...

    my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
    # I want to loop through these strings and
    # find maximal sets of two or more of the capital letters
    # that do not begin or end with L
    while ($string =~ /([CSH][CSHL]*[CSH])/g) {
        print "$1, ";
    }
    __OUTPUT__
    CSH, CSH, CSLH, CCHLSHC,
Re: Regexp: How to match in middle but not the ends?
by Skeeve (Parson) on Jul 28, 2006 at 23:45 UTC
    TIMTOWTDI, and assuming your string need not be checked for containing only legal characters:
    my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
    foreach (split /L*-+L*/, $string) {
        print $_, "\n" if length($_) >= 2;   # "two or more" letters
    }
    Update: I just noticed: This will fail if the string starts or ends with "L" and not with "-" or anything else. So making the $string in the split a "-$string-" is one workaround.

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
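
A sketch of the split-based idea with the "-$string-" padding workaround from the update applied. The helper name `split_runs` is hypothetical, and the length test uses `>= 2` on the assumption that "two or more" letters is the intended minimum.

```perl
use strict;
use warnings;

# Hypothetical helper: split on dash runs together with any L's touching them,
# then keep the pieces that are at least two characters long.
sub split_runs {
    my ($string) = @_;
    # Pad with dashes so an L at either end of the string is absorbed
    # by a separator instead of surviving as part of a field.
    return grep { length($_) >= 2 } split /L*-+L*/, "-$string-";
}

print join(', ', split_runs('LCSH----CSHL')), "\n";   # CSH, CSH
```

Like the original, this assumes the input contains only characters from [\-LCHS]; anything else would pass through unchecked.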
Re: Regexp: How to match in middle but not the ends?
by furry_marmot (Pilgrim) on Jul 31, 2006 at 10:34 UTC
    Taking the poster's exact description, "I want to loop through these strings and find maximal sets of two or more of the capital letters that do not begin or end with L." I came up with this:
    my $string = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
    my @patterns = $string =~ /([^-L][LCSH]*[^-L])/g;
    print join ', ', @patterns;
    The result is this:

    CSH, CSH, CSLH, CCHLSHC

    Is that what you were looking for?

    --marmot

Re: Regexp: How to match in middle but not the ends?
by TedPride (Priest) on Jul 30, 2006 at 17:56 UTC
    Perhaps the simplest solution is a two-part regex sequence:
    $_ = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
    while (m/([A-Z]+)/g) {
        ($s = $1) =~ s/^L+|L+$//g;
        print "$s\n" if $s;
    }
    EDIT: Oops, missed that.

      It incorrectly matches "C" in "-C-". (Two chars minimum.)

      $_ = '---LL--C----LCSH-------CSHL-------LCSLH-------LCCHLSHCL----';
      while (m/([A-Z]+)/g) {
          ($s = $1) =~ s/^L+|L+$//g;
          print "$s\n" if length $s >= 2;
      }
