PerlMonks  

Looking for a cleaner regex

by k-man (Novice)
on Dec 08, 2017 at 10:38 UTC

k-man has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a cleaner solution if one exists. I'm highlighting Spanish verb endings. The regex I came up with seems to work but I can't help but be curious if there is a better way.

Here is what I came up with: (it's super long and I don't know how to prevent it from wrapping)

hilite_tail () {
    local WORD=$1
    local COLOR=$2
    PFX=$(echo $WORD | perl -pe 's/(.*)(ste(?=$)|e(?<!ste)(?=$)|es(?=$)|is(?<!(eis|ste))(?=$)|mos(?=$)|n(?<!ron)(?=$)|o(?=$)|ó(?=$)|.ron(?=$)|s(?<!(eis|.is|mos))(?=$)|steis(?=$)|é(?<!ué)(?=$)|í(?=$)|ué(?=$))/$1/g')
    SFX=$(echo $WORD | perl -pe 's/(.*)(ste(?=$)|e(?<!ste)(?=$)|es(?=$)|is(?<!(eis|ste))(?=$)|mos(?=$)|n(?<!ron)(?=$)|o(?=$)|ó(?=$)|.ron(?=$)|s(?<!(eis|.is|mos))(?=$)|steis(?=$)|é(?<!ué)(?=$)|í(?=$)|ué(?=$))/$2/g')
    [[ $PFX != $SFX ]] && echo "${CYAN_FG}$PFX${COLOR}$SFX${RESET}" || echo "${CYAN_FG}$WORD${RESET}"
}

Using the advice I've now simplified:

hilite_tail () {
    local WORD=$1
    local COLOR=$2
    PFX=$(echo $WORD | perl -pe 's/(.*)(ste|(?<!st)e|es|is(?<!(eis|ste))|mos|(?<!ro)n|o|ó|.ron|s(?<!(.is|mos))|steis|(?<!u)é|í|ué)$/$1/g')
    SFX=$(echo $WORD | perl -pe 's/(.*)(ste|(?<!st)e|es|is(?<!(eis|ste))|mos|(?<!ro)n|o|ó|.ron|s(?<!(.is|mos))|steis|(?<!u)é|í|ué)$/$2/g')
    [[ $PFX != $SFX ]] && echo "${CYAN_FG}$PFX${COLOR}$SFX${RESET}" || echo "${CYAN_FG}$WORD${RESET}"
}

Thanks

Replies are listed 'Best First'.
Re: Looking for a cleaner regex
by Eily (Monsignor) on Dec 08, 2017 at 11:28 UTC

    I see some simplifications:

    /e(?<!ste)/ means find an e and put the cursor right after it, then read the three letters before the cursor (e included) and check that they are not "ste". This is simply "an e not following st", so /(?<!st)e/ (only slightly shorter, but that adds up to several chars overall)

    /is(?<!ste)/ isn't very useful: if you put the match cursor right after "is", the previous three letters obviously can't be "ste".

    (?=PATTERN) makes perl check what's next without moving the match cursor. $ already checks that what's next is the end of the string or of a line without moving the cursor. So (?=$) and $ mean the same thing.

      Thanks for the info. The is/ste/ thing was a mistake... Can you give the alternative example to (?=$) because I was having trouble without it.

        If I want to find a string ending in foo or bar or baz, I just match against /(foo|bar|baz)$/ rather than repeat the EOL test in the regex. If that doesn't solve it for you perhaps using the techniques described in How to ask better questions using Test::More and sample data will help me and others understand where your problem lies.

Re: Looking for a cleaner regex
by QM (Parson) on Dec 08, 2017 at 14:38 UTC
        making the regex shorter and easier to read?
        It isn't limited by trie max limit maybe?
Re: Looking for a cleaner regex
by Anonymous Monk on Dec 08, 2017 at 10:59 UTC
    Um, people don't write clean regexes in a one-liner, and then it's embedded in a shell script. Also, people don't write a regex for just a list of words.

    regex presuf

    regex trie

      It's not just a list of words. It's a generalized regex to recognize standard verb forms in Spanish. That it's embedded in a shell script doesn't negate the fact that it's perl. It is a one liner. You got that much!

        Are you asking for help with shell or regex? Are you honestly editing this regex as presented in a shell script? You know how you got newlines and indenting in your shell script? Your regex is easily 10 times more complex than hilite_tail. You might need it in one line for your shell (doubtful), but it's silly for a human to write code or a regex without whitespace. According to your code, Spanish verb endings are a static list of 20ish words; if you think that requires optimization, you haven't timed it. To speed up what you have, replace the guts of hilite_tail with a single call to perl that does it all.
Re: Looking for a cleaner regex
by Anonymous Monk on Dec 13, 2017 at 03:29 UTC

    Using the advice I've now simplified:

    You've still got the same look-behind typos as described in Re: Looking for a cleaner regex

    Guessing at the meaning of the look-behinds, the list of words is now

    .ron steis ste mos es o ó í ué (?<!u)é (?<!st)e (?<!ro)n (?<!.is)s (?<!mos)s (?<!eis)is (?<!ste)is

    Eliminating the obviously needless look-behinds, leaving the goofy guesses, and guessing again

    .ron steis ste mos es o ó í ué é e n
    (?<!.is)s  ## maybe meant (?<!.i)s
    (?<!mos)s  ## maybe meant (?<!mo)s
    (?<!eis)is ## maybe meant (?<!e)is
    (?<!ste)is ## maybe meant (?<!ste)is

    If my new guess is correct, the last 4 needless look-behinds can be replaced by "s"

    steis .ron ste mos es ué o ó í é e n s

    And the single-character suffixes can be replaced by a char class

    steis .ron ste mos es ué [oóíéens]

    So there you have it: the optimal regex is  qr/( .ron | steis | ste | mos | es | ué | [oóíéens] )$/x
