Looking for a cleaner regex

k-man has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a cleaner solution if one exists. I'm highlighting Spanish verb endings. The regex I came up with seems to work but I can't help but be curious if there is a better way.

Here is what I came up with: (it's super long and I don't know how to prevent it from wrapping)

hilite_tail () {
    local WORD=$1
    local COLOR=$2
    PFX=$(echo $WORD | perl -pe 's/(.*)(ste(?=$)|e(?<!ste)(?=$)|es(?=$)|is(?<!(eis|ste))(?=$)|mos(?=$)|n(?<!ron)(?=$)|o(?=$)|ó(?=$)|.ron(?=$)|s(?<!(eis|.is|mos))(?=$)|steis(?=$)|é(?<!ué)(?=$)|í(?=$)|ué(?=$))/$1/g')
    SFX=$(echo $WORD | perl -pe 's/(.*)(ste(?=$)|e(?<!ste)(?=$)|es(?=$)|is(?<!(eis|ste))(?=$)|mos(?=$)|n(?<!ron)(?=$)|o(?=$)|ó(?=$)|.ron(?=$)|s(?<!(eis|.is|mos))(?=$)|steis(?=$)|é(?<!ué)(?=$)|í(?=$)|ué(?=$))/$2/g')
    [[ $PFX != $SFX ]] && echo "${CYAN_FG}$PFX${COLOR}$SFX${RESET}" || echo "${CYAN_FG}$WORD${RESET}"
}

Using the advice I've now simplified:

hilite_tail () {
    local WORD=$1
    local COLOR=$2
    PFX=$(echo $WORD | perl -pe 's/(.*)(ste|(?<!st)e|es|is(?<!(eis|ste))|mos|(?<!ro)n|o|ó|.ron|s(?<!(.is|mos))|steis|(?<!u)é|í|ué)$/$1/g')
    SFX=$(echo $WORD | perl -pe 's/(.*)(ste|(?<!st)e|es|is(?<!(eis|ste))|mos|(?<!ro)n|o|ó|.ron|s(?<!(.is|mos))|steis|(?<!u)é|í|ué)$/$2/g')
    [[ $PFX != $SFX ]] && echo "${CYAN_FG}$PFX${COLOR}$SFX${RESET}" || echo "${CYAN_FG}$WORD${RESET}"
}

Thanks


Replies are listed 'Best First'.
Re: Looking for a cleaner regex by Eily (Monsignor) on Dec 08, 2017 at 11:28 UTC
I see some simplifications:
`/e(?<!ste)/` means: find an e and put the cursor right after it, then read the three letters before the cursor (e included) and check that they are not "ste". This is simply "an e not following st", so `/(?<!st)e/` (only slightly shorter, but that adds up to several chars overall).
`/is(?<!ste)/` isn't very useful: if you put the match cursor right behind "is", the previous three letters obviously can't be "ste".
`(?=PATTERN)` makes perl check what's next without moving the match cursor. `$` already checks that what's next is the end of the string or of a line, without moving the cursor. So `(?=$)` and `$` mean the same thing.
Re^2: Looking for a cleaner regex by k-man (Novice) on Dec 08, 2017 at 11:39 UTC
Thanks for the info. The is/ste/ thing was a mistake... Can you give an example of the alternative to (?=$)? I was having trouble without it.
Re^3: Looking for a cleaner regex by hippo (Bishop) on Dec 08, 2017 at 11:59 UTC
If I want to find a string ending in foo or bar or baz, I just match against `/(foo|bar|baz)$/` rather than repeat the EOL test in the regex. If that doesn't solve it for you, perhaps using the techniques described in How to ask better questions using Test::More and sample data will help me and others understand where your problem lies.
Re: Looking for a cleaner regex by QM (Parson) on Dec 08, 2017 at 14:38 UTC
See Regexp::Optimizer
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re^2: Looking for a cleaner regex ( trie since 5.10 ! ) by LanX (Saint) on Dec 08, 2017 at 14:45 UTC
> See Regexp::Optimizer
The author should add a prominent remark that Perl has supported trie optimization for over 10 years already; see perl5100delta#Trie-optimisation-of-literal-string-alternations.
So what's the benefit of this module with current Perl, unless you generate a regex for another language?
Update: see also Re^3: Looking for a cleaner regex (trie) ff.
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery
Re^3: Looking for a cleaner regex ( trie since 5.10 ! ) by Anonymous Monk on Dec 08, 2017 at 15:12 UTC
Making the regex shorter and easier to read?
Re^4: Looking for a cleaner regex ( trie since 5.10 ! ) by choroba (Cardinal) on Dec 08, 2017 at 15:22 UTC
Re^5: Looking for a cleaner regex ( trie since 5.10 ! ) by LanX (Saint) on Dec 08, 2017 at 15:42 UTC
Re^3: Looking for a cleaner regex ( trie since 5.10 ! ) by Anonymous Monk on Dec 09, 2017 at 10:46 UTC
Maybe it isn't limited by the trie max limit?
Re: Looking for a cleaner regex by Anonymous Monk on Dec 08, 2017 at 10:59 UTC
Um, people don't write a clean regex in a one-liner, and then it's embedded in a shell script. Also, people don't write a regex for just a list of words. See: regex presuf, regex trie
Re^2: Looking for a cleaner regex by k-man (Novice) on Dec 08, 2017 at 11:31 UTC
It's not just a list of words. It's a generalized regex to recognize standard verb forms in Spanish. That it's embedded in a shell script doesn't negate the fact that it's perl. It is a one liner. You got that much!
Re^3: Looking for a cleaner regex (trie) by LanX (Saint) on Dec 08, 2017 at 13:18 UTC
Why don't you just trust in the internal trie optimization?
Update: e.g. compare Re: How to tokenize string by custom dictionary? (+code)
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery
Re^3: Looking for a cleaner regex by Anonymous Monk on Dec 08, 2017 at 14:04 UTC
Are you asking for help with shell or regex? Are you honestly editing this regex as presented in a shell script? You know how you got newlines and indenting in your shell script? Your regex is easily 10 times more complex than hilite_tail. You might need it on one line for your shell (doubtful), but it's silly for a human to write code or a regex without whitespace. According to your code, Spanish verb endings are a static list of 20-ish words; if you think that requires optimization, you haven't timed it. To speed up what you have, replace the guts of hilite_tail with a single call to perl that does it all.
Re: Looking for a cleaner regex by Anonymous Monk on Dec 13, 2017 at 03:29 UTC
> Using the advice I've now simplified:
You've still got the same look-behind typos as described in Re: Looking for a cleaner regex.
Guessing at the meaning of the look-behinds, the list of words is now
    .ron steis ste mos es o ó í ué
    (?<!u)é
    (?<!st)e
    (?<!ro)n
    (?<!.is)s
    (?<!mos)s
    (?<!eis)is
    (?<!ste)is
Eliminating the obviously needless look-behinds, leaving the goofy guesses and guessing again
    .ron steis ste mos es o ó í ué é e n
    (?<!.is)s   ## maybe meant (?<!.i)s
    (?<!mos)s   ## maybe meant (?<!mo)s
    (?<!eis)is  ## maybe meant (?<!e)is
    (?<!ste)is  ## maybe meant (?<!ste)is
If my new guess is correct, the last 4 needless look-behinds can be replaced by "s"
    steis .ron ste mos es ué o ó í é e n s
And the single-character suffixes can be replaced by a char class
    steis .ron ste mos es ué [oóíéens]
So there you have it, the optimal regex is `qr/( .ron | steis | ste | mos | es | ué | [oóíéens] )$/x`

