PerlMonks  

Making $ Unicode-aware

by jo37 (Pilgrim)
on Jul 26, 2020 at 18:31 UTC (#11119834, perlmeditation)

How bad is this idea:

To my understanding, $ in a regex (without the m modifier) is equivalent to (?=\n?\z), i.e. "Match the end of the string (or before newline at the end of the string)". With Unicode, the meaning of "newline" may be extended to "Linebreak", aka \R.

Wouldn't it be nice to make $ behave as (?=\R?\z) under some pragma or flag? (With the m flag present it would be (?=\R|\z) instead, of course.)

I believe this wouldn't even break much existing code. In the test below, € stands for the "new" $.

#!/usr/bin/perl
use v5.14;
use warnings;
use utf8;
use charnames qw(:full :short);
use feature 'say';

for ("noeol", "nl\n", "cr\r", "cr_nl\r\n") {
    my $u_chomped = s/\R//r;
    say "$u_chomped:";
    say 'matches $' if /^\p{word}*$/;
    say 'matches like $' if /^\p{word}*(?=\n?\z)/;
    say 'matches €' if /^\p{word}*(?=\R?\z)/;
    say 'matches \r$' if /^\p{word}*\r$/;
    say 'matches \r€' if /^\p{word}*\r(?=\R?\z)/;
    say 'matches \r?$' if /^\p{word}*\r?$/;
    say 'matches \r?€' if /^\p{word}*\r?(?=\R?\z)/;
    /^(.*)$/;
    say 'captured (.*)$' if $1 eq $u_chomped;
    /^(.*)(?=\R?\z)/;
    say 'captured (.*)€' if $1 eq $u_chomped;
    /^(.*).$/;
    say 'captured (.*).$' if $1 eq $u_chomped;
    /^(.*).(?=\R?\z)/;
    say 'captured (.*).€' if $1 eq $u_chomped;
    say "\n";
}
__DATA__
noeol:
matches $
matches like $
matches €
matches \r?$
matches \r?€
captured (.*)$
captured (.*)€

nl:
matches $
matches like $
matches €
matches \r?$
matches \r?€
captured (.*)$
captured (.*)€

cr:
matches €
matches \r$
matches \r€
matches \r?$
matches \r?€
captured (.*).$
captured (.*).€

cr_nl:
matches €
matches \r$
matches \r€
matches \r?$
matches \r?€
captured (.*).$
captured (.*).€
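Until such a pragma or flag exists, the proposed anchor can be approximated with a precompiled lookahead that is interpolated wherever $ would otherwise go. This is only a sketch: the names $EOL_U and ends_cleanly are made up for illustration and are not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed Unicode-aware "$" (written as
# "€" in the test above): end of string, or just before one final
# generic linebreak.
my $EOL_U = qr/(?=\R?\z)/;

# Hypothetical helper: true if the string consists of word characters
# only, optionally followed by a single trailing linebreak of any kind.
sub ends_cleanly {
    my ($str) = @_;
    return $str =~ /^\w*$EOL_U/ ? 1 : 0;
}

for my $s ("noeol", "nl\n", "cr\r", "cr_nl\r\n", "ls\x{2028}") {
    (my $shown = $s) =~ s/\R\z//;    # strip the linebreak for display only
    printf "%-6s %s\n", $shown, ends_cleanly($s) ? "matches" : "no match";
}
```

All five strings should match here, whereas a plain $ accepts only the first two. A string with two trailing linebreaks, such as "a\n\n", is rejected by both.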

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

Replies are listed 'Best First'.
Re: Making $ Unicode-aware
by jcb (Vicar) on Jul 27, 2020 at 02:29 UTC

    A better question is: what would we gain by doing this? How is \R different from \n? Are there other linebreaks meaningfully different from ASCII LF, or is the Unicode committee just wasting codepoints again?

      From perlrebackslash:

      \R is equivalent to (?>\x0D\x0A|\v)
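To see what that definition covers in practice, one can probe \R with a handful of single characters. This is a rough sketch; the candidate list and the helper name linebreak_codepoints are my own choices for illustration. Note that \v is Perl's vertical-whitespace class, not just the vertical tab.

```perl
use strict;
use warnings;

# Candidate code points: the vertical whitespace characters listed in
# perlrecharclass, plus tab and space as negative controls.
my @candidates = (0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0x2028, 0x2029);

# Return the code points whose single character is matched in full by \R.
sub linebreak_codepoints {
    return grep { chr($_) =~ /\A\R\z/ } @_;
}

printf "U+%04X\n", $_ for linebreak_codepoints(@candidates);

# CRLF is consumed as one unit, so only one linebreak is counted here.
my $count = () = "a\r\nb" =~ /\R/g;
print "linebreaks in a\\r\\nb: $count\n";
```

This should list LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029), but neither tab nor space. So a lone LF and a lone CR both match via \v, and the explicit \x0D\x0A branch is there so that CRLF counts as a single linebreak.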

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        Is that really intended to only match CRLF or should it be (?>\x0D?\x0A|\v) to also match traditional *nix line endings? (There is still a problem with (?>\x0D?\x0A|\v) — it does not match the traditional CR-only Macintosh line ending.) Why is vertical tab included?

Re: Making $ Unicode-aware
by Anonymous Monk on Jul 27, 2020 at 07:40 UTC
    Unicode is much more complex than that
Re: Making $ Unicode-aware
by Anonymous Monk on Jul 27, 2020 at 21:34 UTC
    Echoing jcb ... "what might we risk?"
