in reply to Re^2: Regex help
in thread Regex help

With the arrival of Unicode, it's wrong to use \d if you mean [0-9].

This worries me.
What happened when I wasn't paying attention?

Will most of my code that processes text containing digits break as soon as the input contains Unicode?

Cheers, Sören

Re: With Unicode, \d is wrong if you mean [0-9]
by TimToady (Parson) on Oct 26, 2004 at 15:40 UTC
    That depends on how desperately you want to prevent people from typing in numerals in:
    Arabic
    Devanagari
    Bengali
    Gurmukhi
    Gujarati
    Oriya
    Tamil
    Telugu
    Kannada
    Malayalam
    Thai
    Lao
    Tibetan
    Myanmar
    Ethiopic
    Khmer
    Mongolian
    Limbu
    Chinese
    Japanese
    Korean
    Vietnamese
    
    But it's not like \d is going to start throwing exceptions merely because you feed it Unicode.
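To make that concrete, here is a minimal sketch comparing \d with [0-9] on an ASCII string and on a Devanagari one. The helper names are made up for illustration, and the /a modifier assumes Perl 5.14 or later, so it was not available when this thread was written.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
# \d follows Unicode rules when the string holds code points above 255.
sub is_unicode_digits { my ($s) = @_; return $s =~ /\A\d+\z/     ? 1 : 0 }
# An explicit character class matches ASCII digits only.
sub is_ascii_digits   { my ($s) = @_; return $s =~ /\A[0-9]+\z/  ? 1 : 0 }
# Assumes Perl 5.14+: the /a modifier restricts \d to ASCII digits.
sub is_ascii_digits_a { my ($s) = @_; return $s =~ /\A\d+\z/a    ? 1 : 0 }

my %samples = (
    ascii      => "42",
    devanagari => "\x{096A}\x{0968}",   # DEVANAGARI DIGIT FOUR, DIGIT TWO
);

for my $name (sort keys %samples) {
    my $s = $samples{$name};
    # Print only ASCII labels and flags, so no output encoding layer is needed.
    printf "%-10s  \\d:%d  [0-9]:%d  \\d with /a:%d\n",
        $name, is_unicode_digits($s), is_ascii_digits($s), is_ascii_digits_a($s);
}
```

Neither string makes the regex die; the Devanagari one simply matches \d and fails the other two checks.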
Re: With Unicode, \d is wrong if you mean [0-9]
by olivierp (Hermit) on Oct 26, 2004 at 15:44 UTC
    I'd say it depends on what you mean by "breaks".
    Your code will still match [0-9] as you are used to, but \d will also match characters defined as "digits" in other "scripts". If code elsewhere depends on those characters being "Latin" digits, I think you may have undesired side effects.

    --
    Olivier
Re: With Unicode, \d is wrong if you mean [0-9]
by hardburn (Abbot) on Oct 26, 2004 at 15:46 UTC

    No, \d will match digit characters in many languages (as TimToady mentioned). I think it's more accurate to say that it's wrong to mean [0-9], as letting people put in digits in whatever language they want is usually the right thing.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      Hmm, if you want to add one to it, it probably wants to consist of [0-9]+ rather than \d+.

      Hugo

        That could be construed as a bug in Perl's internal grok_number() routine.
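Hugo's point about adding one can be sketched as follows. This is only an illustration: add_one is a hypothetical helper, and the sample input is made up. A string of Devanagari digits passes \d+, but Perl's string-to-number conversion only understands ASCII digits, so the arithmetic quietly treats it as 0.

```perl
use strict;
use warnings;

# Hypothetical helper: add one only when the input is plain ASCII digits,
# because that is what Perl's numeric conversion actually understands.
sub add_one {
    my ($text) = @_;
    return unless $text =~ /\A[0-9]+\z/;   # undef in scalar context
    return $text + 1;
}

my $devanagari_42 = "\x{096A}\x{0968}";    # DEVANAGARI DIGIT FOUR, DIGIT TWO

# The string is all digits as far as \d is concerned ...
my $passes_d = $devanagari_42 =~ /\A\d+\z/ ? 1 : 0;

# ... but it numifies to 0, so naive arithmetic yields 1 rather than 43.
my $naive_sum = do {
    no warnings 'numeric';                 # silence "isn't numeric" for the demo
    $devanagari_42 + 1;
};

my $checked = add_one($devanagari_42);     # undef: rejected by [0-9]+

printf "passes \\d+: %d, naive sum: %d, add_one: %s\n",
    $passes_d, $naive_sum, defined $checked ? $checked : 'rejected';
```

Validating with [0-9]+ before doing arithmetic at least turns the silent wrong answer into an explicit rejection.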