http://qs321.pair.com?node_id=402608


in reply to Re: Regex help
in thread Regex help

That would also match B4๖ (B4 followed by a Thai digit six, U+0E56):
m/B(?:3[89]|4\d)/ and print for "B4\x{E56}"
With the arrival of Unicode, it's wrong to use \d if you mean [0-9].
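
To make the difference concrete, here is a minimal sketch comparing the three spellings against the same string. The variable names are only illustrative, and the /a modifier is an assumption about the reader's Perl: it needs 5.14 or later, which postdates this thread.

```perl
use 5.014;      # /a needs Perl 5.14+ (assumption: a newer Perl than the thread's)
use warnings;

# "B4" followed by THAI DIGIT SIX (U+0E56)
my $s = "B4\x{E56}";

# \d matches any Unicode decimal digit in a character string
my $unicode_d = $s =~ /B(?:3[89]|4\d)/    ? 1 : 0;

# An explicit class matches ASCII digits only
my $ascii_cls = $s =~ /B(?:3[89]|4[0-9])/ ? 1 : 0;

# The /a modifier restricts \d (and \w, \s) to ASCII
my $ascii_mod = $s =~ /B(?:3[89]|4\d)/a   ? 1 : 0;

printf "\\d: %d  [0-9]: %d  \\d with /a: %d\n",
    $unicode_d, $ascii_cls, $ascii_mod;
```

Only the plain \d version matches; the character class and the /a version both reject the Thai digit.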

With Unicode, \d is wrong if you mean [0-9]
by Happy-the-monk (Canon) on Oct 26, 2004 at 15:04 UTC

    With the arrival of Unicode, it's wrong to use \d if you mean [0-9].

    This worries me.
    What happened when I wasn't paying attention?

    Will most of my code that processes text containing digits break as soon as the input contains Unicode?

    Cheers, Sören

      That depends on how desperately you want to prevent people from typing in numerals in:
      Arabic
      Devanagari
      Bengali
      Gurmukhi
      Gujarati
      Oriya
      Tamil
      Telugu
      Kannada
      Malayalam
      Thai
      Lao
      Tibetan
      Myanmar
      Ethiopic
      Khmer
      Mongolian
      Limbu
      Chinese
      Japanese
      Korean
      Vietnamese
      
      But it's not like \d is going to start throwing exceptions merely because you feed it Unicode.
      I'd say it depends on what you mean by "break".
      Your code will still match [0-9] as you are used to, but it will also match characters defined as digits in other scripts. If code elsewhere depends on the match being a "Latin" digit, you may get undesired side effects.

      --
      Olivier

      No, \d will match digit characters in many languages (as TimToady mentioned). I think it's more accurate to say that it's wrong to mean [0-9], as letting people put in digits in whatever language they want is usually the right thing.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        Hmm, if you want to add one to the matched value, it probably needs to consist of [0-9]+ rather than \d+.

        Hugo
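
Hugo's point about arithmetic can be sketched as follows: Perl's string-to-number conversion only understands ASCII digits, so a string that \d+ accepts is not necessarily usable as a number. The sketch assumes Unicode::UCD's num() function, which ships with Perl 5.14 and later (newer than this thread); the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);   # num() is available from Perl 5.14 (assumption)

my $thai_six = "\x{E56}";   # THAI DIGIT SIX, matched by \d

# Naive arithmetic: the string is not an ASCII number, so it numifies as 0
my $naive;
{
    no warnings 'numeric';  # silence "Argument isn't numeric"
    $naive = $thai_six + 1;
}

# Convert the digit string to its numeric value first, then add
my $value     = num($thai_six);
my $converted = defined $value ? $value + 1 : undef;

print "naive: $naive, converted: $converted\n";
```

So accepting \d+ is workable only if the value is converted before it is used as a number; otherwise [0-9]+ is the safer pattern.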