http://qs321.pair.com?node_id=402372


in reply to Regex help

Another method:

m/B(?:3[89]|4\d)/
May the Force be with you

Re^2: Regex help
by Anonymous Monk on Oct 26, 2004 at 14:39 UTC
    That would match B4๖:
    m/B(?:3[89]|4\d)/ and print for "B4\x{E56}"
    With the arrival of Unicode, it's wrong to use \d if you mean [0-9].
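
To make the difference concrete, here is a small sketch comparing the original pattern with two ASCII-only variants. The variable names are made up for illustration, and the /a modifier is an assumption about your Perl version: it needs 5.14 or later, so it postdates this thread.

```perl
use strict;
use warnings;

# "B4" followed by THAI DIGIT SIX (U+0E56), as in the example above
my $thai = "B4\x{E56}";

# \d uses Unicode rules on this string, so the Thai digit counts as a digit
my $unicode_d = $thai =~ /B(?:3[89]|4\d)/    ? 1 : 0;

# An explicit character class accepts only ASCII 0-9
my $explicit  = $thai =~ /B(?:3[89]|4[0-9])/ ? 1 : 0;

# /a (Perl 5.14+) restricts \d to ASCII without rewriting the class
my $ascii_mod = $thai =~ /B(?:3[89]|4\d)/a   ? 1 : 0;

print "\\d: $unicode_d, [0-9]: $explicit, \\d with /a: $ascii_mod\n";
# prints: \d: 1, [0-9]: 0, \d with /a: 0
```

Either ASCII-only form should do if you really mean [0-9]; the explicit class also works on Perls older than 5.14.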

      With the arrival of Unicode, it's wrong to use \d if you mean [0-9].

      This worries me.
      What happened when I wasn't paying attention?

      Will most of my code that's processing text containing digits break as soon as the input contains Unicode?

      Cheers, Sören

        That depends on how desperately you want to prevent people from typing in numerals in:
        Arabic
        Devanagari
        Bengali
        Gurmukhi
        Gujarati
        Oriya
        Tamil
        Telugu
        Kannada
        Malayalam
        Thai
        Lao
        Tibetan
        Myanmar
        Ethiopic
        Khmer
        Mongolian
        Limbu
        Chinese
        Japanese
        Korean
        Vietnamese
        
        But it's not like \d is going to start throwing exceptions merely because you feed it Unicode.
        I'd say it depends on what you mean by "breaks".
        Your code will still match [0-9] as you are used to, but will also match other characters defined as "digits" in other "scripts". If you depend on a "Latin" digit elsewhere in the code, I think you may have undesired side effects.
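
A rough sketch of the kind of side effect I mean, assuming the code validates with \d and then does arithmetic on the result. The variable names are invented for the example.

```perl
use strict;
use warnings;

# Thai digits four and six; a reader would see "46"
my $input = "\x{E54}\x{E56}";

# The validation passes, because \d accepts digits from any script
my $looks_numeric = $input =~ /^\d+$/ ? 1 : 0;

my $value;
{
    # Silence the "isn't numeric" warning so only the result is shown
    no warnings 'numeric';
    # Perl's string-to-number conversion only understands ASCII digits
    $value = $input + 0;
}

print "passed \\d check: $looks_numeric, numeric value: $value\n";
# prints: passed \d check: 1, numeric value: 0
```

So the regex does not break, but anything downstream that treats the matched text as a number quietly gets 0.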

        --
        Olivier

        No, \d will match digit characters in many languages (as TimToady mentioned). I think it's more accurate to say that it's wrong to mean [0-9], as letting people put in digits in whatever language they want is usually the right thing.
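
If you do accept digits from any script, you still need their numeric value at some point. One possible approach, and this is an assumption on my part since it relies on a Perl newer than this thread (5.14 or later), is the num() function from the core Unicode::UCD module. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Thai digits four and six
my $thai = "\x{E54}\x{E56}";

# num() returns the value of a run of decimal digits from one script
my $n = num($thai);

# Mixing scripts (ASCII 4 followed by Thai six) is rejected with undef
my $mixed = num("4\x{E56}");

print "thai: $n, mixed: ", (defined $mixed ? $mixed : "undef"), "\n";
# prints: thai: 46, mixed: undef
```

Rejecting mixed-script strings is deliberate on num()'s side, as far as I can tell from its documentation, since such strings are a common spoofing trick.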

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.