http://qs321.pair.com?node_id=1190933


in reply to \b in Unicode regex

G'day Arik123,

Two pieces of information, from perlrebackslash, to note.

From the "Character classes" section:

"\w is a character class that matches any single word character (letters, digits, Unicode marks, and connector punctuation (like the underscore))." [my emphasis]

From the "Assertions" section:

"\b ... matches at any place between a word (something matched by \w) and a non-word character" [my emphasis again]

In your reply with actual data, you're effectively trying to match "XXXXX", which occurs in your string as "_XXXXX.". Both '_' and 'X' match "\w", so there is no word/non-word transition between them and "\b" does not match between '_' and 'X'.
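
A minimal sketch of that point. The helper name has_bounded_match and both sample strings are made up for illustration (only the characters around the word are modelled on the substrings quoted in this thread); they are not taken from the original data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                               # this source contains literal Hebrew
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not from the original post): returns 1 if $word
# occurs in $string with a \b word boundary on both sides, else 0.
sub has_bounded_match {
    my ($string, $word) = @_;
    return $string =~ /\b\Q$word\E\b/ ? 1 : 0;
}

my $word = 'שפירא';

# Made-up sample strings; only the context around $word matters.
my @samples = (
    "ה_$word.mp3",      # preceded by '_', which is itself a \w character
    "abc $word, def",   # between a space and a comma, both non-word characters
);

for my $string (@samples) {
    printf "%s: %s\n", $string,
        has_bounded_match($string, $word) ? 'matches' : 'does not match';
}
```

The first sample should report no match, because '_' and the first Hebrew letter are both "\w" characters and so there is no boundary between them. The second should match, because a space and a comma are non-word characters. If the underscore ought to count as a separator for your data, one option (an assumption about what you want, not something from the documentation quoted above) is to replace the leading "\b" with a lookbehind such as (?<![^\W_]).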

As already demonstrated twice[1,2], there is no Unicode issue here.

— Ken

Re^2: \b in Unicode regex
by Arik123 (Beadle) on May 23, 2017 at 09:28 UTC

    The string I tried to match (that $_) is actually found twice in $string. The first time it's indeed preceded by _, but the second time it's between a space and a comma.

    Thank you all for your time, again.

      I was certain that I checked that before posting my reply; however, I went back and double-checked just now.

      שפירא

      occurs only once, in the substring

      ה_שפירא.mp3

      We can only comment on the data you show us.

      — Ken

        That's not really important, now that the issue is solved. However, that substring does indeed appear twice. I don't know if your browser works like mine, but if it does, then the substring you refer to occurs in the third line of the big string I posted, and the second occurrence is in the 7th and 8th lines (my browser prints a + sign at every line break; maybe that's what confused you).

        Again, thank you all Monks for your time and help.