\b in Unicode regex

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I do

$string =~ /$_/

and it matches. I do

$string =~ /\b$_\b/

and it doesn't match, for the same values of $string and $_. I think it should match, since there's a hyphen or a dot after $_ in $string, which I think should match \b. Both $string and $_ are Unicode. Could it be that \b doesn't function for Unicode strings?


Re: \b in Unicode regex by shmem (Chancellor) on May 22, 2017 at 06:52 UTC
Why don't you write down what `$string` and `$_` contain, so we don't have to guess? See "I know what I mean. Why don't you?"

    $s = "hüh-hott";
    $t = "hüh";
    print "matches: '$&'\n" if $s =~ /$t/;
    print "matches too: '$&'\n" if $s =~ /\b$t\b/;
    __END__
    matches: 'hüh'
    matches too: 'hüh'

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^2: \b in Unicode regex by Arik123 (Beadle) on May 22, 2017 at 07:30 UTC
The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

    8^1589-20170113-102647-ויחי-דב&#1512;י_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מ&#1504;שה גולד^ויחי-דברי הספד &#1506;ל הרב משה ש&#1508;ירא, טו' טבת, &#1514;שע'ז^שיעורי&#1501; בתנ"ך ובפר&#1513;ת השבוע|שיע&#1493;רים בפרשת השבוע|שיעור&#1497;ם קודמים|בר&#1488;שית|ויחי

and $_ is just

    שפירא

(It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)
Re^3: \b in Unicode regex by choroba (Cardinal) on May 22, 2017 at 07:41 UTC
\b works for me, even with Hebrew:

    #! /usr/bin/perl
    use warnings;
    use strict;
    use utf8;

    my $string = 'שָׁלוֹם';
    print $string =~ /\bש/, "\n";

(I had to use `<pre>` instead of `<code>` to make UTF-8 work.)

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^3: \b in Unicode regex by shmem (Chancellor) on May 22, 2017 at 08:55 UTC
Given your strings, they match fine with or without `\b`:

    #!/usr/bin/perl -CS
    use HTML::Entities;
    my $string = decode_entities <DATA>;
    $_ = decode_entities "שפירא";
    print "matches: '$&'\n" if $string =~ /$_/;
    print "matches too: '$&'\n" if $string =~ /\b$_\b/;
    __DATA__
    8^1589-20170113-102647-ויחי-דב&#1512;י_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מ&#1504;שה גולד^ויחי-דברי הספד &#1506;ל הרב משה ש&#1508;ירא, טו' טבת, &#1514;שע'ז^שיעורי&#1501; בתנ"ך ובפר&#1513;ת השבוע|שיע&#1493;רים בפרשת השבוע|שיעור&#1497;ם קודמים|בר&#1488;שית|ויחי
    __END__

Output:

    matches: 'שפירא'
    matches too: 'שפירא'

So, no issue with `\b` and Unicode regex here.
Re^3: \b in Unicode regex by Anonymous Monk on May 23, 2017 at 09:23 UTC
Thanks a lot, Monks. Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really decoded UTF-8 (for some reason, my terminal insisted on printing it as UTF-8, though). utf8::decode solved the problem.
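The fix described above can be reproduced with a small sketch. The sample text and the helper names `plain_match` and `boundary_match` are made up for illustration, and the script assumes Perl's default regex semantics (no `use utf8`, no `unicode_strings` feature). It suggests why a plain match can succeed on undecoded UTF-8 octets while `\b` fails: the individual high bytes are not word characters, so there is no word boundary next to them until the strings are decoded into characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data: the Hebrew word between a space and a comma,
# written as code points so the source file itself stays ASCII.
my $word_chars = "\x{5E9}\x{5E4}\x{5D9}\x{5E8}\x{5D0}";
my $text_chars = "abc $word_chars, xyz";

# The same data as undecoded UTF-8 octets, e.g. as read from a file
# or command line without any encoding layer.
my $word_bytes = encode('UTF-8', $word_chars);
my $text_bytes = encode('UTF-8', $text_chars);

# Illustrative helpers: match with and without word boundaries.
sub plain_match    { my ($text, $word) = @_; return $text =~ /\Q$word\E/     ? 1 : 0 }
sub boundary_match { my ($text, $word) = @_; return $text =~ /\b\Q$word\E\b/ ? 1 : 0 }

# On octets the byte sequence is found, but no byte of it counts as \w,
# so \b finds no boundary next to the space or the comma.
printf "octets:  plain=%d bounded=%d\n",
    plain_match($text_bytes, $word_bytes),
    boundary_match($text_bytes, $word_bytes);   # octets:  plain=1 bounded=0

# Decode both strings in place; now each Hebrew letter is one \w character.
my ($text_dec, $word_dec) = ($text_bytes, $word_bytes);
utf8::decode($text_dec);
utf8::decode($word_dec);

printf "decoded: plain=%d bounded=%d\n",
    plain_match($text_dec, $word_dec),
    boundary_match($text_dec, $word_dec);       # decoded: plain=1 bounded=1
```

Decoding at the point of input (for example with an `:encoding(UTF-8)` layer on the file handle) would likely be tidier than calling `utf8::decode` afterwards, but that is a design choice the thread does not discuss.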
Re^4: \b in Unicode regex by ikegami (Patriarch) on May 23, 2017 at 14:08 UTC
Re: \b in Unicode regex by kennethk (Abbot) on May 22, 2017 at 15:00 UTC
Do you mean

    $string =~ /$_/;
    $string =~ /\b$_\b/;

or do you really mean

    $string =~ /\Q$_\E/;
    $string =~ /\b\Q$_\E\b/;

As soon as your variable contains metacharacters, they are not the same. See quotemeta, Quoting metacharacters.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
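A minimal sketch of the difference, using made-up file names rather than the poster's data: the search term contains a dot, which is a regex metacharacter unless it is quoted with `\Q...\E`. The helper names `find_unquoted` and `find_quoted` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical data: the needle contains '.', a regex metacharacter.
my $string = 'a 01.mp3 b 01xmp3 c';
my $needle = '01.mp3';

# Interpolated as a pattern: '.' matches any character.
sub find_unquoted { my ($text, $pat) = @_; my @m = $text =~ /\b$pat\b/g;     return @m }

# Interpolated as a literal: \Q...\E escapes the metacharacters first.
sub find_quoted   { my ($text, $pat) = @_; my @m = $text =~ /\b\Q$pat\E\b/g; return @m }

my @loose  = find_unquoted($string, $needle);
my @strict = find_quoted($string, $needle);

print "unquoted: @loose\n";    # unquoted: 01.mp3 01xmp3
print "quoted:   @strict\n";   # quoted:   01.mp3
```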
Re: \b in Unicode regex by kcott (Archbishop) on May 23, 2017 at 06:56 UTC
G'day Arik123,

Two pieces of information, from perlrebackslash, to note.

From the "Character classes" section: "`\w` is a character class that matches any single word character (letters, digits, Unicode marks, and connector punctuation (like the underscore))." [my emphasis]

From the "Assertions" section: "`\b` ... matches at any place between a word (something matched by `\w`) and a non-word character" [my emphasis again]

In your reply with actual data, you're effectively trying to match "`XXXXX`", which occurs in your string as "`_XXXXX.`". Both '`_`' and '`X`' match "`\w`": "`\b`" does not match between '`_`' and '`X`'.

As already demonstrated twice, there is no Unicode issue here.

— Ken
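The underscore point can be checked with a short sketch. The two sample strings mimic the `_XXXXX.` case and a "space before, comma after" case; they are written as code points and are not the poster's exact data. The lookaround variant is only one possible workaround when an underscore should count as a separator. It is an assumption added here, not something proposed in the thread, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# The search word as code points (shin, pe, yod, resh, alef).
my $needle = "\x{5E9}\x{5E4}\x{5D9}\x{5E8}\x{5D0}";

# Hypothetical contexts: after an underscore, and between a space and a comma.
my $after_underscore = "\x{5DE}\x{5E9}\x{5D4}_" . $needle . ".mp3";
my $after_space      = "\x{5DE}\x{5E9}\x{5D4} " . $needle . ", abc";

# Plain word-boundary match: '_' is a \w character, so there is no \b after it.
sub bounded {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# Workaround sketch: require that the neighbours are not letters, digits or
# marks, while allowing '_' ([^\W_] means "a word character other than '_'").
sub bounded_underscore_ok {
    my ($text, $word) = @_;
    return $text =~ /(?<![^\W_])\Q$word\E(?![^\W_])/ ? 1 : 0;
}

printf "after '_'  with \\b:          %d\n", bounded($after_underscore, $needle);               # 0
printf "after ' '  with \\b:          %d\n", bounded($after_space, $needle);                    # 1
printf "after '_'  with lookarounds: %d\n", bounded_underscore_ok($after_underscore, $needle); # 1
```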
Re^2: \b in Unicode regex by Arik123 (Beadle) on May 23, 2017 at 09:28 UTC
The string I tried to match (that $_) is actually found twice in $string. The first time it's indeed preceded by _, but the second time it's between a space and a comma. Thank you all for your time, again.
Re^3: \b in Unicode regex by kcott (Archbishop) on May 24, 2017 at 04:51 UTC
I was certain that I checked that before posting my reply; however, I went back and double-checked just now. `שפירא` occurs only once, in the substring `ה_שפירא.mp3`. We can only comment on the data you show us.

— Ken
Re^4: \b in Unicode regex by Arik123 (Beadle) on May 24, 2017 at 06:11 UTC
