PerlMonks  

\b in Unicode regex

by Arik123 (Beadle)
on May 22, 2017 at 06:45 UTC (#1190836=perlquestion)

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I do

$string =~ /$_/

and it matches. I do

$string =~ /\b$_\b/

and it doesn't match, for the same values of $string and $_. I think it should match, since there's a hyphen or a dot after $_ in $string, which I think should match \b. Both $string and $_ are Unicode. Could it be that \b doesn't function for Unicode strings?

Replies are listed 'Best First'.
Re: \b in Unicode regex
by shmem (Chancellor) on May 22, 2017 at 06:52 UTC

    Why don't you write down what $string and $_ contain, so we don't have to guess? See I know what I mean. Why don't you?

    $s = "hüh-hott";
    $t = "hüh";
    print "matches: '$&'\n" if $s =~ /$t/;
    print "matches too: '$&'\n" if $s =~ /\b$t\b/;
    __END__
    matches: 'hüh'
    matches too: 'hüh'
    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

      8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי

      and $_ is just

      שפירא

      (It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)

        \b works for me, even with Hebrew:
        #! /usr/bin/perl
        use warnings;
        use strict;
        use utf8;
        my $string = 'שָׁלוֹם';
        print $string =~ /\bש/, "\n";
        

        (I had to use <pre> instead of <code> to make UTF-8 work.)

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Given your strings, they match fine with or without \b:

        #!/usr/bin/perl -CS
        use HTML::Entities;
        my $string = decode_entities <DATA>;
        $_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
        print "matches: '$&'\n" if $string =~ /$_/;
        print "matches too: '$&'\n" if $string =~ /\b$_\b/;
        __DATA__
        8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;
        __END__

        Output:

        matches: 'שפירא'
        matches too: 'שפירא'
        

        So, no issue with \b and unicode regex here.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Thanks a lot, Monks.

        Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really decoded as UTF-8 (for some reason, my terminal insisted on printing it as UTF-8, though). utf8::decode solved the problem.
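For anyone who finds this thread later, here is a minimal sketch of how undecoded UTF-8 bytes can make \b fail while a plain match still succeeds. The sample text and the helper name bounded_match are made up for illustration; only utf8::decode comes from the post above, and the real data may have behaved differently.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $word occurs in $text with \b on both sides, else 0.
sub bounded_match {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# The Hebrew word from the thread, written as raw UTF-8 bytes (not decoded).
my $word_bytes = "\xD7\xA9\xD7\xA4\xD7\x99\xD7\xA8\xD7\x90";
# Made-up sample text: the word sits between a space and a comma.
my $text_bytes = "abc $word_bytes, xyz";

# On byte strings the high bytes are not word characters, so there is no
# word/non-word transition next to the space or the comma and \b cannot match.
my $plain_on_bytes   = $text_bytes =~ /\Q$word_bytes\E/ ? 1 : 0;
my $bounded_on_bytes = bounded_match($text_bytes, $word_bytes);

# utf8::decode turns the bytes into characters, in place.
my ($word, $text) = ($word_bytes, $text_bytes);
utf8::decode($word);
utf8::decode($text);
my $bounded_on_chars = bounded_match($text, $word);

print "plain match on bytes:  $plain_on_bytes\n";
print "\\b match on bytes:     $bounded_on_bytes\n";
print "\\b match after decode: $bounded_on_chars\n";
```

Once both strings are decoded, the Hebrew letters count as \w, so the boundaries next to the space and the comma exist again.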

Re: \b in Unicode regex
by kennethk (Abbot) on May 22, 2017 at 15:00 UTC
    Do you mean
    $string =~ /$_/; $string =~ /\b$_\b/;
    or do you really mean
    $string =~ /\Q$_\E/; $string =~ /\b\Q$_\E\b/;
    As soon as your variable contains metacharacters, they are not the same. See quotemeta, Quoting metacharacters.
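A small sketch of the difference, using a made-up needle that contains a dot (the variable names are only illustrative):

```perl
use strict;
use warnings;

# Made-up needle containing a regex metacharacter (the dot).
my $needle = 'a.c';

# Unquoted: the dot matches any character, so "abc" matches.
my $unquoted_hits_abc = 'abc' =~ /\b$needle\b/ ? 1 : 0;

# Quoted: \Q...\E makes the dot literal, so "abc" no longer matches.
my $quoted_hits_abc = 'abc' =~ /\b\Q$needle\E\b/ ? 1 : 0;

# The literal text itself still matches when quoted.
my $quoted_hits_literal = 'x a.c y' =~ /\b\Q$needle\E\b/ ? 1 : 0;

print "unquoted vs 'abc':   $unquoted_hits_abc\n";
print "quoted vs 'abc':     $quoted_hits_abc\n";
print "quoted vs 'x a.c y': $quoted_hits_literal\n";
```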

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: \b in Unicode regex
by kcott (Bishop) on May 23, 2017 at 06:56 UTC

    G'day Arik123,

    Two pieces of information, from perlrebackslash, to note.

    From the "Character classes" section:

    "\w is a character class that matches any single word character (letters, digits, Unicode marks, and connector punctuation (like the underscore))." [my emphasis]

    From the "Assertions" section:

    "\b ... matches at any place between a word (something matched by \w) and a non-word character" [my emphasis again]

    In your reply with actual data, you're effectively trying to match "XXXXX", which occurs in your string as "_XXXXX.". Both '_' and 'X' match "\w": "\b" does not match between '_' and 'X'.

    As already demonstrated twice[1,2], there is no Unicode issue here.
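A short sketch of the underscore behaviour, with 'bar' standing in for "XXXXX" and made-up surrounding text. The lookaround variant at the end is only one possible workaround and an assumption on my part, not something taken from perlrebackslash:

```perl
use strict;
use warnings;

my $word = 'bar';    # stand-in for the "XXXXX" above

# "_" is a word character, so there is no boundary between "_" and "b".
my $after_underscore = 'foo_bar.mp3' =~ /\b\Q$word\E\b/ ? 1 : 0;

# Between a space and a comma, both boundaries exist.
my $after_space = 'foo bar, baz' =~ /\b\Q$word\E\b/ ? 1 : 0;

# Possible alternative: treat only letters and digits as "word" characters,
# so that an underscore also counts as a delimiter.
my $alnum_bounded =
    'foo_bar.mp3' =~ /(?<![\p{L}\p{N}])\Q$word\E(?![\p{L}\p{N}])/ ? 1 : 0;

print "\\b after underscore:      $after_underscore\n";
print "\\b after space:           $after_space\n";
print "lookarounds at underscore: $alnum_bounded\n";
```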

    — Ken

      The string I tried to match (that $_) is actually found twice in $string. The first time it's indeed preceded by _, but the second time it's between a space and a comma.

      Thank you all for your time, again.

        I was certain that I checked that before posting my reply; however, I went back and double-checked just now.

        &#1513;&#1508;&#1497;&#1512;&#1488;

        occurs only once, in the substring

        &#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3

        We can only comment on the data you show us.

        — Ken

Node Type: perlquestion [id://1190836]
Approved by marto