PerlMonks  

Re: \b in Unicode regex

by shmem (Chancellor)
on May 22, 2017 at 06:52 UTC


in reply to \b in Unicode regex

Why don't you write down what $string and $_ contain, so we don't have to guess? See "I know what I mean. Why don't you?"

$s = "hüh-hott";
$t = "hüh";
print "matches: '$&'\n" if $s =~ /$t/;
print "matches too: '$&'\n" if $s =~ /\b$t\b/;
__END__
matches: 'hüh'
matches too: 'hüh'
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

Re^2: \b in Unicode regex
by Arik123 (Beadle) on May 22, 2017 at 07:30 UTC

    The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

    8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי

    and $_ is just

    שפירא

    (It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)

      \b works for me, even with Hebrew:
      #! /usr/bin/perl
      use warnings;
      use strict;
      use utf8;
      my $string = 'שָׁלוֹם';
      print $string =~ /\bש/, "\n";
      

      (I had to use <pre> instead of <code> to make UTF-8 work.)

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Given your strings, they match fine with or without \b:

      #!/usr/bin/perl -CS
      use HTML::Entities;
      my $string = decode_entities <DATA>;
      $_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
      print "matches: '$&'\n" if $string =~ /$_/;
      print "matches too: '$&'\n" if $string =~ /\b$_\b/;
      __DATA__
      8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;
      __END__

      Output:

      matches: 'שפירא'
      matches too: 'שפירא'
      

      So, no issue with \b and Unicode regexes here.


      Thanks a lot, Monks.

      Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really utf8 (for some reason, my terminal insisted on printing it as utf8, though). utf8::decode solved the problem.

        You actually had the opposite problem: You had UTF-8, but the regex engine expects a string of Unicode Code Points[1]. utf8::decode provides the latter from the former.


        1. More specifically, it's \w, \b, \d, etc. that are defined in terms of Unicode Code Points (UCP).
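
The difference described above can be reproduced in a few lines. This is only a minimal sketch, not code from the thread: matches_whole_word is a hypothetical helper, the sample text is made up around the word from the question, and utf8::encode is used purely to simulate input that arrived as undecoded UTF-8 bytes. It assumes a plain perl with no locale or unicode_strings settings in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if $word occurs
# in $text with a \b word boundary on both sides, 0 otherwise.
sub matches_whole_word {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# The word from the thread, built from its Unicode code points.
my $word = join '', map { chr($_) } 1513, 1508, 1497, 1512, 1488;
my $text = "mp3 $word, more text";    # made-up sample text

# Simulate input that was never decoded: the same data as UTF-8 bytes.
my ($raw_text, $raw_word) = ($text, $word);
utf8::encode($_) for $raw_text, $raw_word;

# On bytes, none of the high bytes count as \w, so \b finds no boundary
# between the space and the first byte of the word.
my $before = matches_whole_word($raw_text, $raw_word);

# utf8::decode turns the bytes into characters in place.
utf8::decode($_) for $raw_text, $raw_word;
my $after = matches_whole_word($raw_text, $raw_word);

print "bytes:      ", ($before ? "match" : "no match"), "\n";
print "characters: ", ($after  ? "match" : "no match"), "\n";
```

Run as written, this should print "no match" for the byte strings and "match" once both strings have been decoded. A plain /$word/ without \b would match in both cases, which is why the problem only shows up with \b.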
