Re: \b in Unicode regex
by shmem (Chancellor) on May 22, 2017 at 06:52 UTC
|
$s = "hüh-hott"; $t = "hüh";
print "matches: '$&'\n" if $s =~ /$t/;
print "matches too: '$&'\n" if $s =~ /\b$t\b/;
__END__
matches: 'hüh'
matches too: 'hüh'
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
|
8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי
and $_ is just
שפירא
(It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even show just the Unicode numbers instead of the characters themselves. My browser makes a mess here, which is why I didn't think posting the strings would help.)
|
\b works for me, even with Hebrew:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
my $string = 'שָׁלוֹם';
print $string =~ /\bש/, "\n";
(I had to use <pre> instead of <code> to make UTF-8 work.)
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
|
#!/usr/bin/perl -CS
use HTML::Entities;
my $string = decode_entities <DATA>;
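$_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
print "matches: '$&'\n" if $string =~ /$_/;
print "matches too: '$&'\n" if $string =~ /\b$_\b/;
__DATA__
8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;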
__END__
Output:
matches: 'שפירא'
matches too: 'שפירא'
So, no issue with \b and unicode regex here.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
|
Thanks a lot, Monks.
Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really utf8 as far as Perl was concerned (for some reason, my terminal insisted on printing it as utf8, though). utf8::decode solved the problem.
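For anyone landing here with the same symptom, here is a minimal sketch of that effect. It assumes the string arrived as raw UTF-8 octets (say, from a file read without a decoding layer); the variable names and the sample haystack are made up for illustration and are not the original data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: $bytes stands in for a string that was never decoded,
# i.e. it holds the UTF-8 octets of the five Hebrew letters, not characters.
my $bytes    = "\xD7\xA9\xD7\xA4\xD7\x99\xD7\xA8\xD7\x90";
my $haystack = " $bytes,";

# On undecoded octets the high bytes are not word characters,
# so \b finds no word boundary next to them.
my $before = $haystack =~ /\b$bytes\b/ ? 1 : 0;

# utf8::decode turns the octets into characters, in place.
utf8::decode($_) for $bytes, $haystack;

# Now the Hebrew letters count as \w and \b matches around the word.
my $after = $haystack =~ /\b$bytes\b/ ? 1 : 0;

print "before decode: $before, after decode: $after\n";
# prints: before decode: 0, after decode: 1
```

The plain /$bytes/ match succeeds in both cases, which is probably why the problem only showed up once \b was added.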
|
Re: \b in Unicode regex
by kennethk (Abbot) on May 22, 2017 at 15:00 UTC
|
$string =~ /$_/;
$string =~ /\b$_\b/;
or do you really mean
$string =~ /\Q$_\E/;
$string =~ /\b\Q$_\E\b/;
As soon as your variable contains metacharacters, the two forms are not the same. See quotemeta and Quoting metacharacters.
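A small sketch of the difference, using a made-up needle and haystack (neither is from the original post):

```perl
use strict;
use warnings;

# Hypothetical needle containing a regex metacharacter.
my $needle   = 'a.c';
my $haystack = 'abc a.c';

# Interpolated as-is, '.' is a wildcard and matches the 'b' in 'abc'.
my ($raw) = $haystack =~ /($needle)/;

# Wrapped in \Q...\E, '.' is a literal dot, so only 'a.c' can match.
my ($quoted) = $haystack =~ /(\Q$needle\E)/;

print "unquoted matched: '$raw'\n";      # unquoted matched: 'abc'
print "quoted matched:   '$quoted'\n";   # quoted matched:   'a.c'
```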
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re: \b in Unicode regex
by kcott (Archbishop) on May 23, 2017 at 06:56 UTC
|
G'day Arik123,
Two pieces of information, from perlrebackslash, to note.
From the "Character classes" section:
"\w is a character class that matches any single word character (letters, digits, Unicode marks, and connector punctuation (like the underscore))."
[my emphasis]
From the "Assertions" section:
"\b ... matches at any place between a word (something matched by \w) and a non-word character"
[my emphasis again]
In your reply with actual data, you're effectively trying to match "XXXXX",
which occurs in your string as "_XXXXX.".
Both '_' and 'X' match "\w": "\b" does not match between '_' and 'X'.
As already demonstrated twice[1,2],
there is no Unicode issue here.
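A quick way to see this for yourself is sketched below. The two sample strings are hypothetical stand-ins for the "preceded by an underscore" and "preceded by a space" cases; the search term is written with code point escapes so the script does not depend on the file's encoding.

```perl
use strict;
use warnings;

# The five Hebrew letters of the search term, as code point escapes.
my $word = "\x{5E9}\x{5E4}\x{5D9}\x{5E8}\x{5D0}";

# Hypothetical samples: one with '_' in front of the word, one with a space.
my %sample = (
    after_underscore => "_${word}.mp3",
    after_space      => " ${word},",
);

my %result;
for my $name (sort keys %sample) {
    # '_' is matched by \w, so there is no \b between '_' and a letter.
    $result{$name} = $sample{$name} =~ /\b$word\b/ ? 'match' : 'no match';
    print "$name: $result{$name}\n";
}
# prints:
# after_space: match
# after_underscore: no match
```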
|
The string I tried to match (that $_) actually occurs twice in $string. The first time it is indeed preceded by _, but the second time it sits between a space and a comma.
Thank you all for your time, again.
|
שפירא
occurs only once, in the substring
ה_שפירא.mp3
We can only comment on the data you show us.