Perlian has asked for the wisdom of the Perl Monks concerning the following question:
Hi Friends,
as you may know, there is a Unicode block »Mathematical Alphanumeric Symbols« (U+1D400..U+1D7FF) containing styled letters and digits that look like normal characters from the Latin alphabet, just styled in bold or italic.
Now I tried to use a simple transliteration to turn some normal text into "bold" Unicode text and, naive as I am, I did this:
my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
my $BoldSet = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗'; # Unicode bold
my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $Target = $Source;
$Target =~ tr/$CharSet/$BoldSet/;
print "$Source\n$Target\n";
To my surprise, the output was this:
The quick brown fox jumps over the lazy dog 1234567890 times.
Toe quick bdown fox jumps oved toe llzy dog 1234567890 times.
No trace of bold Unicode characters, but some characters garbled.
Does "tr" not work correctly with Unicode?
I have a »use utf8::all;« in my program and I am using this perl version:
This is perl 5, version 26, subversion 3 (v5.26.3) built for x86_64-linux-thread-multi
(with 51 registered patches, see perl -V for more detail)
Thank you very much in advance for your help.
Best regards from Charleston (WV),
Perlian
Re: Transform ASCII into UniCode
by choroba (Cardinal) on Mar 23, 2021 at 08:02 UTC
If you want to use tr with dynamic strings (which is NOT the case here), you need to use string eval. Be sure to only use it for validated strings, never a random user input!
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';
my $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
my $boldset = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';
my $source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $target = $source;
eval "\$target =~ tr/$charset/$boldset/";
say for $source, $target;
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
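One possible way to follow the "validated strings only" advice is to reject any set that contains something other than word characters before building the eval string. This is a minimal sketch that assumes such a restriction is acceptable; tr_dynamic is a hypothetical helper name, not part of the code above.

```perl
use v5.16;      # enables strict, say, and the unicode_eval feature
use warnings;

# Hypothetical helper: transliterate $string with sets known only at run time.
sub tr_dynamic {
    my ($string, $from, $to) = @_;

    # Assumption: sets made only of \w characters are harmless inside tr///,
    # because that excludes the delimiter "/", the range operator "-" and "\".
    for my $set ($from, $to) {
        die "Unsafe character set\n" unless $set =~ /\A\w+\z/;
    }

    # $string is escaped so the eval sees the variable, not its contents.
    my $result = eval "\$string =~ tr/$from/$to/r";
    die "Transliteration failed: $@" unless defined $result;
    return $result;
}

binmode STDOUT, ':encoding(UTF-8)';
say tr_dynamic('abc cab', 'abc', "\x{1D41A}\x{1D41B}\x{1D41C}");
```

A whitelist like this is much stricter than escaping individual metacharacters, but it is also easier to reason about.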
> Be sure to only use it for validated strings, never a random user input!
Here is a generic routine that escapes only selected metacharacters.
Escaping every / (or whichever delimiter is used) in the input should make it safe to apply
eval "\$target =~ tr/$charset/$boldset/";
use v5.12;
use warnings;
use Data::Dump qw/pp dd/;
use Test::More;
sub escape_metas {
    my ( $meta, $e ) = @_;
    $e //= '\\';          # default backslash
    my $ee = "\Q$e";      # don't mess up my regex
    s[ (?|
           $ee($ee)       # ignore double escapes
         |
           $ee($meta)     # keep single escapes
         |
           ($meta)        # escape meta
       )
     ]
     [$e$1]xgr;
}
my $e = '\\'; # escape code
my $m = '/'; # to be escaped
for ("$m", "$e$e$m", "$e$e$e$e$m" ) {
    my $got = escape_metas($m,$e);
    is( $got, "$e$_" , "escaping $_ -> $got");
}
for ("$e$m", "$e$e$e$m" ) {
    my $got = escape_metas($m,$e);
    is( $got, $_ , "ignoring $_ eq $got");
}
done_testing;
C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/escapism.pl
ok 1 - escaping / -> \/
ok 2 - escaping \\/ -> \\\/
ok 3 - escaping \\\\/ -> \\\\\/
ok 4 - ignoring \/ eq \/
ok 5 - ignoring \\\/ eq \\\/
1..5
Please tell me if I missed a case; I tried to write it as generically as possible.
EDIT
More or better tests are welcome too. =)
I'm probably too busy today to understand. We wanted to escape the strings so they can be used in a transliteration, right? Why not test it directly, then?
sub use_it {
    my ($string, $search, $replace) = @_;
    my ($s, $r);
    $s = escape_metas('/', '\\') for $search;
    $r = escape_metas('/', '\\') for $replace;
    return eval "\$string =~ tr/$s/$r/r"
}
sub cheat {
    my ($string, $search, $replace) = @_;
    return eval "\$string =~ tr|\Q$search\E|\Q$replace\E|r"
}
sub simulate {
    my ($string, $search, $replace) = @_;
    my $result = $string;
    for my $i (0 .. length($search) - 1) {
        my $from = substr $search, $i, 1;
        my $to = substr $replace, $i, 1;
        $result =~ s/\Q$from/$to/g;
    }
    return $result
}
for my $case (
    # String      search   replace  expect
    ['a/b'     => 'a/b',   'xyz',   'xyz'],
    ['a\\b'    => 'a\\b',  'xyz',   'xyz'],
    ['a/b'     => '\\/',   'xy',    'ayb'],
    ['a\\/b'   => '\\/',   'xy',    'axyb'],
    ['a/\\b'   => '\\/',   'xy',    'ayxb'],
    ['a\\\\b'  => '\\/',   'xy',    'axxb'],
    ['a\\\\/b' => '\\/',   'xy',    'axxyb'],
) {
    is simulate(@$case), $case->[-1], 'simulate';
    is cheat(@$case), simulate(@$case), 'cheat';
    is use_it(@$case), simulate(@$case), 'use';
}
I'm not sure I got the "expect" right, but both "simulate" and "cheat" give the same results. "use", on the other hand, doesn't. I based it on your escape_metas - what did I do wrong?
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Transform ASCII into UniCode
by BillKSmith (Monsignor) on Mar 23, 2021 at 03:32 UTC
From the documentation of tr:
Characters may be literals, or (if the delimiters aren't single quotes) any of the escape sequences accepted in double-quoted strings. But there is never any variable interpolation, so "$" and "@" are always treated as literals.
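Applied to the original script, this explains the garbled output: with no interpolation, the search list is the eight literal characters $CharSet and the replacement list is the eight literal characters $BoldSet. A small sketch of that reading (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = 'The quick brown fox jumps over the lazy dog';

# No interpolation happens here: tr/// pairs up the literal characters
#   $ C h a r S e t
#   $ B o l d S e t
# so only C->B, h->o, a->l and r->d change anything.
(my $garbled = $text) =~ tr/$CharSet/$BoldSet/;

print "$garbled\n";   # Toe quick bdown fox jumps oved toe llzy dog
```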
Re: Transform ASCII into UniCode
by GrandFather (Saint) on Mar 23, 2021 at 04:05 UTC
use strict;
use warnings;
use Encode;
binmode *STDOUT, 'utf8'; # Suppress "wide character" warnings
my $CharSet = 'a'; # ASCII
my $BoldSet = pack('U', 119834); # Unicode bold 'a'
my $Source = 'a';
my $trTarget = $Source;
my $reTarget = $Source;
$trTarget =~ tr/$CharSet/$BoldSet/;
$reTarget =~ s/$CharSet/$BoldSet/;
print "$Source\n$trTarget\n$reTarget\n";
print $BoldSet;
Prints:
a
l
𝐚
𝐚
It seems tr/// isn't the right tool for the job. :-(
Update: PerlMonks is screwing up the Unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.
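If the two character sets really are only known at run time, a lookup hash is one eval-free alternative to tr///. This is a rough sketch that assumes both sets have the same length; make_translator is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build a closure that maps each character of $from
# to the character at the same position in $to.
sub make_translator {
    my ($from, $to) = @_;
    my @from = split //, $from;
    my @to   = split //, $to;
    die "Sets differ in length\n" unless @from == @to;

    my %map;
    @map{@from} = @to;

    # Characters without an entry pass through unchanged.
    return sub { join '', map { $map{$_} // $_ } split //, shift };
}

binmode STDOUT, ':encoding(UTF-8)';
my $bold_a = make_translator('a', "\x{1D41A}");   # U+1D41A is bold "a"
print $bold_a->('a banana'), "\n";
```

Unlike tr///, this sketch does not support ranges or sets of unequal length; it only covers the one-to-one case discussed in this thread.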
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Transform ASCII into UniCode
by kcott (Archbishop) on Mar 24, 2021 at 19:48 UTC
G'day Perlian,
Here's a generic technique for dealing with this type of problem which doesn't require listing every character.
$ perl -Mutf8 -C -E '
my ($offset_0, $offset_A, $offset_a)
= (ord("𝟎")-ord("0"), ord("𝐀")-ord("A"), ord("𝐚")-ord("a"));
say "The quick brown fox jumps over the lazy dog 1234567890 times."
=~ s/([0-9])/chr(ord($1)+$offset_0)/egr
=~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
=~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
'
𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.
This should work fine with your 5.26.3 (I'm using 5.32.0).
As general information: say requires 5.10 and /r requires 5.14.
Two caveats:
- Different Perl versions support different Unicode® versions: check you have a sufficiently high version of Perl to handle the Unicode characters you want to output (if in doubt, check the deltas).
- Some alphabetical sequences in "Mathematical Alphanumeric Symbols" have missing characters because they were defined in earlier versions. The first example in that block is U+1D44E (𝑎) to U+1D467 (𝑧), which has a gap at U+1D455 (<reserved>) because U+210E (ℎ) was already defined in "Letterlike Symbols" as PLANCK CONSTANT.
Here's another example to show the generality of the solution.
Only three characters were changed in the code to produce completely different output.
$ perl -Mutf8 -C -E '
my ($offset_0, $offset_A, $offset_a)
= (ord("𝟘")-ord("0"), ord("𝕬")-ord("A"), ord("𝖆")-ord("a"));
say "The quick brown fox jumps over the lazy dog 1234567890 times."
=~ s/([0-9])/chr(ord($1)+$offset_0)/egr
=~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
=~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
'
𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌 𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟘 𝖙𝖎𝖒𝖊𝖘.
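For a sequence with a hole, such as the italic letters mentioned in the second caveat, the offset technique needs an exception table. This is a sketch under that assumption; to_math_italic and %exception are illustrative names, and only the single exception cited above is listed.

```perl
use strict;
use warnings;

my $offset_A = 0x1D434 - ord 'A';   # MATHEMATICAL ITALIC CAPITAL A
my $offset_a = 0x1D44E - ord 'a';   # MATHEMATICAL ITALIC SMALL A

# U+1D455 is reserved; italic "h" lives at U+210E (PLANCK CONSTANT).
my %exception = ( h => "\x{210E}" );

sub to_math_italic {
    my ($text) = @_;
    # Use the exception if there is one, otherwise add the case-specific offset.
    $text =~ s{([A-Za-z])}{
        $exception{$1}
            // chr( ord($1) + ( $1 lt 'a' ? $offset_A : $offset_a ) )
    }ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_math_italic('The quick brown fox'), "\n";
```

Digits are left untouched here because the sketch only handles letters.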
Re: Transform ASCII into UniCode
by Perlian (Initiate) on Mar 23, 2021 at 21:18 UTC
Thank you very much for all your answers, @choroba had the correct point: tr takes only literals for both character sets.
Yes, there are ways around that by using the `evil' eval, but that is just not necessary in my case:
I just want to write a little function that accepts an ASCII string and returns a "bold" version of it.
And yes, my terminal (MobaXterm) is capable of displaying a pretty good chunk of the Unicode character set, including the pseudo-bold or -italic block.
Again, thank you all for guiding me back to the path of truth! 😋
Best regards from Charleston (WV),
Perlian
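Such a function could look roughly like this. It is a sketch, not Perlian's actual code: bold is a made-up name, and it relies on tr/// accepting ranges and \x{...} escapes as long as they are written literally in the source.

```perl
use strict;
use warnings;

# Sketch: map ASCII letters and digits onto the bold ranges of
# "Mathematical Alphanumeric Symbols"; everything else is left alone.
# The /r flag (Perl 5.14+) returns the result instead of modifying $text.
sub bold {
    my ($text) = @_;
    return $text =~ tr/a-zA-Z0-9/\x{1D41A}-\x{1D433}\x{1D400}-\x{1D419}\x{1D7CE}-\x{1D7D7}/r;
}

binmode STDOUT, ':encoding(UTF-8)';
print bold('The quick brown fox jumps over the lazy dog 1234567890 times.'), "\n";
```

Because both lists are literals known at compile time, no eval is involved.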
Re: Transform ASCII into UniCode
by Polyglot (Chaplain) on Mar 23, 2021 at 03:52 UTC
use utf8;
use Encode qw(encode decode);
my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
my $BoldSet = encode('utf8','𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗');
my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $Target = $Source;
$Target =~ tr/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗/;
print "$Source\n$Target\n";
#The quick brown fox jumps over the lazy dog 1234567890 times.
#𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.