Transform ASCII into UniCode

Perlian has asked for the wisdom of the Perl Monks concerning the following question:

Re: Transform ASCII into UniCode by choroba (Cardinal) on Mar 23, 2021 at 08:02 UTC
If you want to use `tr` with dynamic strings (which is NOT the case here), you need to use string eval. Be sure to only use it for validated strings, never random user input!

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use utf8;
    use open OUT => ':encoding(UTF-8)', ':std';

    my $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    my $boldset = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';
    my $source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
    my $target = $source;
    eval "\$target =~ tr/$charset/$boldset/";
    say for $source, $target;
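For readers who do need dynamic character sets but would rather avoid string eval altogether, a hash lookup is one possible alternative. The following is only a sketch under stated assumptions: `make_translator` is a hypothetical helper name, not something from this thread, and it handles plain character lists only (no `tr`-style ranges such as `a-z`).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use feature 'say';
use open qw(:std :encoding(UTF-8));

# Hypothetical helper (assumption, not from the thread): returns a closure
# that maps each character of $from to the character at the same position
# in $to. No eval is involved, so the sets may come from untrusted input.
sub make_translator {
    my ($from, $to) = @_;
    my @from = split //, $from;
    my @to   = split //, $to;
    die "Character sets differ in length\n" unless @from == @to;

    my %map;
    @map{@from} = @to;

    return sub {
        my ($text) = @_;
        # Characters without an entry in %map pass through unchanged.
        return join '', map { $map{$_} // $_ } split //, $text;
    };
}

# Build both sets at run time; the bold set is generated from code points.
my $charset = join '', 'a' .. 'z', 'A' .. 'Z', 0 .. 9;
my $boldset = join '', map { chr $_ }
    0x1D41A .. 0x1D433,    # MATHEMATICAL BOLD SMALL A .. Z
    0x1D400 .. 0x1D419,    # MATHEMATICAL BOLD CAPITAL A .. Z
    0x1D7CE .. 0x1D7D7;    # MATHEMATICAL BOLD DIGIT ZERO .. NINE

my $to_bold = make_translator($charset, $boldset);
say $to_bold->('The quick brown fox jumps over the lazy dog 1234567890 times.');
```

Because the sets are never compiled as code, characters such as `/` or `\` in them need no escaping. The trade-off is that this is likely slower than a compiled `tr` for long inputs.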
Re^2: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 16:43 UTC
> Be sure to only use it for validated strings, never a random user input!

Here is a generic routine to escape only selected meta-characters. Escaping any / (or other delimiter) in the input should make it safe to apply `eval "\$target =~ tr/$charset/$boldset/";`

    use v5.12;
    use warnings;
    use Data::Dump qw/pp dd/;
    use Test::More;

    sub escape_metas {
        my ( $meta, $e ) = @_;
        $e //= '\\';          # default backslash
        my $ee = "\Q$e";      # don't mess my regex
        s[
            (?|
                $ee($ee)      # ignore double escapes
              | $ee($meta)    # keep single escapes
              | ($meta)       # escape meta
            )
         ]
         [$e$1]xgr;
    }

    my $e = '\\';   # escape code
    my $m = '/';    # to be escaped

    for ("$m", "$e$e$m", "$e$e$e$e$m" ) {
        my $got = escape_metas($m,$e);
        is( $got, "$e$_" , "escaping $_ -> $got");
    }

    for ("$e$m", "$e$e$e$m" ) {
        my $got = escape_metas($m,$e);
        is( $got, $_ , "ignoring $_ eq $got");
    }

    done_testing;

Output:

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/escapism.pl
    ok 1 - escaping / -> \/
    ok 2 - escaping \\/ -> \\\/
    ok 3 - escaping \\\\/ -> \\\\\/
    ok 4 - ignoring \/ eq \/
    ok 5 - ignoring \\\/ eq \\\/
    1..5

Please tell me if I missed a case; I tried to write it as generically as possible.

EDIT: More or better tests are welcome too. =)

Cheers Rolf
Re^3: Transform ASCII into UniCode (escape_metas) by choroba (Cardinal) on Mar 23, 2021 at 17:24 UTC
I'm probably too busy today to understand. We wanted to escape the strings so they can be used in a transliteration, right? Why not test it directly, then?

    sub use_it {
        my ($string, $search, $replace) = @_;
        my ($s, $r);
        $s = escape_metas('/', '\\') for $search;
        $r = escape_metas('/', '\\') for $replace;
        return eval "\$string =~ tr/$s/$r/r"
    }

    sub cheat {
        my ($string, $search, $replace) = @_;
        return eval "\$string =~ tr|\Q$search\E|\Q$replace\E|r"
    }

    sub simulate {
        my ($string, $search, $replace) = @_;
        my $result = $string;
        for my $i (0 .. length($search) - 1) {
            my $from = substr $search, $i, 1;
            my $to   = substr $replace, $i, 1;
            $result =~ s/\Q$from/$to/g;
        }
        return $result
    }

    for my $case (
        # String        search   replace  expect
        ['a/b'      => 'a/b',   'xyz',   'xyz'],
        ['a\\b'     => 'a\\b',  'xyz',   'xyz'],
        ['a/b'      => '\\/',   'xy',    'ayb'],
        ['a\\/b'    => '\\/',   'xy',    'axyb'],
        ['a/\\b'    => '\\/',   'xy',    'ayxb'],
        ['a\\\\b'   => '\\/',   'xy',    'axxb'],
        ['a\\\\/b'  => '\\/',   'xy',    'axxyb'],
    ) {
        is simulate(@$case), $case->[-1], 'simulate';
        is cheat(@$case), simulate(@$case), 'cheat';
        is use_it(@$case), simulate(@$case), 'use';
    }

I'm not sure I got the "expect" right, but both "simulate" and "cheat" give the same results. "use", on the other hand, doesn't. I based it on your escape_metas - what did I do wrong?
Re^4: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 18:18 UTC
Re^5: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 19:08 UTC
Some notes below your chosen depth have not been shown here
Re: Transform ASCII into UniCode by BillKSmith (Monsignor) on Mar 23, 2021 at 03:32 UTC
From the documentation of tr:

> Characters may be literals, or (if the delimiters aren't single quotes) any of the escape sequences accepted in double-quoted strings. But there is never any variable interpolation, so "$" and "@" are always treated as literals.

Bill
Re: Transform ASCII into UniCode by GrandFather (Saint) on Mar 23, 2021 at 04:05 UTC
A comment rather than an answer. Consider:

    use strict;
    use warnings;
    use Encode;

    binmode STDOUT, 'utf8';   # Suppress "wide character" warnings

    my $CharSet = 'a';                  # ASCII
    my $BoldSet = pack('U', 119834);    # Unicode bold 'a'
    my $Source  = 'a';
    my $trTarget = $Source;
    my $reTarget = $Source;

    $trTarget =~ tr/$CharSet/$BoldSet/;
    $reTarget =~ s/$CharSet/$BoldSet/;
    print "$Source\n$trTarget\n$reTarget\n";
    print $BoldSet;

Prints:

    a
    l
    𝐚
    𝐚

It seems tr/// isn't the right tool for the job. :-(

Update: PerlMonks is screwing up the unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.
Re^2: Transform ASCII into UniCode by Anonymous Monk on Mar 23, 2021 at 09:00 UTC
> Update: PerlMonks is screwing up the unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.

Perlmonks doesn't unicode, perlmonks does windows-1252, your browser does conversion to windows-1252 ... and at some point html entities are used ...
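To illustrate what "shown as code points" means, here is a rough sketch of a round trip between characters and decimal HTML entities. It is not a description of what PerlMonks or any browser actually does internally; `to_entities` and `from_entities` are hypothetical names used only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helpers (assumptions, not PerlMonks code).

# Replace every non-ASCII character with a decimal numeric entity.
sub to_entities {
    my ($text) = @_;
    return $text =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord $1/egr;
}

# Turn decimal numeric entities back into characters.
sub from_entities {
    my ($text) = @_;
    return $text =~ s/&#(\d+);/chr $1/egr;
}

my $bold_a  = chr 119834;              # MATHEMATICAL BOLD SMALL A
my $encoded = to_entities("x${bold_a}y");
say $encoded;                          # pure ASCII, safe for any output encoding
say from_entities($encoded) eq "x${bold_a}y" ? 'round trip ok' : 'round trip failed';
```

A helper like `from_entities` may be handy for recovering code that was pasted from a page where the characters were stored as entities.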
Re: Transform ASCII into UniCode by kcott (Archbishop) on Mar 24, 2021 at 19:48 UTC
G'day Perlian,

Here's a generic technique for dealing with this type of problem which doesn't require listing every character.

    $ perl -Mutf8 -C -E '
        my ($offset_0, $offset_A, $offset_a)
            = (ord("𝟎")-ord("0"), ord("𝐀")-ord("A"), ord("𝐚")-ord("a"));
        say "The quick brown fox jumps over the lazy dog 1234567890 times."
            =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
            =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
            =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
    '
    𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.

This should work fine with your `5.26.3` (I'm using `5.32.0`). As general information: `say` requires `5.10` and `/r` requires `5.14`.

Two caveats:

- Different Perl versions support different Unicode® versions: check you have a sufficiently high version of Perl to handle the Unicode characters you want to output (if in doubt, check the deltas).
- Some alphabetical sequences in "Mathematical Alphanumeric Symbols" have missing characters because they were defined in earlier versions. The first example in that block is `U+1D44E` (`𝑎`) to `U+1D467` (`𝑧`) which has `U+1D455` (`<reserved>`) because `U+210E` (`ℎ`) was already defined in "Letterlike Symbols" as `PLANCK CONSTANT`.

Here's another example to show the generality of the solution. Only three characters were changed in the code to produce completely different output.

    $ perl -Mutf8 -C -E '
        my ($offset_0, $offset_A, $offset_a)
            = (ord("𝟘")-ord("0"), ord("𝕬")-ord("A"), ord("𝖆")-ord("a"));
        say "The quick brown fox jumps over the lazy dog 1234567890 times."
            =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
            =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
            =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
    '
    𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌 𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟘 𝖙𝖎𝖒𝖊𝖘.

-- Ken
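The second caveat can be handled with a small exception table on top of the offset technique. The sketch below targets the mathematical italic letters, where the slot for italic small h (`U+1D455`) is reserved and `U+210E` has to be used instead. The name `to_math_italic` and the `%exception` hash are assumptions made for this example. Digits are left alone because that block has no italic digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use open qw(:std :encoding(UTF-8));

# Letters whose "natural" slot in the italic range is reserved.
# Only small h is listed here, as described in the caveat above.
my %exception = ( h => "\x{210E}" );    # PLANCK CONSTANT

my $offset_A = 0x1D434 - ord('A');      # MATHEMATICAL ITALIC CAPITAL A
my $offset_a = 0x1D44E - ord('a');      # MATHEMATICAL ITALIC SMALL A

# Hypothetical function name; maps ASCII letters to mathematical italic.
sub to_math_italic {
    my ($text) = @_;
    return $text =~ s{([A-Za-z])}{
        $exception{$1}
            // chr( ord($1) + ( ord($1) < ord('a') ? $offset_A : $offset_a ) )
    }egr;
}

say to_math_italic('The quick brown fox jumps over the lazy dog.');
```

Other styles in the same block have their own reserved slots, so the exception table would need to be filled in per style from the Unicode code charts.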
Re: Transform ASCII into UniCode by Perlian (Initiate) on Mar 23, 2021 at 21:18 UTC
Thank you very much for all your answers. @choroba had the correct point: tr takes only literals for both character sets. Yes, there are ways around that by using the 'evil' eval, but that is just not necessary in my case: I just want to write a little function that accepts an ASCII string and returns a "bold" version of it. And yes, my terminal (MobaXterm) is capable of displaying a pretty good chunk of the Unicode charset, including the pseudo-bold or -italic block.

Again, thank you all for guiding me back to the path of truth! 😋

Best regards from Charleston (WV),
Perlian
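A minimal sketch of such a function, assuming fixed character sets so that a literal `tr` with ranges is enough and no eval is needed. The function name `bold` is an assumption; `/r` requires Perl 5.14 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the source contains literal bold characters
use feature 'say';
use open qw(:std :encoding(UTF-8));

# Hypothetical function name: returns a copy of the string with ASCII
# letters and digits replaced by their mathematical bold counterparts.
# Both lists are literals, so tr needs no interpolation.
sub bold {
    my ($ascii) = @_;
    return $ascii =~ tr/a-zA-Z0-9/𝐚-𝐳𝐀-𝐙𝟎-𝟗/r;
}

say bold('The quick brown fox jumps over the lazy dog 1234567890 times.');
```

The ranges work here because the bold letters and digits occupy consecutive code points; styles with reserved slots would need the missing characters listed explicitly.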
Re: Transform ASCII into UniCode by Polyglot (Chaplain) on Mar 23, 2021 at 03:52 UTC
    use utf8;
    use Encode qw(encode decode);

    my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
    my $BoldSet = encode('utf8','𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗');
    my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
    my $Target = $Source;

    $Target =~ tr/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗/;

    print "$Source\n$Target\n";

    #The quick brown fox jumps over the lazy dog 1234567890 times.
    #𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.

Blessings,
~Polyglot~

