I've received some old code which was running on Perl 5.8 but stopped working on 5.10. (Both versions are old, so Unicode support is limited in any case, i.e., no use feature ':5.12'; in the preamble.)
This code should substitute every non-ASCII Unicode character in received (very long) UTF-8 strings with an escape listed in a separate file.
This 'dictionary' file is sourced into a hash where the keys are the UTF-8 characters and the values are the escaped strings:
ex. $_table{'Ö'}; # gives '&Ouml;'
A fallback '?' is returned for any Unicode character missing from that hash.
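For the sake of discussion, the dictionary mechanism (hash plus '?' fallback) can be sketched like this; the in-line table and the entity values are assumptions standing in for the real sourced file:

```perl
use strict;
use warnings;
use utf8;

# Hypothetical in-line stand-in for the sourced dictionary file
my %_table = (
    "\x{D6}" => '&Ouml;',     # 'Ö'
    "\x{E9}" => '&eacute;',   # 'é'
);

# Lookup with the built-in '?' fallback for unknown characters
sub escape_char {
    my ($char) = @_;
    return exists $_table{$char} ? $_table{$char} : '?';
}

print escape_char("\x{D6}"), "\n";    # &Ouml;
print escape_char("\x{20AC}"), "\n";  # ? (euro sign not in the table)
```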
The current implementation takes the approach of converting the strings into bytes:
my $bytes = pack( "C*", unpack( "U0C*", $$sgml_r ));
# move into the bytes world
use bytes;
.... # all the processing happens wrapped here
no bytes;
In this context, it searches for non-ASCII bytes (outside the space-to-tilde range):
$bytes =~ s/([^\ -\~]+.*)/$self->_fixup($1)/ego;
The _fixup function checks the length of the non-ASCII sequence (5, 4, 3, 2 bytes; yes, 6 is not considered), looks it up in the hash, and RECURSIVELY processes the remaining sequence of bytes. (All the functions used, length/substr/concatenation (.), therefore operate in bytes context.)
I could try to pin down what exactly changed between v5.8 and v5.10, but I'm wondering whether this approach is right in the first place.
There are probably some good ideas in it, but I can spot many steps which are considered and documented as bad practice when working with Unicode in Perl.
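For context, the usually recommended pattern (as opposed to use bytes) is to decode byte input into character strings at the boundary, work in characters, and encode back to bytes on output. A minimal sketch, with a hypothetical sample string:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical input: raw UTF-8 octets as they might arrive from a file/socket
my $octets = "V\xC3\x96LKER";           # 7 UTF-8 bytes encoding 'VÖLKER'

my $chars = decode_utf8($octets);       # decode at the boundary: now characters
printf "characters: %d\n", length $chars;   # 6 characters, not 7 bytes

my $out = encode_utf8($chars);          # encode again on the way out
print "round-trip ok\n" if $out eq $octets;
```

With this pattern, length/substr/regexes all see characters, and there is no need for the bytes pragma at all.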
I've successfully tested a simple solution which just checks every single character:
$$sgml_r =~ s/(.)/$self->_mapchar($1)/eg;
where _mapchar is the function which performs the lookup and conversion for non-ASCII characters:
use Encode qw(encode_utf8);   # needed for encode_utf8

sub _mapchar {
    my ($self, $char) = @_;   # called as a method, so $self arrives first
    if ( $char !~ /\s/ ) {    # \s already covers \r and \n
        my $nbytes = length encode_utf8($char);
        if ( $nbytes > 1 ) {  # multi-byte, i.e. non-ASCII
            $char = exists $_table{$char} ? $_table{$char} : '?';
        }
    }
    return $char;
}
Apart from needing further testing, this solution checks EVERY character, which doesn't seem like good practice either. The rate of strings to process is high, and the strings aren't short. Moreover, a given string may contain no character outside the ASCII range at all, or only a few.
Neither solution seems right. I'd be interested in any Perlish considerations from the experts, to better figure out what to avoid for sure.
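One direction I'm considering, sketched here under the assumption that the string has already been decoded to characters and that %_table holds the mappings (the in-line table is a stand-in for the sourced file): let the regex engine itself skip all the ASCII by matching only non-ASCII characters, so the Perl-level callback fires only for the rare characters that actually need mapping.

```perl
use strict;
use warnings;
use utf8;

# Assumed dictionary (the real one is sourced from the separate file)
my %_table = ( "\x{D6}" => '&Ouml;' );   # 'Ö' => '&Ouml;'

sub _mapchar {
    my ($char) = @_;
    return exists $_table{$char} ? $_table{$char} : '?';
}

my $sgml = "V\x{D6}LKER, plain ASCII stays untouched \x{2603}";

# Only characters outside the ASCII range trigger the callback;
# a pure-ASCII string is scanned once with zero substitutions.
$sgml =~ s/([^\x00-\x7F])/_mapchar($1)/eg;

print "$sgml\n";   # V&Ouml;LKER, plain ASCII stays untouched ?
```

An ASCII-only string costs just one pass of the regex engine with no callback at all, which addresses the concern about checking every character.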