in reply to uparse - Parse Unicode strings

Hi Ken!

I tried to find a general solution to the problem reported in Re: uparse - Parse Unicode strings.

Short explanation of the problem:
There are two basic ways to get correct UNICODE input from the elements in @ARGV:
- implicit decoding by Perl itself, enabled with the -C command-line switch (e.g. -CA) or the PERL_UNICODE environment variable
- explicit decoding inside the script, e.g. with Encode::decode

Either may be used, but not both.

A script that expects UNICODE data from @ARGV cannot easily detect if the implicit decoding is in effect, especially because -CAL makes the behaviour locale-dependent.

The best solution I could find is to check whether the data in question already carries Perl's internal UTF8 flag, i.e. whether it has already been decoded into a character string. Encode::is_utf8 (or the equivalent utf8::is_utf8) may be used to check this flag, which results in a small modification to your script:

diff --git a/uparse b/uparse
index f5edb92..b05e12a 100755
--- a/uparse
+++ b/uparse
@@ -23,11 +23,11 @@ use constant {
     NO_PRINT => "\N{REPLACEMENT CHARACTER}",
 };
 
-use Encode 'decode';
+use Encode qw(decode is_utf8);
 use Unicode::UCD 'charinfo';
 
 for my $raw_str (@ARGV) {
-    my $str = decode('UTF-8', $raw_str);
+    my $str = is_utf8($raw_str) ? $raw_str : decode('UTF-8', $raw_str);
     print "\n", SEP1;
     print "String: '$str'\n";
     print SEP1;
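
To see the check in isolation, here is a minimal standalone sketch. The helper name decode_arg is made up for this example and is not part of uparse, and the two test strings only simulate what @ARGV would contain without and with -CA:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper (not in uparse): return a character string
# whether or not the argument has already been decoded.
sub decode_arg {
    my ($raw) = @_;
    return is_utf8($raw) ? $raw : decode('UTF-8', $raw);
}

# Simulate both situations for U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";                   # raw UTF-8 octets, as without -CA
my $chars = decode('UTF-8', "\xC3\xA9");  # already decoded, as with -CA

for my $arg ($bytes, $chars) {
    my $str = decode_arg($arg);
    # printf output is pure ASCII, so no output layer is needed here.
    printf "length=%d codepoint=U+%04X\n", length $str, ord $str;
}
```

Both iterations should report length=1 and U+00E9, so the argument is decoded exactly once in either case. Note that the flag is an internal detail of Perl's string representation. For pure-ASCII arguments it does not matter whether it is set, because decoding them is a no-op.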

What do you think about this?