http://qs321.pair.com?node_id=11148731

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. The following code should show the output "=マリウス" but shows "=xn--caaba8k0b0a7jzpccc.com" instead.

#!/usr/bin/perl
use utf8;
use URI;
use Encode;
use strict;
use warnings;

my $href="https://\x{30de}\x{30ea}\x{30a6}\x{30b9}.com/";
print $href,"\n";

my $uri = URI->new($href);

my $domain = $uri->host;
print ":",$domain,"\n";

$domain = Encode::decode('utf-8', $domain);
print "=",$domain,"\n";

$domain = Encode::encode('utf-8', $domain);
print ".",$domain,"\n";

exit(0);

What is a good way to get the variable $domain to contain "マリウス" as UTF-8? I've tried Encode::encode and Encode::decode in several permutations but that is probably not the right way. Is there some way to wrap the URI function in such a way as to have it process Unicode?

P.S. This web form has trouble with the Japanese as well and has converted the string to a bunch of HTML entities.

Edit: added use utf8; and redid $href definition.

Replies are listed 'Best First'.
Re: CPAN's URI.pm versus Japanese as Unicode?
by 1nickt (Canon) on Dec 11, 2022 at 13:06 UTC

    Hello,

    You want domain_to_unicode from Net::IDN::Encode, I believe.

    #!/usr/bin/perl
    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use URI;
    use Net::IDN::Encode 'domain_to_unicode';
    use strict;
    use warnings;
    
    my $href="https://マリウス.com/";
    print $href,"\n";
    
    my $uri = URI->new($href);
    
    my $punycode = $uri->host;
    
    print ":",$punycode,"\n";
    
    my $domain = domain_to_unicode($punycode);
    
    print $domain, "\n";
    
    exit(0);
    

    Output:
    https://マリウス.com/
    :xn--gckvb8fzb.com
    マリウス.com
    

    Hope this helps!

    Edit: ++haukex posted while I was composing my reply


    The way forward always starts with a minimal test.

      Thanks for a very clear example. It did help.

Re: CPAN's URI.pm versus Japanese as Unicode?
by haukex (Archbishop) on Dec 11, 2022 at 10:15 UTC

    I see two problems here: first, your source file is not declared as UTF-8 with use utf8;, which means that my $href="https://マリウス.com/"; is actually giving the string "https://\343\203\236\343\203\252\343\202\246\343\202\271.com/". Second, URI is encoding that with Punycode, which IMHO is one correct approach, as the URI documentation states that it works with URIs as per RFC 2396 and RFC 2732, which I think only support US-ASCII.

    If you add the use utf8;, you get the output =xn--gckvb8fzb.com, which is the correct Punycode domain name of "マリウス.com" ("\x{30de}\x{30ea}\x{30a6}\x{30b9}.com").
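
    To illustrate the first point, here is a minimal sketch that uses only the core Encode module. The variable names are hypothetical; the byte string is the one quoted above. It shows that, without a decoding step (or use utf8; for literals in the source), Perl is holding 12 bytes rather than 4 characters.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # The raw UTF-8 bytes Perl sees for the host label when the source
    # file is not declared as UTF-8 (same octal escapes as quoted above).
    my $bytes = "\343\203\236\343\203\252\343\202\246\343\202\271";

    # Decoding turns the byte string into a string of Unicode characters.
    my $chars = decode('UTF-8', $bytes);

    printf "bytes: %d, characters: %d\n", length($bytes), length($chars);  # bytes: 12, characters: 4
    printf "first code point: U+%04X\n", ord($chars);                      # first code point: U+30DE
    ```

    Only the decoded form is what an IDN encoder should be given; feeding it the undecoded bytes gives a different, unintended Punycode name.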

    What is unclear to me is your goal. Why do you (think you) need a URI object with Unicode characters in it?

      Thanks, though adding use utf8 does not affect the result. Perhaps I need to convert from Punycode. Is there a module for converting from Punycode to Unicode? Working with the host names as Punycode is not really an option, as far as I can tell, because the host name needs to remain human-readable.

      The goal is to extract the host name from the URI and the host name happens to be Japanese as Unicode, as is wont to happen.

        Thanks, though adding use utf8 does not affect the result

        Yes, it does.

        ... the host name needs to remain human-readable. The goal is to extract the host name from the URI and the host name happens to be Japanese as Unicode, ...

        Corion already pointed you to Net::IDN::Encode as one possibility.

        use warnings;
        use strict;
        use utf8;
        use open qw/:std :encoding(UTF-8)/;
        use URI;
        use Net::IDN::Encode qw/domain_to_unicode/;
        
        my $href="https://マリウス.com/";
        my $uri = URI->new($href);
        my $domain = domain_to_unicode($uri->host);
        print $domain,"\n";  # prints "マリウス.com"
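
        A possible alternative, offered as a sketch rather than something tested in this thread: the URI documentation also describes an ihost method that returns the host with any xn-- labels turned back into Unicode. Whether it is available depends on the installed URI version, so treat that as an assumption and check your copy first. The host is written with \x{...} escapes here so the example does not depend on the source file's encoding.

        ```perl
        use strict;
        use warnings;
        use open qw/:std :encoding(UTF-8)/;
        use URI;

        # Same host as above, spelled with escapes instead of a UTF-8 literal.
        my $href = "https://\x{30de}\x{30ea}\x{30a6}\x{30b9}.com/";
        my $uri  = URI->new($href);

        my $ascii   = $uri->host;   # Punycode (ASCII) form of the host
        my $unicode = $uri->ihost;  # Unicode form; assumes this URI version provides ihost

        print $ascii, "\n";
        print $unicode, "\n";
        ```

        If ihost is missing or behaves differently on your system, domain_to_unicode from Net::IDN::Encode as shown above remains the approach confirmed in this thread.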
        
Re: CPAN's URI.pm versus Japanese as Unicode?
by Corion (Patriarch) on Dec 11, 2022 at 10:12 UTC

      Thanks! That was it, though I needed that nudge from haukex to see it.

      Net::IDN::Encode was able to restore the host name to Unicode. Now I have to find a way to get that into Alpine without damaging a production system.