
UTF8 Output with XML::Feed?

by mldvx4 (Friar)
on Mar 07, 2022 at 14:23 UTC ( [id://11141885] : perlquestion )

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I would like to have XML::Feed produce UTF-8, but after experimenting a bit and searching a great many web pages, I feel stumped -- again. :/

Below is a sample which generates escaped characters in the feed title instead of showing UTF-8. I would like it to generate proper "åäö" characters instead of the escaped character references it currently produces.

#!/usr/bin/perl
use open ':encoding(utf8)';
use XML::Feed;
use English;
use strict;
use warnings;

my $feed = XML::Feed->new('RSS');
$feed->title('Feed');
$feed->link('https://www.example.com/feed.rss');
$feed->language('en');
$feed->description('Feed from a to ö');

my $entry = XML::Feed::Entry->new();
$entry->link('https://www.example.com/one.html');
$entry->title('abc...åäö');
$feed->add_entry($entry);

print $feed->as_xml;
exit(0)

Can I pass an open file handle to XML::Feed somehow? Or what is a correct method?

Replies are listed 'Best First'.
Re: UTF8 Output with XML::Feed?
by kcott (Archbishop) on Mar 07, 2022 at 18:07 UTC

    G'day mldvx4,

    Note: I've used this common alias of mine in a couple of places:

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    There are two lines in your output that you should note. When I run your code as posted, I get:

    ...
    <description>Feed from a to &#xC3;&#xB6;</description>
    ...
    <title>abc...&#xC3;&#xA5;&#xC3;&#xA4;&#xC3;&#xB6;</title>
    ...

    When I add use utf8;, I get:

    ...
    <description>Feed from a to &#xF6;</description>
    ...
    <title>abc...&#xC3;&#xA5;&#xC3;&#xA4;&#xC3;&#xB6;</title>
    ...

    So, that's fixed the $feed->description():

    $ perlu 'say chr hex "F6"'
    ö

    but not the $entry->title().

    Look at the difference between how you code XML::Feed->new($format) and XML::Feed::Entry->new($format). Aligning those by changing

    my $entry = XML::Feed::Entry->new();

    to

    my $entry = XML::Feed::Entry->new('RSS');

    I now get:

    ...
    <description>Feed from a to &#xF6;</description>
    ...
    <title>abc...&#xE5;&#xE4;&#xF6;</title>
    ...

    So, both the $feed->description() and $entry->title() are now fixed:

    $ perlu 'say chr hex for qw{E5 E4 F6}'
    å
    ä
    ö

    I'll also draw your attention to "XML::Feed: Atom feeds come out as bytes, but RSS as Unicode [rt.cpan.org #43004] #44". I haven't looked into this but it might have some relevance in relation to other XML::Feed work you may be doing.

    — Ken

      my $entry = XML::Feed::Entry->new();
      is equivalent to
      my $entry = XML::Feed::Entry->new('Atom');

      So this appears to be a bug on the Atom side of things.

      And the ticket to which you linked supports that.
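
The two kinds of output seen in this subthread (`&#xC3;&#xA5;...` versus `&#xE5;...`) can be reproduced without XML::Feed at all. The following is a minimal sketch using only the core Encode module. The helper `to_char_refs` is hypothetical and is not XML::Feed's actual escaping code; it merely turns every non-ASCII character into a hexadecimal character reference, so that the same text can be compared as raw UTF-8 bytes and as decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of XML::Feed): replace every character
# above 0x7F with a hexadecimal numeric character reference.
sub to_char_refs {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $out;
}

# The six bytes a UTF-8 editor stores for the three characters å, ä, ö.
# Without "use utf8", a string literal holds exactly these bytes.
my $raw_bytes = "\xC3\xA5\xC3\xA4\xC3\xB6";

# Decoding yields the three characters that a literal holds under "use utf8".
my $characters = decode('UTF-8', $raw_bytes);

my $from_bytes = to_char_refs($raw_bytes);
my $from_chars = to_char_refs($characters);

print "bytes:      $from_bytes\n";   # one reference per byte
print "characters: $from_chars\n";   # one reference per character
```

If the string reaching the serializer is still undecoded, each byte gets its own reference, which matches the `&#xC3;&#xA5;` pattern above; once it has been decoded, each character gets a single reference such as `&#xE5;`.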

Re: UTF8 Output with XML::Feed? (use utf8)
by LanX (Saint) on Mar 07, 2022 at 14:36 UTC
    My guess is that you need to add use utf8;

    It tells Perl to treat the source code as UTF-8 instead of ASCII, and this includes literal strings like 'abc...åäö'

    See utf8 for more.

    update

    This

    &#xC3;&#xA5;&#xC3;&#xA4;&#xC3;&#xB6;

    looks very much like use utf8 is missing.

    Without it, Perl will interpret the multibyte characters as single bytes.

    U+00E5    å    c3 a5    LATIN SMALL LETTER A WITH RING ABOVE

    hence

    &#xC3;&#xA5;

    in HTML encoding of single bytes.

    use utf8

    will activate the utf8 flag for variables populated from literal strings, in order to treat multibytes as character strings.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Thanks. Adding use utf8; was one of the first things I tried. I've also tried opening stdout as :utf8 but that doesn't help either. Adding an additional print() shows that the script itself is handling UTF8, or at least looks like it is, but XML::Feed seems not to.

      #!/usr/bin/perl
      use utf8;
      use open ':encoding(utf8)';
      use XML::Feed;
      use English;
      use strict;
      use warnings;

      my $d='Feed from a to ö';
      my $t='abc...åäö';

      my $feed = XML::Feed->new('RSS');
      $feed->title('Feed');
      $feed->link('https://www.example.com/feed.rss');
      $feed->language('en');
      $feed->description($d);

      my $entry = XML::Feed::Entry->new();
      $entry->link('https://www.example.com/one.html');
      $entry->title($t);
      $feed->add_entry($entry);

      print "Description: $d\n";
      print "Title: $t\n";
      print $feed->as_xml;
      exit(0)

      For what it's worth, the following appears to produce only a blank line.

      #!/usr/bin/perl
      use utf8;
      print "\N{LATIN SMALL LETTER A WITH RING ABOVE}\n";

      The terminal is xfce4-terminal 0.8.10 (Xfce 4.16) and set to use UTF-8. Pressing the keys "åäö" appears to show the right characters.

        "For what it's worth, the following appears to produce only a blank line."

        Please go back and (re)read the utf8 documentation; paying particular attention to the very clear and emboldened directive:

        Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

        The code you presented only contains 7-bit ASCII characters.

        You got what appeared to be a blank line. Here are some things you could have tried:

        $ perl -e 'print "\N{LATIN SMALL LETTER A WITH RING ABOVE}\n";'

        $ perl -e 'print "|\N{LATIN SMALL LETTER A WITH RING ABOVE}|\n";'
        | |
        $ perl -C -e 'print "\N{LATIN SMALL LETTER A WITH RING ABOVE}\n";'
        å
        $ perl -e 'use open OUT => qw{:encoding(UTF-8) :std}; print "\N{LATIN SMALL LETTER A WITH RING ABOVE}\n";'
        å

        See: perlrun for -C; and, the open pragma.

        — Ken

        I can't comment on XML::Feed, sorry.

        But ...

        > Adding use utf8; was one of the first things I tried.

        ... if your source-code is in utf8 (check your editor settings) and you have a line like my $t='abc...åäö'; you must apply use utf8;

        Otherwise Perl will not know how to decode the bytes in that string, because the interpretation is not obvious.

        You should clarify this, before meddling with XML.

        Here is a demo you should run:

        use v5.12;
        use warnings;
        use Data::Dump;

        my $t1='åäö';
        ddx $t1;
        say "length: ",length $t1;

        use utf8;

        my $t2='åäö';
        ddx $t2;
        say "length: ",length $t2;

        OUTPUT:

        # demo_utf8.pl:8: "\xC3\xA5\xC3\xA4\xC3\xB6"    <-- bytes
        length: 6
        # demo_utf8.pl:14: "\xE5\xE4\xF6"    <-- code points
        length: 3

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery
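
The "blank line" discussed earlier in this subthread comes down to which bytes actually reach the terminal. As a rough sketch, assuming only the core Encode module, the code below encodes the single character U+00E5 in two ways. The helper `hex_dump` is hypothetical and exists only to make the bytes visible.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# One character: LATIN SMALL LETTER A WITH RING ABOVE (U+00E5).
my $char = "\N{U+00E5}";

# With no output layer, print emits a character below 256 as one byte,
# the same byte that ISO-8859-1 uses for it.
my $latin1_bytes = encode('ISO-8859-1', $char);

# With -C or an :encoding(UTF-8) layer, the two-byte UTF-8 form is written.
my $utf8_bytes = encode('UTF-8', $char);

my $latin1_hex = hex_dump($latin1_bytes);
my $utf8_hex   = hex_dump($utf8_bytes);

print "without an output layer: $latin1_hex\n";
print "with a UTF-8 layer:      $utf8_hex\n";
```

A lone E5 byte is not valid UTF-8, so a terminal set to UTF-8 has nothing sensible to draw for it, whereas the C3 A5 pair is displayed as å.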