PerlMonks  

Decoding unicode entities with HTML::Parser

by Sixtease (Friar)
on Apr 09, 2008 at 07:31 UTC

Sixtease has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

HTML::Parser provides the HTML::Entities::_decode_entities function, which is the lower-level counterpart of HTML::Entities::decode. I use it in my XML::Entities module to do the real work. However, it appears that older versions of HTML::Parser don't decode Unicode entities.

use HTML::Parser;
$x = "&#345;";
HTML::Entities::_decode_entities($x, {});
print "$x\n";
# outputs "ř" on 3.56
# outputs "&#345;" on 3.35

The changelog for HTML::Parser says that as of version 3.39_90, Unicode entities are always decoded on perl 5.8+ and that this is "no longer a compile-time directive". However, I found nothing about such a directive in the earlier versions.

So, my question is: How can I make the older versions of HTML::Parser treat unicode entities?

use strict; use warnings; print "Just Another Perl Hacker\n";

Replies are listed 'Best First'.
Re: Decoding unicode entities with HTML::Parser
by ikegami (Patriarch) on Apr 09, 2008 at 10:09 UTC

    It turns out that your problem is not based on the version of the module but on how it was compiled.

    The functionality appears to be there in the version with which you have a problem (HTML::Entities 1.27, HTML-Parser 3.35), but it's conditional on UNICODE_ENTITIES being defined when the module was compiled.

    use HTML::Entities qw( _decode_entities );

    print(HTML::Entities::UNICODE_SUPPORT(), "\n");

    # Not really, but good enough and avoids warning.
    binmode(STDOUT, ':encoding(UTF-8)');

    my $x = "&#345;";
    _decode_entities($x, {});
    print("$x\n");

    >this_perl script.pl
    1
    [some char]

    >that_perl script.pl
    0
    &#345;

    Up to HTML-Parser 3.38, perl Makefile.PL prompted whether UNICODE support was desired or not and would set UNICODE_ENTITIES accordingly.

    In 3.40, UNICODE_ENTITIES was renamed to UNICODE_HTML_PARSER, and it's solely based on the Perl version. (It's defined for Perl 5.8+.)

    There's really not much you can do if the user didn't compile the functionality you want. In this case, there's no way to add a hack, short of rewriting the entire function. You should add a dependency on HTML::Entities::UNICODE_SUPPORT returning true. When it doesn't, ask the user to either recompile the module with UNICODE support or upgrade to at least HTML-Parser 3.40.
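    A minimal sketch of such a check, assuming it runs when the dependent module loads. The helper name have_unicode_entities is hypothetical; UNICODE_SUPPORT is the function discussed above, and the sketch also guards against builds where it might be missing.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the installed HTML::Entities reports
# that it was built with Unicode entity support.
sub have_unicode_entities {
    # Treat a missing module the same as missing support.
    return 0 unless eval { require HTML::Entities; 1 };

    # Assumption: some builds may not define the sub at all.
    my $probe = HTML::Entities->can('UNICODE_SUPPORT');
    return 0 unless $probe;

    return $probe->() ? 1 : 0;
}

unless (have_unicode_entities()) {
    warn "HTML::Entities was built without Unicode entity support; "
       . "rebuild HTML-Parser with Unicode enabled or upgrade to 3.40+.\n";
}
```

    Whether to warn, die, or switch to another code path at this point is a design choice for the module author.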

      Cool, cool!

      This is exactly what I needed to know. I rewrote the function with a regex (which is a bit slower) and fall back to it when HTML::Entities::UNICODE_SUPPORT is false. Thanks a lot for the insight.
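      A rough sketch of what such a fallback could look like. This is not the actual XML::Entities code: regex_decode_entities is a hypothetical name, and unlike the XS routine it only handles references that end in a semicolon.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for _decode_entities($string, \%entity2char).
# Decodes in place: numeric references (decimal and hex) plus any names
# found in the supplied hash. Unknown names are left untouched.
sub regex_decode_entities {
    my $map = $_[1] || {};
    $_[0] =~ s{
        &(?: \#(\d+) | \#[xX]([0-9a-fA-F]+) | (\w+) );
    }{
        defined($1)          ? chr($1)
      : defined($2)          ? chr(hex($2))
      : exists($map->{$3})   ? $map->{$3}
      :                        "&$3;"
    }gex;
    return;
}

# Prefer the XS routine when it is present and reports Unicode support.
my $decode = do {
    my $ok = eval { require HTML::Entities; 1 }
          && HTML::Entities->can('UNICODE_SUPPORT')
          && HTML::Entities::UNICODE_SUPPORT();
    $ok ? \&HTML::Entities::_decode_entities : \&regex_decode_entities;
};

my $x = "&#345;";
$decode->($x, {});
binmode(STDOUT, ':encoding(UTF-8)');
print "$x\n";
```

      Choosing the implementation once, at load time, keeps the per-call cost of the check out of the decoding path.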

Re: Decoding unicode entities with HTML::Parser
by ikegami (Patriarch) on Apr 09, 2008 at 09:18 UTC

    HTML::Entities provides _decode_entities (not HTML::Parser), so it would be more relevant to include the version of HTML::Entities.

    Update: So this would be a better test case:

    use HTML::Entities qw( decode_entities _decode_entities );

    print(HTML::Entities->VERSION(), "\n");

    # Not really, but good enough and avoids warning.
    binmode(STDOUT, ':encoding(UTF-8)');

    my $x = "&#345;";
    _decode_entities($x, {});
    print("$x\n");

    print(decode_entities("&#345;"), "\n");

    >c:\progs\perl580\bin\perl script.pl
    1.23
    &#345;
    &#345;

    >c:\progs\perl588\bin\perl script.pl
    1.32
    [some char]
    [some char]

    Update: Oh I see, HTML::Entities is part of HTML-Parser. Best if you just ignore this post.

Re: Decoding unicode entities with HTML::Parser
by Juerd (Abbot) on Apr 09, 2008 at 09:47 UTC

    So, my question is: How can I make the older versions of HTML::Parser treat unicode entities?

    Why do you insist on using an old version? Is the module etched into the motherboard of the computer? New versions are made because there's something wrong or suboptimal about the old version, and in general you're supposed to upgrade in order to take advantage of the changes.
