http://qs321.pair.com?node_id=387639

drhender has asked for the wisdom of the Perl Monks concerning the following question:

I am using Perl to parse HTML and convert it to XML. Slowly, over time, I have built up a table of HTML special entities (&copy;, &nbsp;, &pound;, etc.) that I have to convert to hex values before putting them in the XML. Does anyone know if there's a module lying around somewhere that would do that conversion for me, or should I just stick to using the lookup table?

Replies are listed 'Best First'.
Re: Converting HTML special entities to XML
by Aristotle (Chancellor) on Sep 01, 2004 at 18:37 UTC
      I think it is better to translate them to character references. The entities can't be represented accurately other than with Unicode. The HTML entity resolver would need to produce UTF-8 strings.

      This assumes that the HTML-to-XML process is converting escaped text to escaped text. If the text is being unescaped for other reasons, then the entities should be expanded to UTF-8 and escaped on output.

        They should always be expanded to UTF-8 and escaped on output. Your HTML parser should just give you Unicode, and whatever XML generator you use should be escaping it automatically for you as appropriate for the target encoding.

        Don't attempt to transcode entities and whatnot manually in order to insert literal bytes into the output XML stream. That way lies madness (and a lot of buggy code; most code dealing with XML out there is quite broken with regard to encodings).
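
        A minimal sketch of the "expand to Unicode, escape on output" approach, assuming the HTML::Entities module from CPAN is installed. The sample string and variable names are made up for illustration. In a real program the HTML parser and the XML writer would normally do both steps for you.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities_numeric);

# Sample text (hypothetical) as it might appear in an HTML document.
my $html_text = 'Copyright &copy; 2004, price &pound;5 &amp; up';

# Step 1: expand named entities into a Perl Unicode string.
my $unicode = decode_entities($html_text);

# Step 2: escape on output. encode_entities_numeric() emits hexadecimal
# character references, so no HTML DTD is needed on the XML side.
my $xml_text = encode_entities_numeric($unicode);

print "$xml_text\n";
```

        With the default arguments, encode_entities_numeric() also escapes markup characters such as & and <, so the result should be safe to drop into element content without a hand-maintained lookup table.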

        Makeshifts last the longest.

Re: Converting HTML special entities to XML
by iburrell (Chaplain) on Sep 01, 2004 at 22:32 UTC
    Look at the entity declarations in the XHTML or HTML specs. Those are what real SGML/XML processors use to translate the entities into character references.

    http://www.w3.org/TR/xhtml1/#h-A2 has links to the DTD files for XHTML1.
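
    If you would rather generate the lookup table than maintain it by hand, the declarations can be parsed directly. This is only a sketch using core Perl: the three sample declarations are written in the style of the XHTML 1.0 entity files, the hash and function names are hypothetical, and reading the real files from disk is left out.

```perl
use strict;
use warnings;

# A few declarations in the style of the XHTML 1.0 entity files. In practice
# this text would be read from a downloaded entity file (assumed, not shown).
my $declarations = <<'END_DTD';
<!ENTITY nbsp   "&#160;"> <!-- no-break space -->
<!ENTITY copy   "&#169;"> <!-- copyright sign -->
<!ENTITY pound  "&#163;"> <!-- pound sign -->
END_DTD

# Build a name => character reference table from each <!ENTITY name "value">.
my %char_ref;
while ($declarations =~ /<!ENTITY\s+(\w+)\s+"(&#x?[0-9A-Fa-f]+;)"\s*>/g) {
    $char_ref{$1} = $2;
}

# Replace known named entities; leave anything not in the table untouched.
sub entities_to_char_refs {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $char_ref{$1} ? $char_ref{$1} : "&$1;"/ge;
    return $text;
}

print entities_to_char_refs('&copy; 2004&nbsp;&pound;5 &amp; up'), "\n";
```

    Entities that are not in the table, such as &amp; here, pass through unchanged, which is fine for the five entities XML predefines but means any other unknown name would still need handling.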