drhender has asked for the wisdom of the Perl Monks concerning the following question:
I am using Perl to parse HTML and convert it to XML. Slowly, over time, I have built up a table of HTML special entities (&copy;, &nbsp;, &pound;, etc.) that I have to convert to hex values before putting them in the XML. Does anyone know if there's a module lying around somewhere that would do that conversion for me, or should I just stick to using the lookup table?
Re: Converting HTML special entities to XML
by Aristotle (Chancellor) on Sep 01, 2004 at 18:37 UTC
Entities should always be expanded to Unicode characters on input and escaped on output. Your HTML parser should just give you Unicode, and whatever XML generator you use should be escaping it automatically for you as appropriate for the target encoding.
Don't attempt to transcode entities and whatnot manually in order to insert literal bytes into the output XML stream. That way lies madness (and a lot of buggy code; most code dealing with XML out there is quite broken with regard to encodings).
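A minimal sketch of that decode-then-escape approach, assuming the CPAN module HTML::Entities (part of the HTML-Parser distribution) is installed. The sample string and variable names are made up for illustration, and a real XML writer module would normally do the second step for you.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities_numeric);

# Hypothetical input: text taken from an HTML document, entities intact.
my $html_text = 'Caf&eacute; &copy; 2004, &pound;5 &amp; up';

# Step 1: expand every HTML entity to the Unicode character it names.
my $text = decode_entities($html_text);

# Step 2: escape on output. encode_entities_numeric() replaces markup
# characters and non-ASCII characters with hex numeric character
# references, which XML accepts without any entity declarations.
my $xml_text = encode_entities_numeric($text);

print "$xml_text\n";
```

Since the output contains only ASCII and numeric references, it is safe whatever encoding the XML document declares, and no hand-maintained lookup table is needed.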
Makeshifts last the longest.
Re: Converting HTML special entities to XML
by iburrell (Chaplain) on Sep 01, 2004 at 22:32 UTC