PerlMonks  

Re: Converting HTML special entities to XML

by Aristotle (Chancellor)
on Sep 01, 2004 at 18:37 UTC


in reply to Converting HTML special entities to XML

Don't do that. Decode them using HTML::Entities, preferably to UTF-8, and stick that in the XML. If your XML documents are not in UTF-8 encoding, your XML generator should automatically entitify any characters the encoding cannot natively represent, without requiring any particular care from you. If you're not using an XML generator, you should be.
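
As a rough illustration of that workflow, here is a minimal sketch in Python rather than Perl, with the standard library's html.unescape standing in for HTML::Entities and xml.etree.ElementTree standing in for "an XML generator". The function name html_text_to_xml and the element name p are made up for the example.

```python
import html
import xml.etree.ElementTree as ET


def html_text_to_xml(fragment: str, encoding: str = "utf-8") -> bytes:
    """Decode HTML entities to Unicode, then let the XML generator escape.

    Hypothetical helper: wraps the decoded text in a single <p> element.
    """
    # Step 1: turn entities such as &eacute; into real Unicode characters.
    text = html.unescape(fragment)

    # Step 2: hand plain Unicode to the generator. It escapes markup
    # characters and, for encodings that cannot represent a character,
    # falls back to numeric character references on its own.
    elem = ET.Element("p")
    elem.text = text
    return ET.tostring(elem, encoding=encoding)


source = "caf&eacute; &amp; cr&egrave;me &copy;"
as_utf8 = html_text_to_xml(source, "utf-8")
as_ascii = html_text_to_xml(source, "us-ascii")
```

With a UTF-8 target the accented characters are written as literal bytes; with a US-ASCII target the same call emits numeric character references instead, and no entity handling appears in the calling code either way.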

For good measure, in case you aren't familiar with the subject, read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Makeshifts last the longest.

Re^2: Converting HTML special entities to XML
by iburrell (Chaplain) on Sep 01, 2004 at 22:49 UTC
    I think it is better to translate them to character references. The entities can't be represented accurately other than with Unicode. The HTML entity resolver would need to produce UTF-8 strings.

    This assumes that the HTML-to-XML process is converting escaped text to escaped text. If the text is being unescaped for other reasons, then the entities should be expanded to UTF-8 and escaped on output.
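
    The escaped-to-escaped translation could be sketched roughly as follows, in Python for the sake of a self-contained example. The function name named_to_numeric is hypothetical, and the lookup table is Python's html.entities.name2codepoint (the HTML 4 entity set), not anything from HTML::Entities. The five entities XML predefines are left alone, as are names the table does not know.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &eacute; but not numeric ones like &#233;.
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(escaped: str) -> str:
    """Rewrite HTML named entities as numeric character references.

    The input stays escaped throughout; nothing is decoded to characters.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return "&#{};".format(name2codepoint[name])

    return _NAMED_ENTITY.sub(replace, escaped)


converted = named_to_numeric("caf&eacute; &amp; &nbsp;&copy; &bogus;")
```

    Unknown names such as &bogus; pass through unchanged here, which means the result may still fail to parse as XML; a real converter would have to decide whether to reject or report them.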

      They should always be expanded to UTF-8 and escaped on output. Your HTML parser should just give you Unicode, and whatever XML generator you use should be escaping it automatically for you as appropriate for the target encoding.

      Don't attempt to manually transcode entities and whatnot in order to insert literal bytes into the output XML stream. That way lies madness (and a lot of buggy code; most code dealing with XML out there is quite broken with regard to encodings).
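
      One way this kind of bug shows up, sketched in Python with a made-up is_well_formed helper: bytes in one encoding are pasted by hand into a document that declares another. The generator-produced version of the same text parses without trouble.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: bytes) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


text = "caf\u00e9"

# Hand-assembled: Latin-1 bytes dropped into a document declared as UTF-8.
hand_built = (
    b"<?xml version='1.0' encoding='utf-8'?><p>"
    + text.encode("latin-1")
    + b"</p>"
)

# Generator-built: the library encodes and escapes consistently.
elem = ET.Element("p")
elem.text = text
generated = ET.tostring(elem, encoding="utf-8")

hand_built_ok = is_well_formed(hand_built)
generated_ok = is_well_formed(generated)
```

      The hand-built document is rejected because the lone 0xE9 byte is not valid UTF-8, while the generated one round-trips to the original string.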

      Makeshifts last the longest.

        It really depends on what kind of processing you are doing. Dealing with the unescaped characters is the safest approach, but it requires dealing with charset issues and making sure the output is escaped properly.

        Dealing with the escaped text, in its native charset, is simpler. Character references help because you don't need to worry about character sets for them; they always refer to Unicode code points. In fact, they are the safest way to get Unicode characters into a document, given all the charset mangling that goes on.
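
        A small Python check of that claim, using a hypothetical text_of helper: the same character references are parsed under three different declared encodings and yield the same Unicode text each time.

```python
import xml.etree.ElementTree as ET


def text_of(declared_encoding: str) -> str:
    """Parse a tiny document whose only non-ASCII content is char refs."""
    doc = (
        "<?xml version='1.0' encoding='{}'?>"
        "<p>caf&#233; &#8364;</p>"
    ).format(declared_encoding)
    # The document bytes are pure ASCII, so they are valid in all three
    # encodings tried below.
    return ET.fromstring(doc.encode("ascii")).text


results = {enc: text_of(enc) for enc in ("utf-8", "iso-8859-1", "us-ascii")}
```

        Note that &#8364; (the euro sign) has no representation in ISO-8859-1 or US-ASCII at all, yet it survives, because the reference names a code point rather than a byte.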
