http://qs321.pair.com?node_id=870917


in reply to Cleaning up non 7-bit Ascii Chars for XML-processing

You're getting the data as cp1252 ("’" is byte 0x92 in cp1252), but you're outputting it as-is in a document you claim is UTF-8.

Always decode your inputs. Always encode your outputs. You are apparently doing neither.
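As a rough illustration of that rule, here is a minimal sketch using the core Encode module. The sample bytes and variable names are made up for the example, and it assumes the input really is cp1252, as it appears to be.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical input: byte 0x92 is the curly quote in cp1252.
my $bytes = "It\x92s";

# Decode on input: cp1252 bytes become Perl characters.
my $text = decode('cp1252', $bytes);

# The quote is now the single character U+2019.
printf "U+%04X\n", ord(substr($text, 2, 1));

# Encode on output: characters become UTF-8 bytes.
my $utf8 = encode('UTF-8', $text);
```

In a real program the decoding and encoding would more likely be done with I/O layers on the file handles, but the principle is the same.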

Note that the quote is character U+2019, so the proper escape is &#x2019; or &#8217;, not &#92;.

If you pass properly decoded text to the following function, it will produce 7-bit clean UTF-8 (aka US-ASCII) XML text and XML attribute values.

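    sub encode_entities {
        my ($self, $text) = @_;
        # "&" must be escaped first so later entities aren't double-escaped.
        $text =~ s/&/&amp;/g;
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        $text =~ s/"/&quot;/g;
        $text =~ s/'/&#39;/g;
        # Anything outside printable ASCII becomes a numeric character reference.
        $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
        return $text;
    }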
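A small usage sketch, assuming the text has already been decoded. The function is repeated so the example runs on its own; the sample string and the use of "main" as the invocant are only for illustration, since the function is written as a method.

```perl
use strict;
use warnings;

# Copy of the function above, so this example is self-contained.
sub encode_entities {
    my ($self, $text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&#39;/g;
    $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
    return $text;
}

# Already-decoded text containing U+2019 (right single quotation mark).
my $text = "Tom\x{2019}s <b> & \"co\"";

# "main" stands in for the object the method would normally be called on.
my $xml = main->encode_entities($text);

print "$xml\n";   # Tom&#x2019;s &lt;b&gt; &amp; &quot;co&quot;
```

Because the result contains only ASCII characters, it is the same sequence of bytes whether the document is declared as UTF-8, US-ASCII or cp1252.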