in reply to Cleaning up non 7-bit Ascii Chars for XML-processing
You're receiving the data as cp1252 ("’" is byte 0x92 in cp1252), but you're writing it out unchanged in a document you claim is UTF-8.
Always decode your inputs. Always encode your outputs. You are apparently doing neither.
Note that the quote is character U+2019, so the proper escape is &#x2019; or &#8217;, not one built from the cp1252 byte value 92.
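To make the decode/encode advice concrete, here is a minimal sketch using the core Encode module. It assumes the input really is cp1252; the byte string is a made-up sample, not the original poster's data.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical sample input: raw bytes as read from a cp1252 source.
my $bytes = "It\x92s";

# Decode the input: cp1252 bytes -> Perl character string.
my $text = decode('cp1252', $bytes);

# Byte 0x92 has become the character U+2019.
printf "U+%04X\n", ord(substr($text, 2, 1));    # prints "U+2019"

# Encode the output: character string -> UTF-8 bytes.
my $utf8 = encode('UTF-8', $text);

# U+2019 takes three bytes in UTF-8 (E2 80 99), so 4 characters become 6 bytes.
printf "%d characters, %d bytes\n", length($text), length($utf8);
```

In a real program the same thing is usually done with I/O layers (for example opening the input with :encoding(cp1252) and the output with :encoding(UTF-8)) rather than by calling decode and encode by hand.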
If you pass properly decoded text to the following function, it will produce 7-bit clean output (valid as US-ASCII, and therefore also as UTF-8) that is safe to use as XML text and in XML attribute values.
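sub encode_entities {
    my ($self, $text) = @_;

    # Escape "&" first so the entities added below are not escaped again.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;

    # Replace everything outside printable ASCII with a numeric character reference.
    $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;

    return $text;
}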
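Because the function takes $self as its first argument, it has to be called as a method. The sketch below assumes a hypothetical package name, XmlEscaper, wrapped around a copy of the function, and a made-up sample string; neither comes from the original post.

```perl
use strict;
use warnings;

package XmlEscaper;    # hypothetical package name, for illustration only

sub encode_entities {
    my ($self, $text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
    return $text;
}

package main;

# Already-decoded text: contains U+2019 plus XML metacharacters.
my $text = "Don\x{2019}t use < or &";

# Class-method call; the class name fills the $self slot.
my $xml = XmlEscaper->encode_entities($text);

print "$xml\n";    # prints "Don&#x2019;t use &lt; or &amp;"
```

The result contains only printable ASCII, so it can be written to a file declared as UTF-8 or US-ASCII without any further encoding step.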