If your goal is to create an XML output whose content is an imperfect and incomplete copy of the original HTML text data (i.e. with an indeterminate amount of corruption due to loss of content), then a "generic approach" for implementing what almut aptly calls the "last resort" solution is a simple regex, applied to the HTML text data:
s/[^\x00-\x7f]+//g;
That is, every byte or character outside the ASCII range is deleted, regardless of whether your perl script happens to be handling the data as bytes or as characters.
A better approach is to determine what the character encoding of the incoming HTML data actually is (and watch out for HTML character entities that decode to non-ascii characters, such as &trade;, &eacute;, and so on). Do everything necessary to turn the text into "pure" utf8 strings (HTML::Entities::decode_entities handles the entities), and then either output the XML with proper utf8 encoding, or convert all non-ascii characters to their numeric character entities, as almut suggested above.
There's probably a module for converting characters to numeric entities, but the basic process is:
s/([^\x00-\x7f])/sprintf("&#%d;",ord($1))/eg;
(update: added a missing "#" in the sprintf format string)
But personally, I prefer having XML files with utf8 text in them.
In either case, perl has to know that the string contains utf8 characters, so that it treats the data as (multi-byte) characters rather than as bytes. That means either reading the data from a file handle with a ":utf8" IO layer, or using Encode::decode to convert the byte string to characters.