Regex to encode entities in XML

epoptai has asked for the wisdom of the Perl Monks concerning the following question:

(Edited by epoptai on 10/7/03 in reply to 296401)

I've got a problem with the output of Perlmonks' chatterbox XML ticker. When a high-bit (non-ASCII) character like 'á' is entered in the CB, the character is not encoded; it's transmitted in the XML stream in a way that causes XML::Simple to die (as expected when receiving bad XML). It would be best if perlmonks generated 'legal' XML, but that's not the case, so it needs to be dealt with. I don't know much about this subject, and have been using the following code from jcwren to convert the problem characters into underscores:

$xml =~ s/[\r\n\t]//g;
$xml =~ tr/\x80-\xff/_/;
$xml =~ tr/\x00-\x1f/_/;

That's very effective, but leaves something to be desired: the character behind the underscore. Since these characters can be detected and underscored, surely they can be detected and encoded properly? I've made many horribly broken attempts to encode these chrs but my lack of knowledge in this area always gets the last laugh.

Recently mirod posted Converting character encodings which includes a regex from XML::TiePYX that gets very close to doing the job, but it only encodes some of the characters, not all. It barfs on ¤ and probably others:

# This is the regex from XML::TiePYX
$xml =~ s{([\xc0-\xc3])(.)}{
    my $hi = ord($1);
    my $lo = ord($2);
    chr((($hi & 0x03) << 6) | ($lo & 0x3F))
}ge;
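To see what the XML::TiePYX substitution quoted above actually does, here is a minimal sketch that wraps it in a sub (the name `utf8_pairs_to_latin1` is made up for this illustration). It suggests the regex is a decoder for two-byte UTF-8 sequences rather than an entity encoder, which is one possible reason it appears to skip some characters: a lone Latin-1 byte such as `\xa4` (¤) is outside the `\xc0-\xc3` lead-byte range and is left untouched.

```perl
use strict;
use warnings;

# The substitution from the post, wrapped in a sub so it can be tried
# on sample data. The sub name is hypothetical.
sub utf8_pairs_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s{([\xc0-\xc3])(.)}{
        my $hi = ord($1);
        my $lo = ord($2);
        chr((($hi & 0x03) << 6) | ($lo & 0x3F));
    }ge;
    return $bytes;
}

# Render a byte string as space-separated hex for display.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

# UTF-8 input: "\xc3\xa1" is the two-byte encoding of U+00E1 (á).
my $from_utf8 = utf8_pairs_to_latin1("caf\xc3\xa1");

# Latin-1 input: a lone "\xa4" (¤) is not in \xc0-\xc3, so nothing matches.
my $from_latin1 = utf8_pairs_to_latin1("price \xa4 5");

print hex_dump($from_utf8),   "\n";   # 63 61 66 E1
print hex_dump($from_latin1), "\n";   # 70 72 69 63 65 20 A4 20 35
```

Note that the output is still a raw high-bit byte, not an `&#nnn;` reference, so it would need a further encoding step before being fed to a strict parser.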

I seek an extended version of the XML::TiePYX regex to find and encode the full range of high-bit chrs specified in the first solution. I'd rather not use another module (XML parser or otherwise) for this task.

thanks for your time - epoptai

--
Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.


Replies are listed 'Best First'.
Re: Regex to encode entities in XML by mirod (Canon) on Jun 11, 2001 at 10:22 UTC
Generating valid XML for the CB might actually be harder than it looks, as I am not sure how easy it is to figure out the encoding of the messages.

The problem you have might be a bug in XML::Parser: if I use the regexp and then HTML::Entities I get the proper result with XML::Parser 2.27 but the wrong one with XML::Parser 2.30 (it looks like characters lose their UTF-8'edness with the latter).

The solution is either to use Text::Iconv or the Unicode modules as described in my first post about encodings, or to go module lifting once again and to grab code from XML::DOM:

sub safe_encode
{
    my $str = shift;
    # match 2-, 3- and 4-byte UTF-8 sequences
    $str =~ s{([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)}
             {XmlUtf8Decode ($1)}egs;
    return $str;
}

sub XmlUtf8Decode
{
    my ($str, $hex) = @_;
    my $len = length ($str);
    my $n;

    if ($len == 2)
    {
        my @n = unpack "C2", $str;
        $n = (($n[0] & 0x3f) << 6) + ($n[1] & 0x3f);
    }
    elsif ($len == 3)
    {
        my @n = unpack "C3", $str;
        $n = (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6)
           + ($n[2] & 0x3f);
    }
    elsif ($len == 4)
    {
        my @n = unpack "C4", $str;
        $n = (($n[0] & 0x0f) << 18) + (($n[1] & 0x3f) << 12)
           + (($n[2] & 0x3f) << 6) + ($n[3] & 0x3f);
    }
    elsif ($len == 1)    # just to be complete...
    {
        $n = ord ($str);
    }
    else
    {
        die "bad value [$str] for XmlUtf8Decode";
    }
    $hex ? sprintf ("&#x%x;", $n) : "&#$n;";
}

This will encode all non-ascii characters as `&#nnn;` where `nnn` is the code of the character in Unicode. This seems to display properly at least in Opera on Linux.

Let me know if this solves your problem.
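A short usage sketch may help show what `safe_encode` produces. The two subs below are copied from the reply above (reformatted only); the sample strings are made up, and the sketch assumes the input is a byte string that really is UTF-8.

```perl
use strict;
use warnings;

sub safe_encode {
    my $str = shift;
    $str =~ s{([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)}
             {XmlUtf8Decode($1)}egs;
    return $str;
}

sub XmlUtf8Decode {
    my ($str, $hex) = @_;
    my $len = length($str);
    my $n;
    if ($len == 2) {
        my @n = unpack "C2", $str;
        $n = (($n[0] & 0x3f) << 6) + ($n[1] & 0x3f);
    }
    elsif ($len == 3) {
        my @n = unpack "C3", $str;
        $n = (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6) + ($n[2] & 0x3f);
    }
    elsif ($len == 4) {
        my @n = unpack "C4", $str;
        $n = (($n[0] & 0x0f) << 18) + (($n[1] & 0x3f) << 12)
           + (($n[2] & 0x3f) << 6) + ($n[3] & 0x3f);
    }
    elsif ($len == 1) {
        $n = ord($str);
    }
    else {
        die "bad value [$str] for XmlUtf8Decode";
    }
    return $hex ? sprintf("&#x%x;", $n) : "&#$n;";
}

# UTF-8 bytes for "café €": C3 A9 is U+00E9, E2 82 AC is U+20AC.
my $encoded = safe_encode("caf\xc3\xa9 \xe2\x82\xac");
print "$encoded\n";                          # caf&#233; &#8364;

# The optional second argument switches to hexadecimal references.
print XmlUtf8Decode("\xc3\xa9", 1), "\n";    # &#xe9;

# Caveat: a Latin-1 byte is mistaken for a UTF-8 lead byte and swallows
# the characters that follow it, so the result is a wrong reference.
my $mangled = safe_encode("\xe1 b");
print "$mangled\n";
```

The last call illustrates why the encoding of the incoming data matters: if the chatterbox sends Latin-1 rather than UTF-8, this routine will silently produce wrong code points instead of failing.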
Re: Unescaped entities in XML by mr.nick (Chaplain) on Jun 11, 2001 at 02:32 UTC
I know you said that you'd rather not use another module, but considering that it's probably already in use by a module that you use in framechat, think about using URI::Escape. Something like:

$xml = uri_escape($xml, "\x80-\xff");

should preserve the hi-bit characters without making XML::Simple barf, right? (It seems to work correctly with pmchat).

Update: Duh, right: HTML entities != HTTP Escaped characters.
Re: Re: Unescaped entities in XML by epoptai (Curate) on Jun 11, 2001 at 06:02 UTC
Hmm, maybe the title should mention 'encoding' rather than 'escaping'. (done)

I need to turn ¶ into &#182; not %B6 as uri_escape does - and really just want to extend that regex, or use a series of them, to encode all those nasty characters.
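If the stream is known to be single-byte ISO-8859-1, the conversion asked for here can be done with one substitution and no extra module. The following is a minimal sketch under that assumption; the sub name `latin1_to_char_refs` is hypothetical, and the first and last lines are taken from the jcwren snippet in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes $xml holds single-byte ISO-8859-1 data,
# where each byte value equals the Unicode code point.
sub latin1_to_char_refs {
    my ($xml) = @_;
    $xml =~ s/[\r\n\t]//g;                             # as in the original snippet
    $xml =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/ge;   # high-bit bytes -> &#NNN;
    $xml =~ tr/\x00-\x1f/_/;                           # remaining control chars are not legal in XML 1.0
    return $xml;
}

my $out = latin1_to_char_refs("<message>\xb6 caf\xe9</message>\n");
print "$out\n";   # <message>&#182; caf&#233;</message>
```

This only gives correct code points for Latin-1. If the bytes are really Windows-1252 or UTF-8, the references in the 128-159 range or for multi-byte sequences would point at the wrong characters.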
Re: Regex to encode entities in XML by mirod (Canon) on Jun 11, 2001 at 12:32 UTC
Just to follow up on this problem:

The problem with data coming from a browser is often that XML::Simple cannot load a file because XML::Parser normally expects a UTF-8 encoded document and dies when fed latin1 characters or HTML entities.

My quick'n dirty trick in this case is to add the following at the top of the document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE CHATTER SYSTEM "dummy.dtd" []>

The first line is the XML declaration, which in this case includes an encoding declaration that tells XML::Parser to accept latin1 characters. You can also use the `ProtocolEncoding` option in XML::Parser to get the same result. This takes care of characters above 127 (note that this will not be of much help if lexicon starts posting Japanese characters in shift-JIS, for example).

The second line takes care of HTML entities. By declaring a fake Document Type Definition (DTD) we tell the parser that entities might be defined in an external file. The file does not even have to exist: XML::Parser will not try to open it by default, but the effect is that it will not complain about undefined entities.

Of course then XML::Parser will happily convert characters above 127 to UTF-8 and we have to resort to tricks to convert them back to latin1, but at least we have loaded the document and we can work with it.
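The prolog trick can be sketched as a small helper that prepends the two lines unless the document already carries an XML declaration. The helper name `add_latin1_prolog` and the sample message are assumptions for illustration; the parse step only runs if XML::Parser happens to be installed, and its outcome is reported rather than assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend the two prolog lines described above,
# unless the document already starts with an XML declaration.
sub add_latin1_prolog {
    my ($xml, $root) = @_;
    return $xml if $xml =~ /\A\s*<\?xml\b/;
    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
         . qq{<!DOCTYPE $root SYSTEM "dummy.dtd" []>\n}
         . $xml;
}

# Sample data: one raw latin1 byte and one HTML entity.
my $raw = "<CHATTER><message>caf\xe9 &eacute;</message></CHATTER>";
my $doc = add_latin1_prolog($raw, 'CHATTER');

# Optional check, skipped when XML::Parser is not available.
if (eval { require XML::Parser; 1 }) {
    my $ok = eval { XML::Parser->new->parse($doc); 1 };
    print $ok ? "parsed\n" : "parse failed: $@";
}
else {
    print "XML::Parser not installed, prolog added only\n";
}
```

Passing `ProtocolEncoding => 'ISO-8859-1'` to `XML::Parser->new` is the alternative mentioned in the reply for the encoding half of the problem; it does not replace the DOCTYPE line for undefined entities.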
Re: Regex to encode entities in XML by ChemBoy (Priest) on Jun 12, 2001 at 00:04 UTC
Maybe I'm missing something here, but have you looked at HTML::Entities? If you don't want to include another module, you could probably rip out the necessary guts from it to do what you want, but it's a pretty light-weight module anyway.

If God had meant us to fly, he would never have given us the railroads. --Michael Flanders
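For comparison, this is roughly what the HTML::Entities route looks like when restricted to the high-bit range. It is a sketch that assumes Latin-1 input and only runs the calls if the module is installed. `encode_entities` emits named entities, which a plain XML parser will not know without a DTD, so `encode_entities_numeric` is probably the better fit for this thread.

```perl
use strict;
use warnings;

my ($named, $numeric);
if (eval { require HTML::Entities; 1 }) {
    my $text = "caf\xe9 \xb6";    # latin1 bytes for "café ¶"

    # The second argument limits encoding to the listed character range.
    $named   = HTML::Entities::encode_entities($text, "\x80-\xff");
    $numeric = HTML::Entities::encode_entities_numeric($text, "\x80-\xff");

    print "$named\n";      # caf&eacute; &para;
    print "$numeric\n";    # caf&#xE9; &#xB6;
}
else {
    print "HTML::Entities is not installed\n";
}
```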
Re: Regex to encode entities in XML by Anonymous Monk on Sep 02, 2009 at 13:38 UTC
Hi there. I was looking for a solution to a similar problem earlier today, complicated by the fact that a string might already have entities in it, but might not. (XML files passed in by external content suppliers - ugh!)

I eventually built a solution around regexes that I thought I'd share in case anybody else finds it useful. It should be easy to customise by altering just the first two lines: the first is the text to get replaced, the second the text to replace it with.

#NOTE: the first entry here is an extended regex that says, match an &
#that is NOT followed by between 2 and 4 word characters and a semicolon.
#This should prevent it from double-encoding entities that already exist
#and hence corrupting the XML.
my @entities_bare    = qw/&(?!\w{2,4};) " ' < >/;
my @entities_encoded = qw/&amp; &quot; &apos; &lt; &gt;/;

sub encode_entities {
    my $string = shift;

    #print "trace: in encode_entities\n";
    for (my $n = 0; $n < scalar @entities_bare; ++$n) {
        #print "encode_entities: searching for ".$entities_bare[$n]." to replace with ".$entities_encoded[$n]."...\n";
        if (not $string =~ s/$entities_bare[$n]/$entities_encoded[$n]/g) {
            #print "encode_entities: WARNING: found no entities for ".$entities_bare[$n].".\n";
        }
    }
    return $string;
}
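A quick trial of this approach shows both what the negative lookahead buys and where it falls short. The first part below is a condensed copy of the code above. The `$bare_amp` pattern at the end is not from the post; it is one possible extension, offered as an assumption, that also recognises numeric references.

```perl
use strict;
use warnings;

my @entities_bare    = qw/&(?!\w{2,4};) " ' < >/;
my @entities_encoded = qw/&amp; &quot; &apos; &lt; &gt;/;

sub encode_entities {
    my $string = shift;
    for my $n (0 .. $#entities_bare) {
        $string =~ s/$entities_bare[$n]/$entities_encoded[$n]/g;
    }
    return $string;
}

# An already-encoded &amp; is left alone; the bare & and the tags are encoded.
my $mixed = encode_entities(q{a & b &amp; <c>});
print "$mixed\n";      # a &amp; b &amp; &lt;c&gt;

# Limitation: '#' is not a word character, so an existing numeric
# reference is double-encoded.
my $numeric = encode_entities(q{&#182;});
print "$numeric\n";    # &amp;#182;

# Hypothetical variant (not from the post): also skip &#NNN; and &#xHH;.
my $bare_amp = qr/&(?!(?:\w{2,8}|#\d+|#x[0-9A-Fa-f]+);)/;
(my $fixed = q{&#182; & x}) =~ s/$bare_amp/&amp;/g;
print "$fixed\n";      # &#182; &amp; x
```

The variant still treats any `&word;` sequence as an entity, whether or not the name is defined, so it is a heuristic rather than a validator.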

