http://qs321.pair.com?node_id=986739

nvivek has asked for the wisdom of the Perl Monks concerning the following question:

Dear friends,

I am using the XML::Simple XMLin function to read data from an XML file and convert it into a Perl hash. Whenever the XML data contains control characters, I receive the error in the title ("Input is not proper UTF-8, indicate encoding"). How can I resolve this error so that the XML data converts to a hash? I tried declaring <?xml version="1.0" encoding="iso-8859-1"?> and <?xml version="1.0" encoding="UTF-8"?>, but I am still unable to parse the XML data into a Perl hash. Could anyone kindly suggest a solution?

MY XML DATA

<EVENT>
  <CALLDETAILS>
    <STATIONID>01</STATIONID>
    <CALLSESSIONID>00000000020712130852059</CALLSESSIONID>
    <EXTENSIONNO>8143</EXTENSIONNO>
    <ZIVAHCHANNELID>172.16.39.88</ZIVAHCHANNELID>
    <SUBCHANNELID>0</SUBCHANNELID>
    <AGENTID>NULL</AGENTID>
    <CALLERID><A0>jW<B7>h<AE><F5><BF><8A>7a<B7><D8>T<D9>^N</CALLERID>
    <CALLEEID>NULL</CALLEEID>
    <CALLTYPE>IN</CALLTYPE>
    <RINGCOUNT>1</RINGCOUNT>
    <CALLTERMSTATUS>NO_CTI_DATA</CALLTERMSTATUS>
  </CALLDETAILS>
</EVENT>

Replies are listed 'Best First'.
Re: XML::Simple parser error : Input is not proper UTF-8, indicate encoding
by daxim (Curate) on Aug 10, 2012 at 13:32 UTC
    XML is not a container format for just any data. You can only put certain characters in it!

    I assume that in your weird notation

    <CALLERID><A0>jW<B7>h<AE><F5><BF><8A>7a<B7><D8>T<D9>^N</CALLERID>

    A0 etc in angles is equivalent to \xA0 in Perl, and ^N is equivalent to \cN in Perl. That last character (chr(14) == #x0E == U+000E) is illegal in XML.

    You must encode binary data for XML, Base64 is a common choice.
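A minimal sketch of the Base64 suggestion above, using the core MIME::Base64 module. The sample bytes are modelled on the CALLERID value in the question, and the encoding="base64" attribute is a hypothetical marker for the reader of the file, not something XML::Simple requires or interprets.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical raw caller ID bytes, modelled on the sample in the question.
my $raw = "\xA0jW\xB7h\xAE\xF5\xBF\x8A7a\xB7\xD8T\xD9\x0E";

# The empty second argument suppresses the line breaks that
# encode_base64 would otherwise insert.
my $encoded = encode_base64($raw, '');

# The element content is now plain ASCII, so no illegal character
# can reach the parser. The attribute name is an assumption.
my $xml = qq{<CALLERID encoding="base64">$encoded</CALLERID>};
print "$xml\n";

# After parsing, the consumer reverses the step to get the bytes back.
my ($payload) = $xml =~ m{>([^<]*)</CALLERID>};
my $decoded = decode_base64($payload);
print $decoded eq $raw ? "round trip ok\n" : "round trip failed\n";
```

The cost of this approach is that every consumer of the file has to know the field is encoded, which is why some marker such as the attribute shown here is usually worth adding.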

      I solved the problem by removing the non-printable characters before writing to the XML file. My regular expressions are as follows.
      # following is a code to remove non-printable characters in string including newline
      s/[^[:print:]]+//g;
      # this pattern won't remove newline char
      s/([\x00-\x09]+)|([\x0B-\x1F]+)//g;
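A narrower variant of the same idea, as a sketch: it removes only the characters that XML 1.0 forbids outright (the "Char" production in the specification) and so keeps tab, LF and CR, which the second pattern above also strips apart from LF. The function name strip_invalid_xml_chars is hypothetical, and the code assumes the input is a decoded character string; on a raw byte string the bytes \x80 to \xFF are treated as Latin-1 code points and left in place.

```perl
use strict;
use warnings;

# Remove every character outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Tab (\x09), LF (\x0A) and CR (\x0D) are therefore kept.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

# \x0E (the ^N from the question) and \x00 are dropped;
# the tab and the CR/LF pair survive.
my $clean = strip_invalid_xml_chars("7a\x0ET\tok\r\n\x00");
print $clean eq "7aT\tok\r\n" ? "cleaned as expected\n" : "unexpected result\n";
```

Note that stripping is lossy: if the caller ID bytes matter downstream, encoding them is the safer route.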
Re: XML::Simple parser error : Input is not proper UTF-8, indicate encoding
by BrowserUk (Patriarch) on Aug 10, 2012 at 13:32 UTC

    Your data isn't valid XML. The only control characters allowed are tab, CR and LF.

    You'd need to wrap your CALLERID data in a CDATA section, or encode it as character references (e.g. &#xa0;), before an XML parser will process it.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Character reference encoding does not help at all. The character itself is illegal, not its representation.
      $ echo '<root>&#x0e;</root>' | xmllint -
      -:1: parser error : xmlParseCharRef: invalid xmlChar value 14
      <root>&#x0e;</root>
                  ^

      Likewise CDATA is unsuitable:

      $ perl -e'print "<root><![CDATA[\x{0e}]]></root>"' | xmllint -
      -:1: parser error : PCDATA invalid Char value 14
      <root><![CDATA[]]></root>
                     ^

        Hm... maybe you need to update your copy of xmllint?

        "XML 1.1 extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected."

        From what I can make out, an encoding header is both obligatory and required to make sense of how entities should be interpreted.

