
HTML entities converted to Non-Latin-1 format...

by vishNugupt (Novice)
on Apr 20, 2008 at 15:44 UTC ( [id://681793]=perlquestion )

vishNugupt has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am parsing HTML files using HTML::Parser, and it converts some of the general entities to characters outside the Latin-1 character set. I need to get rid of these characters. Is there any way I could do that?

Regards,
Atul.

Replies are listed 'Best First'.
Re: HTML entities converted to Non-Latin-1 format...
by pc88mxer (Vicar) on Apr 20, 2008 at 16:34 UTC
    HTML::Parser converts general entities to Unicode code-points.

    Can you give us an example of what you want to do with the HTML (i.e. sample HTML input and the output you are trying to achieve)? I'm not sure you just want to get rid of them. But if that's what you really want to do, the following will convert the code-points to Latin-1, removing any code-points that Latin-1 cannot represent:

    use Encode; my $latin1 = encode('iso-8859-1', $code_points, sub { '' });
    However, as I said, I think your problem might be handled better in a different way.
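To make the one-liner above concrete, here is a minimal sketch of the coderef fallback. The sample string is made up for illustration; only `encode` and the `sub { '' }` fallback come from the reply.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: decoded text with two characters outside Latin-1
# (U+2019 RIGHT SINGLE QUOTATION MARK and U+20AC EURO SIGN).
my $code_points = "caf\x{e9} \x{2019} \x{20ac}5";

# The coderef is called once per character that iso-8859-1 cannot
# represent; returning '' drops that character from the output.
my $latin1 = encode('iso-8859-1', $code_points, sub { '' });

# Show the resulting bytes as decimal values.
printf "%vd\n", $latin1;
```

Note that U+00E9 survives because Latin-1 can represent it, while the two other characters vanish and leave their surrounding spaces behind.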
      My problem is that I embed the extracted text in an XML document for further processing by a tool that does not support Unicode, and these characters might break that tool. For example, this is one of the HTML files I am parsing. (Sorry it is big; I have used readmore to hide it by default. Is there any other way I could hide it? Sorry again for this huge HTML code.) It is converted to text as follows. One can see the extra blank spaces in there; those are Unicode characters, which might be a problem as I described above. Thanks in advance for any help.

        You don't specify what the destination non-Unicode-aware program is or how this data is going to be used, but if you don't want to lose information, you could tell encode to replace those characters with HTML or XML character references (&#NNN; or &#xHHHH;).

        $latin1 = encode('iso-8859-1', $original, Encode::FB_HTMLCREF);
        # or
        $latin1 = encode('iso-8859-1', $original, Encode::FB_XMLCREF);
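A small sketch of the two fallbacks side by side may help here. The input string is a made-up example; `FB_HTMLCREF` and `FB_XMLCREF` are the Encode constants named above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: U+20AC (EURO SIGN) is not in Latin-1, U+00E9 is.
my $original = "price: \x{20ac}5 caf\x{e9}";

# Decimal character reference for each unrepresentable character.
my $html = encode('iso-8859-1', $original, Encode::FB_HTMLCREF);

# Hexadecimal character reference for each unrepresentable character.
my $xml = encode('iso-8859-1', $original, Encode::FB_XMLCREF);

# Both results are pure Latin-1 byte strings; print them in an ASCII-safe
# form so the references are easy to see.
for my $bytes ($html, $xml) {
    (my $shown = $bytes) =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    print "$shown\n";
}
```

Since the output is going into XML, either form should be readable by an XML parser, but whether the downstream tool resolves character references is something only a test against that tool can confirm.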
Re: HTML entities converted to Non-Latin-1 format...
by moritz (Cardinal) on Apr 20, 2008 at 18:13 UTC
    $text =~ s/[^\x{00}-\x{ff}]//g;

    But this is a bad idea since you'll lose information. It is better to learn about Unicode and how to handle it in Perl.
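To see what "losing information" means in practice, here is a rough sketch that applies the substitution to a made-up string and counts what was thrown away. The sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: curly quotes (U+201C, U+201D) and a Greek alpha
# (U+03B1) lie above U+00FF; the i-diaeresis (U+00EF) does not.
my $text = "na\x{ef}ve \x{201c}quote\x{201d} \x{3b1}";

# Count the characters the substitution is about to delete.
my $lost = () = $text =~ /[^\x{00}-\x{ff}]/g;

# Strip everything above U+00FF, working on a copy of the original.
(my $stripped = $text) =~ s/[^\x{00}-\x{ff}]//g;

printf "removed %d of %d characters\n", $lost, length $text;
```

The quotes and the alpha are gone for good after this, which is why the character-reference fallbacks in the other reply are usually the safer choice.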
