HTML entities converted to Non-Latin-1 format...

vishNugupt has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: HTML entities converted to Non-Latin-1 format... by pc88mxer (Vicar) on Apr 20, 2008 at 16:34 UTC
`HTML::Parser` converts general entities to Unicode code points. Can you give us an example of what you want to do with the HTML (i.e. sample HTML input and the output you are trying to achieve)? I'm not sure you just want to get rid of them. But if that's what you really want to do, the following will convert the code points to Latin-1, removing any that Latin-1 cannot represent: `use Encode; my $latin1 = encode('iso-8859-1', $code_points, sub { '' });` However, as I said, I think your problem might be handled better in a different way.
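For illustration, here is a minimal runnable sketch of the one-liner above. The sample string is made up (it is not taken from the poster's HTML); it mixes one character that Latin-1 can represent with one that it cannot.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: U+00E9 (e-acute) exists in Latin-1,
# U+2014 (em dash) does not.
my $code_points = "caf\x{e9} \x{2014} ok";

# The coderef is called for each character that cannot be encoded;
# returning the empty string drops that character.
my $latin1 = encode('iso-8859-1', $code_points, sub { '' });

# Print the resulting bytes as hex, so the output is terminal-safe.
print join(' ', map { sprintf '%02x', ord } split //, $latin1), "\n";
```

The em dash simply disappears from the output, which leaves the two spaces around it adjacent to each other.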
Re^2: HTML entities converted to Non-Latin-1 format... by vishNugupt (Novice) on Apr 20, 2008 at 19:40 UTC
My problem is that I embed the extracted text in an XML document for further processing by a tool that does not support Unicode, and these characters might break that tool. For example, this is one of the HTML files I am parsing. (Sorry it is big; I have used readmore to hide it by default. Is there any other way I could hide it? Sorry again for this huge HTML code.) Read more... (79 kB) It is converted to text as follows. Read more... (2 kB) One can see the extra blank spaces in there; those are Unicode characters, which might be a problem as I described above. Thanks in advance for any help.
Re^3: HTML entities converted to Non-Latin-1 format... by clinton (Priest) on Apr 20, 2008 at 20:52 UTC
You don't specify what the destination non-Unicode-aware program is or how this data is going to be used, but if you don't want to lose information, you could tell `encode` to replace those characters with HTML or XML character references (`&#NNN;` or `&#xHHHH;`): `$latin1 = encode('iso-8859-1', $original, Encode::FB_HTMLCREF);` or `$latin1 = encode('iso-8859-1', $original, Encode::FB_XMLCREF);`
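A small sketch of both fallbacks, using a made-up sample string. One caveat, going by the Encode documentation: when CHECK is set and does not include `LEAVE_SRC`, `encode` may modify the source string in place, so this sketch ORs in `Encode::LEAVE_SRC` to keep `$original` intact for the second call.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: U+2014 (em dash) is not representable in Latin-1.
my $original = "caf\x{e9} \x{2014} ok";

# Decimal character reference for each unmappable character.
my $html = encode('iso-8859-1', $original,
                  Encode::FB_HTMLCREF | Encode::LEAVE_SRC);

# Hexadecimal character reference for each unmappable character.
my $xml  = encode('iso-8859-1', $original,
                  Encode::FB_XMLCREF | Encode::LEAVE_SRC);

# Both results are pure Latin-1 bytes; e-acute is the single byte 0xE9.
print "source still has ", length($original), " characters\n";
print "HTMLCREF result has ", length($html), " bytes\n";
print "XMLCREF result has ",  length($xml),  " bytes\n";
```

Whether the downstream tool resolves such references is a separate question. If it treats them as literal text, nothing is lost, but nothing is displayed as the original character either.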
Re: HTML entities converted to Non-Latin-1 format... by moritz (Cardinal) on Apr 20, 2008 at 18:13 UTC
`$text =~ s/[^\x{00}-\x{ff}]//g;` But this is a bad idea, since you'll lose information. Better to read up on Unicode and how to handle it in Perl.
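To make the information loss concrete, this sketch counts what the substitution throws away before applying it. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample containing one character above U+00FF.
my $text = "caf\x{e9} \x{2014} ok";

# Count the characters that are about to be discarded.
my $lost = () = $text =~ /[^\x{00}-\x{ff}]/g;

# Work on a copy so the original text is still available afterwards.
(my $stripped = $text) =~ s/[^\x{00}-\x{ff}]//g;

print "dropped $lost character(s), ", length($stripped), " remain\n";
```

Note that `$stripped` is still a Perl character string, not Latin-1 bytes; it would still need `encode('iso-8859-1', ...)` or an output layer before being written out.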