I think you are asking in the wrong place. Try asking the poor souls at http://www.javajunkies.org which encodings they are able to decode, and how. If necessary, come back then and ask how to encode the data into that encoding in Perl.
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Hi Jenda,
The problem is that I do not know how to find out what kind of characters are present on a web page (the one I grabbed the data from). Is there a rule of thumb for the encoding of the data that I get using useragent->get()? If I can have the data in UTF-8 format, I can use Java ( read junk ;-) ) to decode it.
Ranjan
I'm no expert on this, but based on my own limited experience (and watching others getting broader experience with it), I'd say it depends on which (human) languages you're dealing with, how many web sites, and which web sites.
If you're talking about sites/pages that are "mostly English with a few funny characters", the pages will often just use HTML entities (e.g. &eacute; for é, etc.).
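For what it's worth, here is a minimal core-Perl sketch of turning such entities back into real characters. The sub name decode_entities_sketch and the tiny %named table are made up for illustration and are far from complete; on a real page you would more likely reach for decode_entities() from the HTML::Entities module on CPAN.

```perl
use strict;
use warnings;

# Deliberately tiny table of named entities (illustration only, not complete).
my %named = (
    amp    => '&',       lt     => '<',       gt   => '>',       quot => '"',
    eacute => "\x{E9}",  egrave => "\x{E8}",  uuml => "\x{FC}",  nbsp => "\x{A0}",
);

# Replace numeric (&#233; / &#xE9;) and known named (&eacute;) references
# with the characters they stand for. Unknown names are left untouched.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s{&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
          defined($1)         ? chr(hex($1))
        : defined($2)         ? chr($2)
        : exists($named{$3})  ? $named{$3}
        :                       "&$3;"
    }ge;
    return $text;
}

# All three spellings of e-acute end up as the same character.
my $plain = decode_entities_sketch('caf&eacute; &#233; &#xE9; &bogus;');
```

The result is a string of Perl characters, not bytes, so it still has to be encoded (as UTF-8, say) before it is written out or handed to another program.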
Sometimes, the character encoding is specified somewhere in a MIME header or the HTML header, or maybe even in HTML comments. (This is typical if it's an open-standard character set, or a widely-used commercial one, like utf8, iso8859-whatever, Big5, ShiftJIS, GBK, etc.)
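When the page does declare its encoding, getting to UTF-8 can be sketched roughly as below, using only the core Encode module. The sub names sniff_charset and to_utf8_bytes are invented for this example, the regexes are a crude approximation of real header and meta parsing, and the ISO-8859-1 fallback is just an assumption. With LWP, the two inputs would come from $response->header('Content-Type') and $response->content, and $response->decoded_content does a similar job for you.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Look for a charset label: the HTTP Content-Type header first,
# then a <meta> tag near the top of the HTML. Returns undef if none found.
sub sniff_charset {
    my ($content_type, $bytes) = @_;
    if (defined $content_type
        && $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    # <meta> tags live in the head, so only scan the first few KB.
    my $head = substr($bytes, 0, 4096);
    if ($head =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    return;
}

# Decode the raw bytes with the declared encoding and re-encode as UTF-8.
# Falls back to ISO-8859-1 (an assumption of this sketch) when the label
# is missing or not known to Encode.
sub to_utf8_bytes {
    my ($content_type, $bytes) = @_;
    my $label = sniff_charset($content_type, $bytes);
    my $name  = (defined $label && find_encoding($label)) ? $label : 'ISO-8859-1';
    return encode('UTF-8', decode($name, $bytes));
}

# A Latin-1 e-acute (one byte) becomes the two-byte UTF-8 sequence.
my $utf8 = to_utf8_bytes('text/html; charset=ISO-8859-1', "caf\xE9");
```

The output is a plain byte string in UTF-8, which is the form a Java program on the other end can read without knowing anything about the original page.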
Other times (especially when the site is presenting stuff in two or more languages on the same page), the character-set info is tucked away in font tags. Worst of all are the South Asian languages (Hindi, Bengali, Tamil, etc.) where the font rendering is kinda tough, and various major web sites come up with very different solutions -- i.e. incompatible font encodings -- which means that when you visit one of these sites the first time, you have to download their font in order to read the text. Converting this stuff to any sort of standard character set is a supreme pain in the a**.
Basically, the answer is: there is no general solution -- but if your task is limited to a few sites/languages/character sets, you can get something to work within those bounds.
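If the job really is limited to a handful of sites, one way to "get something to work" is a small per-site table, roughly like the sketch below. The host names and their encodings are hypothetical placeholders, decode_for_site is a made-up name, and the "try strict UTF-8 first, then fall back" order is my own assumption rather than any standard rule.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical hosts and encodings; fill these in from what each site
# actually serves.
my %site_encoding = (
    'news.example.jp' => 'Shift_JIS',
    'www.example.tw'  => 'Big5',
);

# Return the page text as Perl characters.
sub decode_for_site {
    my ($host, $bytes) = @_;

    # Bytes that pass a strict UTF-8 check are rarely anything else,
    # so try that first. FB_CROAK makes decode() die on malformed input.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Otherwise use what we know about the site, or guess Latin-1.
    my $name = $site_encoding{$host} || 'ISO-8859-1';
    return decode($name, $bytes);
}

# Latin-1 bytes from a host that is not in the table.
my $text = decode_for_site('www.example.com', "caf\xE9");
```

This obviously does nothing for the sites with home-grown font encodings; those need a hand-built mapping per font.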