Sorry to be responding so late on this -- maybe you've already worked out everything I'm going to say, but I'll say it anyway.
You wrote: "I want to use Encode::from_to(...) to put everything into iso-8859-1 in (probable) good form."
No. If you're expecting to pull in data from various web sites that might use several different single-byte legacy encodings, most of them will not be directly mappable to iso-8859-1. The whole problem with the legacy single-byte encodings is that, to the extent they differ from one another, you cannot map from one to another without losing some characters.

Actually, to the extent that some 8-bit encodings cover fewer displayable characters than others (e.g. iso-8859-* never use 0x80-0x9f for displayable characters, whereas the Windows and Mac code pages do), loss of information might only happen in one direction. But if your "from" encoding happens to be 8859-2 and your "to" encoding happens to be 8859-1, the conversion simply cannot work for any character that 8859-1 lacks (for example "ł" or "ő").
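To make that concrete, here is a minimal sketch using Encode's decode/encode. The sample bytes (the Polish word "łódź" encoded as iso-8859-2) and the variable names are my own illustration, not something from your data. It only shows that the same text survives a trip through utf8 but is rejected when strictly encoded as iso-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Example data (an assumption for illustration): "łódź" as iso-8859-2 bytes,
# where ł = 0xB3, ó = 0xF3, ź = 0xBC.
my $latin2_bytes = "\xB3\xF3d\xBC";
my $text         = decode('iso-8859-2', $latin2_bytes);

# To utf8 and back: every character survives.
my $utf8_bytes      = encode('UTF-8', $text);
my $utf8_round_trip = decode('UTF-8', $utf8_bytes);

# To iso-8859-1 with strict checking: encode() croaks, because "ł" and "ź"
# have no iso-8859-1 code point. LEAVE_SRC keeps $text unmodified.
my $latin1_ok = eval {
    encode('iso-8859-1', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

print "utf8 round trip intact: ", ($utf8_round_trip eq $text ? "yes" : "no"), "\n";
print "iso-8859-1 conversion:  ", ($latin1_ok ? "ok" : "failed"), "\n";
```

Without a CHECK argument, encode() would not croak but would quietly put a substitution character in place of each unmappable letter, which is arguably worse because the loss goes unnoticed.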

So, always convert from a non-unicode encoding to utf8, never from one legacy encoding to another. As for guessing correctly from among several 8-bit code pages that cover different latin-alphabet-based languages, the sad truth remains that Encode::Guess will have a hard time getting it right. You need a certain amount of language modeling data (validated by manual inspection and labeling as to language and character set) and some simple statistics on your unknown input data in order to make a proper guess.
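Here is a rough sketch of what I mean, under heavy assumptions. The "language model" is just a hand-typed set of letters expected in Polish text, the sub name score_candidates is hypothetical, and the sample bytes are made up. Real use would need frequency data from labeled text in each language you expect. The last lines show why Encode::Guess alone does not settle it: any byte string is valid in both single-byte candidates, so it hands back a plain string describing the ambiguity instead of an encoding object.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical stand-in for real language-model data: the lowercase
# non-ASCII letters of Polish (a-ogonek, c-acute, e-ogonek, l-stroke,
# n-acute, o-acute, s-acute, z-acute, z-dot).
my %expected = map { $_ => 1 }
    split //, "\x{105}\x{107}\x{119}\x{142}\x{144}\x{F3}\x{15B}\x{17A}\x{17C}";

# Decode the bytes under each candidate encoding and count how many
# non-ASCII characters are ones the model expects to see.
sub score_candidates {
    my ($bytes, @candidates) = @_;
    my %score;
    for my $enc (@candidates) {
        my $copy = $bytes;                 # decode a copy, keep the input intact
        my $text = decode($enc, $copy);
        $score{$enc} = grep { $expected{$_} } ($text =~ /([^\x00-\x7F])/g);
    }
    return %score;
}

my $unknown = "\xB3\xF3d\xBC";             # made-up sample input
my %score   = score_candidates($unknown, 'iso-8859-1', 'iso-8859-2');
my ($best)  = sort { $score{$b} <=> $score{$a} } keys %score;
print "best candidate by letter count: $best\n";

# Encode::Guess on the same input: both candidates decode cleanly,
# so the result is an error string, not an encoding object.
my $guess = guess_encoding($unknown, qw(iso-8859-1 iso-8859-2));
print "Encode::Guess: ", (ref $guess ? $guess->name : "undecided ($guess)"), "\n";
```

A plain count of expected letters is about the crudest statistic that could work. With short inputs, or with languages whose code pages overlap heavily, it will still guess wrong, which is why the labeled data matters.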


In reply to Re: What encoding am I (probably) using? by graff
in thread What encoding am I (probably) using? by tphyahoo
