
Re: Perl's encoding versus UTF8 octets

by GrandFather (Saint)
on Jan 13, 2021 at 05:10 UTC ( [id://11126832]=note )


in reply to Perl's encoding versus UTF8 octets

When I run your code from Komodo IDE, which understands UTF8, it prints:

Content-Type:text/html; charset=utf-8
Content-Language: utf8;

<p>ä <span class="sy">ä</span>, <span class="sy">Ä</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> Ä &#603;&#720; das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span></p>

Is that not what you expected to see? Maybe the terminal you are using doesn't understand UTF8?

Update: note that you don't need use utf8; here. That pragma is only required if you want to use UTF8 in your source code. You don't do that: you create a string containing UTF8 characters, but the source code itself is pure 7-bit ASCII.
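As a rough illustration of that point, the sketch below builds the same character three ways from pure ASCII source, so no use utf8; is involved. The variable names are illustrative only, and only the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The source file contains only 7-bit ASCII, so "use utf8;" is not needed.

# 1. The two UTF-8 octets of U+00E4, written as \x escapes, then decoded.
my $from_octets = decode('UTF-8', "\xc3\xa4");

# 2. The code point written directly with the \x{...} escape.
my $from_escape = "\x{E4}";

# 3. The code point built at run time.
my $from_chr = chr(0xE4);

# All three hold the same single character.
printf "lengths: %d %d %d\n",
    length($from_octets), length($from_escape), length($from_chr);
printf "same character: %s\n",
    ($from_octets eq $from_escape && $from_escape eq $from_chr) ? 'yes' : 'no';
```

The pragma would only matter if the letter ä itself were typed into a string literal in the source file.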

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

Re^2: Perl's encoding versus UTF8 octets
by Polyglot (Chaplain) on Jan 13, 2021 at 05:27 UTC

    Yes, my code does send that to the browser, as I noted above. I just haven't found a way to send the converted text to a file. Unfortunately, the file sizes I'm working with are well above 100 MB, and it's simply not practical to try to run it all through the browser, click on "view source," copy it out, and paste it into a text file. In fact, even TextWrangler chokes on these file sizes already. I'm trying to convert them before pumping them into a database.

    Blessings,

    ~Polyglot~

      What is actually stored in your files? The literal text you provided in the string, or something else? If it is the text in the string, then you can convert it like this:

      use strict;
      use warnings;
      use Encode;

      binmode STDOUT, ':utf8';
      print "Content-Type:text/html; charset=utf-8\n";
      print "Content-Language: utf8;\n\n";

      # Slurp the sample text from the DATA section.
      my $asText = do {local $/; <DATA>};
      # Step 1: turn each literal \xNN escape into the octet it names.
      $asText =~ s!\\x(..)!chr(hex($1))!ge;
      # Step 2: decode the resulting UTF8 octets into Perl characters.
      my $newcode = decode('utf8', $asText);
      print "<p>$newcode</p>\n";

      __DATA__
      \xc3\xa4 <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span>
      <span class="posg pos">Substantiv, Neutrum, das</span>
      <span class="vg v"> \xc3\x84 \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

      Prints:

      Content-Type:text/html; charset=utf-8
      Content-Language: utf8;

      <p>ä <span class="sy">ä</span>, <span class="sy">Ä</span>
      <span class="posg pos">Substantiv, Neutrum, das</span>
      <span class="vg v"> Ä &#603;&#720; das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>
      </p>
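For inputs of 100 MB or more, which are too large to slurp comfortably or to push through a browser, the same two steps could be applied line by line and written straight to a file. This is a sketch only: convert_file and its two path arguments are hypothetical names, the pattern is tightened to hex digits (the code above uses ..), and it assumes that no escape sequence is split across a line break.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Convert literal \xNN escapes in $in_path to real UTF-8 text in $out_path,
# one line at a time so that a large file never has to fit in memory.
# Both paths are hypothetical; substitute your own.
sub convert_file {
    my ($in_path, $out_path) = @_;

    # Read raw octets; let the output layer encode characters as UTF-8.
    open my $in,  '<:raw',             $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        # Step 1: replace each four-character escape with the octet it names.
        $line =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        # Step 2: interpret the octets as UTF-8, yielding Perl characters.
        print {$out} decode('UTF-8', $line);
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Run as: perl convert.pl input.txt output.txt
convert_file(@ARGV) if @ARGV == 2;
```

Writing through an :encoding(UTF-8) layer means the output file holds ordinary UTF-8 text that can be loaded into a database without passing through STDOUT.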

        My file format isn't exactly like what I gave earlier, but I don't think the difference is significant in this case. I'm doing some tag reductions and reformatting to prepare it for DB insertion, and there was no sense in posting all of the bloat here.

        I had tried something earlier that gave me the same result as your line:

        $asText =~ s!\\x(..)!chr(hex($1))!ge;

        However, using that in conjunction with the subsequent "decode" step did the trick! I guess it required that specific TWO-STEP conversion, and all of my attempts had stopped at one step, at least within my code's conversion (not counting setting the file encodings on reading and writing). I'm no stranger to encoding issues, but I hadn't worked with these slash-x octets before (I don't even know what they're supposed to be called), and they really threw me for a loop.

        So, THANK YOU so much!

        Blessings,

        ~Polyglot~
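The two-step conversion discussed above can be watched one stage at a time. In a Perl string literal, \xNN is a hexadecimal escape; in the files it is just four literal characters. The sketch below uses illustrative variable names and a single sample character, and it tightens the pattern to hex digits.

```perl
use strict;
use warnings;
use Encode qw(decode);

# As stored in the file: four literal characters per octet
# (backslash, "x", and two hex digits). Single quotes keep them literal.
my $escaped = '\xc3\xa4';

# Step 1 alone: each escape becomes one octet, so the string now holds
# the two bytes 0xC3 0xA4, which are not yet a single character.
(my $octets = $escaped) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Step 2: decoding those octets as UTF-8 gives the one character U+00E4.
my $text = decode('UTF-8', $octets);

printf "escaped: %d chars, octets: %d, decoded: %d\n",
    length($escaped), length($octets), length($text);
# prints "escaped: 8 chars, octets: 2, decoded: 1"
```

Stopping after step 1 leaves two separate bytes, which typically show up as two wrong characters when written through a UTF-8 output layer.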
