PerlMonks  

Re^3: Perl's encoding versus UTF8 octets

by GrandFather (Saint)
on Jan 13, 2021 at 05:57 UTC ( [id://11126838]=note )


in reply to Re^2: Perl's encoding versus UTF8 octets
in thread Perl's encoding versus UTF8 octets

What is actually stored in your files? The literal text you provided in the string, or something else? If it is the text in the string then you can:

use strict;
use warnings;
use Encode;

binmode STDOUT, ':utf8';

print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";

# Slurp the sample text from the DATA section.
my $asText = do {local $/; <DATA>};

# Step 1: turn each literal \xNN escape into the byte it names.
$asText =~ s!\\x(..)!chr(hex($1))!ge;

# Step 2: decode those UTF-8 bytes into Perl characters.
my $newcode = decode('utf8', $asText);

print "<p>$newcode</p>\n";

__DATA__
\xc3\xa4 <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span> <span class="posg pos">Substantiv, Neutrum, das</span> <span class="vg v"> \xc3\x84 \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

Prints:

Content-Type:text/html; charset=utf-8
Content-Language: utf8;

<p>ä <span class="sy">ä</span>, <span class="sy">Ä</span> <span class="posg pos">Substantiv, Neutrum, das</span> <span class="vg v"> Ä ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>
</p>
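
The two steps in the script above can also be looked at in isolation. What follows is only a minimal sketch: the helper name `unescape_utf8_octets` is made up for illustration and does not come from the thread, and it assumes the input holds only ASCII text plus literal `\xNN` escapes (`decode` would complain about a string that already contains wide characters). Unlike the original `(..)` pattern, this version only matches two hex digits, which is a deliberate tightening rather than the original behaviour.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): turn literal "\xNN" escapes
# that spell out UTF-8 octets into a Perl character string.
sub unescape_utf8_octets {
    my ($text) = @_;

    # Step 1: replace each literal backslash-x-hex-hex with the one byte it names.
    (my $octets = $text) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Step 2: interpret that byte sequence as UTF-8 and return characters.
    return decode('UTF-8', $octets);
}

# Single quotes keep the backslashes literal, as they would be in a file.
my $raw   = 'das \xc3\xa4';
my $chars = unescape_utf8_octets($raw);

binmode STDOUT, ':encoding(UTF-8)';
print length($raw), " -> ", length($chars), "\n";   # 12 -> 5
print "$chars\n";                                   # das ä
```

Stopping after step 1 leaves two separate characters (0xC3 and 0xA4) where one "ä" is wanted, and a UTF-8 output layer would then encode each of them again. That is why the substitution on its own is not enough.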
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

Re^4: Perl's encoding versus UTF8 octets
by Polyglot (Chaplain) on Jan 13, 2021 at 07:00 UTC

    My file format isn't exactly like what I gave earlier, but I don't think the difference is significant in this case. I'm doing some tag reductions and reformatting to prepare it for DB insertion, and there was no sense in posting all of the bloat here.

    I'd tried something before that had given me the results to be obtained by your line:

    $asText =~ s!\\x(..)!chr(hex($1))!ge;

    However, using that in conjunction with the subsequent "decode" process did the trick! I guess it required that specific TWO-STEP conversion process, and all of my attempts had stopped at one--at least within my code's conversion, not counting setting the file encodings on reading and writing. I'm no stranger to encoding issues, but hadn't worked with these slash-x octets before (I don't even know what they're supposed to be called), and these really threw me for a loop.

    So, THANK YOU so much!

    Blessings,

    ~Polyglot~
