PerlMonks  

This thread is getting long, and I have a couple screenfuls of comments, output, and source, which I'll put between readmore tags to save scrollfingers...

This is a pretty straightforward way to deal with Unicode and UTF-8.

With respect, I don't think the solution you give is gonna do the trick. Let me add that I'm no Windows guru, but I find myself unable to replicate your result on the platform OP has. The reason I looked at this thread is that I'm breaking in my new Windows laptop with Strawberry Perl, and I wanted to see if I could do the basic things that OP seeks. In my experience, when you're working with Russian, there is a marsh of mojibake to wade through before results obtain.

Obviously, you need to save your source code UTF-8 encoded.

Is this a thing? To my understanding, it is up to the software that opens the file to decide what its encoding is. In the file properties for the script I post here, there is no such option. This is output from a version of the script that shows the data in different formats. I'm gonna try pre tags here:


C:\Users\tblaz\Documents\evelyn>perl 2.cyr.pl
-------
[
  {
    name => "\x{411}\x{418}\x{411}\x{41B}\x{418}\x{41E}\x{422}\x{415}\x{41A}\x{410}\x{420}\x{42C}",
    recentactions => 38,
    userid => 1686692,
  },
  {
    name => "\x{411}\x{430}\x{431}\x{43A}\x{438}\x{43D}\x{44A} \x{41C}\x{438}\x{445}\x{430}\x{438}\x{43B}\x{44A}",
    recentactions => 144,
    userid => 2208294,
  },
  {
    name => "\x{411}\x{430}\x{434}\x{43C}\x{430} \x{425}\x{430}\x{440}\x{43B}\x{443}\x{435}\x{432}\x{430}",
    recentactions => 4,
    userid => 2587115,
  },
]
-------
$VAR1 = [
          {
            'recentactions' => 38,
            'userid' => 1686692,
            'name' => "\x{411}\x{418}\x{411}\x{41b}\x{418}\x{41e}\x{422}\x{415}\x{41a}\x{410}\x{420}\x{42c}"
          },
          {
            'name' => "\x{411}\x{430}\x{431}\x{43a}\x{438}\x{43d}\x{44a} \x{41c}\x{438}\x{445}\x{430}\x{438}\x{43b}\x{44a}",
            'recentactions' => 144,
            'userid' => 2208294
          },
          {
            'name' => "\x{411}\x{430}\x{434}\x{43c}\x{430} \x{425}\x{430}\x{440}\x{43b}\x{443}\x{435}\x{432}\x{430}",
            'recentactions' => 4,
            'userid' => 2587115
          }
        ];
-------
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Мой тест</title>
</head>
<body>
Бабкинъ Михаилъ
</body>
</html>

Source that produced this:

#!/usr/bin/perl -w
use 5.011;
# use utf8; commenting out
## first time for this
use utf8::all;
use Encode;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTTP::Cookies;
use JSON;
use Data::Dump;
use Data::Dumper;
binmode STDOUT, ":utf8";
my $browser = LWP::UserAgent->new;

# they ask to use descriptive user-agent - not LWP defaults
# w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
$browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');

# I need cookies exchange enabled for auth
# here is doesn't matter but to give full LWP picture:
$browser->cookie_jar( {} );

# a very few queries can be done by GET - most of MediaWiki require POST
# so I do POST all around rather then remember where GET is allowed or not:
my $response = $browser->request(
    POST 'https://ru.wikipedia.org/w/api.php',
    {
        'format'        => 'json',
        'formatversion' => 2,
        'errorformat'   => 'bc',
        'action'        => 'query',
        'list'          => 'allusers',
        'auactiveusers' => 1,
        'aulimit'       => 10,
        'aufrom'        => 'Б'
    }
);

my $data = decode_json( $response->content );
my $test_scalar = $data->{query}->{allusers}[0]->{name};
my @test_array = @{ $data->{query}->{allusers} }[ 0 .. 2 ];
say "test array is @test_array";
say "-------";
dd \@test_array;
say "-------";
print Dumper \@test_array;
say "-------";
display_html( $test_array[1]->{name} );

sub display_html {
    use HTML::Entities;
    my $html_encoded = encode_entities(shift, '<>&"');
    my @html = (
        '<!DOCTYPE html>',
        '<html>',
        '<head>',
        '<meta charset="UTF-8">',
        '<title>Мой тест</title>',
        '</head>',
        '<body>',
        $html_encoded // 'Статус — ОК', # soft OR: 0 and empty string accepted
        '</body>',
        '</html>'
    );
    # to avoid "wide character" warnings:
    binmode STDOUT, ':utf8';
    print "Content-Type: text/html; charset=utf-8\n\n";
    print join("\n", @html);
}
__END__

You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular < and &.
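
For reference, here is a minimal sketch of the escaping that the quoted advice refers to. The sample value in `$name` is made up for illustration and does not come from the API. It assumes HTML::Entities is installed, as it is for the script above.

```perl
use strict;
use warnings;
use utf8;                                  # this snippet contains literal Cyrillic
use HTML::Entities qw(encode_entities);

# Hypothetical value standing in for a name returned by the API.
my $name = 'Бабкинъ <b>&</b> "Михаилъ"';

# Escape only the four HTML-special characters. Without the second
# argument, encode_entities would also turn the non-ASCII characters
# into entities.
my $escaped = encode_entities( $name, '<>&"' );

binmode STDOUT, ':encoding(UTF-8)';
print "$escaped\n";
```

The markup characters come out as `&lt;`, `&gt;`, `&amp;` and `&quot;`, while the Cyrillic letters pass through untouched. So this step should not be what turns the Russian into mojibake.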

I have tried this script both with and without your changes to the HTML display, yet his test does not render. Meanwhile, I can read it fine in Notepad and Notepad++. What was telling for me was the result when I asked for a listing on STDOUT. I'll try this abbreviated and with code tags:

Edit: showing source listing from haj's subroutine.
C:\Users\tblaz\Documents\evelyn>type 2.cyr.pl
#!/usr/bin/perl -w
use 5.011;
...
sub display_html {
    use HTML::Entities;
    my $html_encoded = encode_entities(shift, '<>&"');
    my @html = (
        '<!DOCTYPE html>',
        '<html>',
        '<head>',
        '<meta charset="UTF-8">',
        '<title>╨£╨╛╨╣ ╤é╨╡╤ü╤é</title>',
        '</head>',
        '<body>',
        $html_encoded // '╨í╤é╨░╤é╤â╤ü ΓÇö ╨₧╨Ü', # soft OR: 0 and empty string accepted
        '</body>',
        '</html>'
    );
    # to avoid "wide character" warnings:
    binmode STDOUT, ':utf8';
    print "Content-Type: text/html; charset=utf-8\n\n";
    print join("\n", @html);
}

What I see is that "Мой тест" ("My test") does not even render here. To my eye, he has all of the Russian on the hook with his data queries; it's just not getting represented correctly on the terminal that Strawberry Perl gives you. His install might be as fresh out of the box as mine.
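
One way to check whether the script on disk really is UTF-8, independent of any editor or terminal, is to decode its raw bytes strictly. This is only a sketch; the helper name `is_valid_utf8_file` is mine and not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the file's raw bytes are well-formed UTF-8.
sub is_valid_utf8_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };      # slurp without any decoding layer
    close $fh;
    $bytes = '' unless defined $bytes;       # treat an empty file as empty string
    # FB_CROAK makes decode() die on the first malformed sequence.
    return eval { decode( 'UTF-8', $bytes, FB_CROAK ); 1 } ? 1 : 0;
}

# Usage: perl check_utf8.pl 2.cyr.pl 1.hello.cyr.pl
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        is_valid_utf8_file($path) ? 'well-formed UTF-8' : 'NOT well-formed UTF-8';
}
```

A pass only says the bytes could be UTF-8. Pure ASCII passes too, and nothing in the file records which encoding was intended, which fits the point that the encoding is whatever the reading software decides it is.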

To illustrate what I think is going on, I created a smaller script:

C:\Users\tblaz\Documents\evelyn>perl 1.hello.cyr.pl
╨ƒ╤Ç╨╕╨▓╨╡╤é

Source listing:

#!/usr/bin/perl -w
use 5.016;
use utf8::all;
#binmode STDOUT, ":utf8";
say "Привет";
__END__

This one line might best be represented with a p tag:

say "Привет";

Anyways, it seems like there's some wonky IO layer going on here...
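
A guess at what the terminal is doing: the garbage above looks like correct UTF-8 bytes being displayed by a console that reads them as code page 437. The code page is my assumption; I have not confirmed it on this machine. The sketch below reproduces the effect using only the core Encode module.

```perl
use strict;
use warnings;
use utf8;                                  # this snippet contains literal Cyrillic
use Encode qw(encode decode);

my $text  = 'Привет';
my $bytes = encode( 'UTF-8', $text );      # the bytes Perl sends to STDOUT
my $seen  = decode( 'cp437', $bytes );     # assumption: the console reads them as CP437

binmode STDOUT, ':encoding(UTF-8)';
print length($bytes), " bytes\n";          # 12 bytes: two per Cyrillic letter
print "$seen\n";                           # ╨ƒ╤Ç╨╕╨▓╨╡╤é
```

If that matches, Perl's output is fine and the mismatch is on the console side. Running `chcp` in the console shows the active code page, and `chcp 65001` switches it to UTF-8, which may be worth trying before touching the script.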

Thanks all for interesting comments,


In reply to Re^2: Proper Unicode handling in Perl by Aldebaran
in thread Is there some universal Unicode+UTF8 switch? by VK
