
thread drift is allowed

by daxim (Curate)
on Sep 02, 2019 at 12:19 UTC


in reply to Re^4: Is there some universal Unicode+UTF8 switch?
in thread Is there some universal Unicode+UTF8 switch?

>if it's ok to continue in the same thread then I will continue here
Thread drift is allowed. For good netiquette, also change the title in the reply form.

Proper Unicode handling in Perl
by VK (Novice) on Sep 02, 2019 at 14:00 UTC

    >Thread drift is allowed. For good netiquette, also change the title in the reply form.
    OK then. First of all, I am not a staff developer at Wikipedia, just one of the volunteer editors. We needed a script for a set of users who want to be notified about upcoming internal elections. It acts like a daemon: every 24 hours it checks a certain place and sends a notification if there is something.
    tools.wmflabs.org offers the language of your choice (Perl, PHP, Python, C#, you name it) in the latest stable versions. I don't like Python, have no idea about C#, and remember something about Perl, so I went with Perl.

    To be clear, the list=allusers query has nothing to do with the actual task; it only shows the exact data format to send and to expect. The full MediaWiki API help is here: https://ru.wikipedia.org/w/api.php?action=help&uselang=en

    Now... The script has to handle Unicode/UTF-8 literals in the code, so I needed use utf8; It also has to output them in HTML, so I needed binmode STDOUT, ':utf8';
    It also has to receive JSON, decode it, slice it, and do string comparison, replacement and everything else, all with Cyrillic in the data. I dropped all the (en|de)coding calls that this thread called unnecessary and came to:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use utf8;
    use Encode;
    
    use LWP::UserAgent;
    use HTTP::Request::Common;
    use HTTP::Cookies;
    
    use JSON;
    
    my $browser = LWP::UserAgent->new;
    
    # they ask for a descriptive user agent, not the LWP default
    # w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
    $browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');
    
    # I need cookie exchange enabled for auth
    # here it doesn't matter, but to give the full LWP picture:
    $browser->cookie_jar({});
    
    # very few queries can be done by GET - most of MediaWiki requires POST
    # so I POST everywhere rather than remember where GET is allowed:
    my $response = $browser->request(POST 'https://ru.wikipedia.org/w/api.php',
            {
                'format' => 'json',
                'formatversion' => 2,
                'errorformat' => 'bc',
                    
                'action' => 'query',
                'list' => 'allusers',
                'auactiveusers' => 1,
                'aulimit' => 10,
                'aufrom' => 'Б'
            }
        );
    
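    # stop on HTTP errors instead of feeding an error page to decode_json
    die 'API request failed: ' . $response->status_line unless $response->is_success;
    
    # content() is raw UTF-8 bytes, which is what decode_json expects
    my $data = decode_json($response->content);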
    
    my $test_scalar = $data->{query}->{allusers}[0]->{name};
    
    my @test_array = @{$data->{query}->{allusers}}[0..2];
    
    display_html($test_array[1]->{name});
    
    
    sub display_html {
    
        my @html = (
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '<meta charset="UTF-8">',
            '<title>Мой тест</title>',
            '</head>',
            '<body>',
            shift // 'Статус — ОК', # soft OR: 0 and empty string accepted
            '</body>',
            '</html>'
        );
        
        # to avoid "wide character" warnings:
        binmode STDOUT, ':utf8';
        
        print "Content-Type: text/html; charset=utf-8\n\n";
        
        print join("\n", @html);
    }
    

    Is there anything that might go badly wrong concerning Cyrillic in Unicode/UTF-8?

      Nice progress! You don't even need the Encode module :)

      This is a pretty straightforward way to deal with Unicode and UTF-8.

      The remaining mentions of UTF-8 in your code are all justified:

      • use utf8; tells Perl that your source code comes with UTF-8 encoded literals.
      • binmode STDOUT, ':utf8'; makes Perl spit out the strings in @html properly UTF-8 encoded. You can encode any Unicode character in UTF-8, so no problems here.
      • Content-Type: text/html; charset=utf-8 tells the browser that it has to handle the byte stream as UTF-8 and decode the characters accordingly.
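
The three points above boil down to one rule: bytes at the edges, characters inside the program. The following is only a minimal sketch of that rule, not the script from this thread. It assumes the core module JSON::PP as a stand-in for JSON (both export decode_json), and the JSON string and the helper first_user_name are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                       # string literals in this file are UTF-8
use Encode qw(encode);
use JSON::PP qw(decode_json);   # core module; stands in for JSON here

# Hypothetical stand-in for $response->content: raw UTF-8 bytes off the wire.
my $raw_json = encode('UTF-8', '{"query":{"allusers":[{"name":"Борис"}]}}');

# decode_json expects UTF-8 bytes and returns Perl character strings.
sub first_user_name {
    my ($json_bytes) = @_;
    my $data = decode_json($json_bytes);
    return $data->{query}{allusers}[0]{name};
}

my $name = first_user_name($raw_json);

# Inside the program the name is a character string.
my $n_chars = length $name;                    # counts characters
my $n_bytes = length encode('UTF-8', $name);   # counts encoded bytes

binmode STDOUT, ':encoding(UTF-8)';            # encode once, on output
print "$name: $n_chars characters, $n_bytes bytes\n";
```

Saved as UTF-8, this should print "Борис: 5 characters, 10 bytes". The sketch uses the ':encoding(UTF-8)' layer, which is the stricter spelling; the ':utf8' layer from the original post also encodes the output.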

      There are two caveats:

      • Obviously, you need to save your source code as UTF-8.
      • You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular < and &. This has nothing to do with Unicode, though. I'm adding the relevant stuff to your sub display_html:
        sub display_html {
            use HTML::Entities;
            my $html_encoded = encode_entities(shift, '<>&"');
            my @html = (
                '<!DOCTYPE html>',
                '<html>',
                '<head>',
                '<meta charset="UTF-8">',
                '<title>Мой тест</title>',
                '</head>',
                '<body>',
                $html_encoded // 'Статус — ОК', # soft OR: 0 and empty string accepted
                '</body>',
                '</html>'
            );
            
            # to avoid "wide character" warnings:
            binmode STDOUT, ':utf8';
            
            print "Content-Type: text/html; charset=utf-8\n\n";
            
            print join("\n", @html);
        }
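
To see what the second argument of encode_entities does to mixed input, here is a small sketch. The helper name escape_for_html and the sample string are hypothetical; the call itself is the same one used in the sub above.

```perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Escape only the characters that are special in HTML; everything else,
# including Cyrillic, passes through unchanged.
sub escape_for_html {
    my ($text) = @_;
    return encode_entities($text, '<>&"');
}

# Hypothetical user name containing markup characters.
my $escaped = escape_for_html('Иван <b>&</b> "Ко"');

binmode STDOUT, ':encoding(UTF-8)';
print "$escaped\n";
```

This should print: Иван &lt;b&gt;&amp;&lt;/b&gt; &quot;Ко&quot;. Without the second argument, encode_entities would also turn the Cyrillic letters into numeric entities, which a browser still displays correctly but which makes the page source harder to read.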
        

        This thread is getting long, and I have a couple screenfuls of comments, output, and source, which I'll put between readmore tags to save scrollfingers...

        Thanks all for interesting comments,
