
URIs and UTF8

by OverlordQ (Hermit)
on Apr 08, 2009 at 06:20 UTC ( [id://756242] : perlquestion )

OverlordQ has asked for the wisdom of the Perl Monks concerning the following question:

Alright, in my Perl coding I've done some work with Wikipedia. One thing you'll find on Wikipedia is plenty of Unicode. Unfortunately, I've hit some snags along the way. Since I'm not conversant with all the Black Magic(tm) of character encodings, when I mention Unicode I most likely mean its UTF8 encoding.

Let's establish some facts:

  1. Titles can be Unicode strings
  2. Example title is: Rīga-Herson-Astrahan
  3. This should be escaped to: R%C4%ABga-Herson-Astrahan
  4. When not marked as UTF8, it is escaped correctly
  5. When marked as UTF8, it is escaped incorrectly
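The escaping in fact 3 can be checked in isolation. This is a minimal sketch, assuming URI::Escape is installed; the variable names are illustrative and not taken from the script below.

```perl
use strict;
use warnings;
use URI::Escape qw( uri_escape_utf8 );

# The title as a Perl character string (hypothetical variable name).
# U+012B is LATIN SMALL LETTER I WITH MACRON.
my $title = "R\x{12B}ga-Herson-Astrahan";

# uri_escape_utf8 encodes the characters as UTF-8,
# then percent-escapes the resulting bytes.
my $escaped = uri_escape_utf8($title);

print "$escaped\n";    # R%C4%ABga-Herson-Astrahan
```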

Stepping through the code I have provided below, you eventually get to URI at line 77:

  DB<18> x $str
0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/Rīga-Herson-Astrahan&action=query&rvlimit=20'
The first run through the regex, it eats a character:
  DB<20> p $1
▒
  DB<21> x unpack("U*",$1);
0  196
Odd. Oh well, let's let the regex finish until we get to line 78. Now let's see what the URL contains:
  DB<24> x $str
0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20'
Hurm, not fun. That's not what we should have got. Is this a bug, or should I not be telling perl that these strings may contain UTF8 characters? Example below. (It abuses the pre tag since the code tag eats the characters.)
#!/usr/bin/perl
use strict;
use warnings;
use lib '/home/overlordq/lib';

use LWP::UserAgent;
use Data::Dumper;
use DBI;
use wikidb;

$|++;

my $ua  = LWP::UserAgent->new();
my $dbh = DBI->connect("DBI:mysql:database=enwiki_p;host=sql-s1",$user,$password);

my $query = "SELECT page_title FROM page WHERE page_title LIKE 'OverlordQ%' AND page_id = '22325873'";
my $sth = $dbh->prepare($query);
$sth->execute();

my $title;
while(my $ref = $sth->fetchrow_hashref() ) {
    $title = $ref->{'page_title'};
}

print "Title: $title\n";
if( isUTF($title) ) {
    print "\tis UTF8\n";
} else {
    print "\tis not UTF8\n";
}

my $res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');
my $uriUsed = $res->request->uri->as_string;
print "URI: $uriUsed\n";

if( isUTF($title) ) {
    print "\tis already UTF8\n";
} else {
    utf8::upgrade($title);
    if( isUTF($title) ) {
        print "$title\n\tis now UTF8\n";
    }
}

$res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');
$uriUsed = $res->request->uri->as_string;
print "URI: $uriUsed\n";
print "Title: $title\n";

sub isUTF {
    my $string = shift;
    return utf8::is_utf8($string);
}
Output:
Title: OverlordQ/Rīga-Herson-Astrahan
        is not UTF8
URI: ... User:OverlordQ/R%C4%ABga-Herson-Astrahan&action=query&rvlimit=20
OverlordQ/Rīga-Herson-Astrahan
        is now UTF8
URI: ... User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20
Title: OverlordQ/Rīga-Herson-Astrahan
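The two URIs in the output above can be reproduced without the database or LWP. The sketch below is only an approximation of the effect, not a claim about what LWP or URI do internally; it assumes URI::Escape is installed and uses made-up variable names.

```perl
use strict;
use warnings;
use URI::Escape qw( uri_escape uri_escape_utf8 );

# What the database hands back: the UTF-8 *bytes* of the title, not characters.
my $bytes = "R\xC4\xABga";

# Escaping the bytes as they are gives the expected result.
my $as_bytes = uri_escape($bytes);         # R%C4%ABga

# Treating each byte as a character and UTF-8 encoding it a second time
# gives the mangled form seen in the second request.
my $as_chars = uri_escape_utf8($bytes);    # R%C3%84%C2%ABga

print "$as_bytes\n$as_chars\n";
```

The second result is the classic double-encoding pattern: %C3%84 is the UTF-8 encoding of U+00C4 and %C2%AB is the UTF-8 encoding of U+00AB.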
Now for my revised node:

Update: Since it looks like reapage is out of the picture, I'll put back what I can recollect from memory. Yes, it (ab)uses pre since code eats the characters.

Now why is this thar LWP mangling my UTF8 strings? Here is mah source:

#!/usr/bin/perl
use strict;
use warnings;
use lib '/home/overlordq/lib';

use LWP::UserAgent;
use Data::Dumper;
use DBI;
use wikidb;

$|++;

my $ua  = LWP::UserAgent->new();
my $dbh = DBI->connect("DBI:mysql:database=enwiki_p;host=sql-s1",$user,$password);

my $query = <<SQL;
SELECT p1.page_namespace AS namespace, p1.page_title AS title,
       rd_namespace, rd_title
FROM redirect AS rd
JOIN page p1 ON (rd.rd_from=p1.page_id)
LEFT JOIN page AS p2 ON (rd_namespace=p2.page_namespace AND rd_title=p2.page_title)
WHERE rd_namespace >= 0
  AND p2.page_namespace IS NULL
  AND p1.page_title LIKE 'OverlordQ%'
SQL

my $sth = $dbh->prepare($query);
$sth->execute();

my $title;
while(my $ref = $sth->fetchrow_hashref() ) {
    $title = $ref->{'title'};
    print "$title\n";

    my $prefix  = 'http://en.wikipedia.org/w/api.php?prop=revisions&format=json&titles=User:';
    my $postfix = '&action=query&rvlimit=20';

    if( isUTF($title) ) {
        print "\tis UTF8\n";
    } else {
        print "\tis not UTF8\n";
    }

    my $res = $ua->get($prefix.$title.$postfix);
    my $url = $res->request->uri->as_string;
    print "URI: $url\n";

    if( isUTF($title) ) {
        print "\tis already UTF8\n";
    } else {
        utf8::upgrade($title);
        if( isUTF($title) ) {
            print "$title\n\tis now UTF8\n";
        }
    }

    $res = $ua->get($prefix.$title.$postfix);
    $url = $res->request->uri->as_string;
    print "URI: $url\n";
}

sub isUTF {
    my $string = shift;
    return utf8::is_utf8($string);
}
Output is:
OverlordQ/Rīga-Herson-Astrahan
        is not UTF8
URI: ... /R%C4%ABga-Herson-Astrahan&action=query&rvlimit=20
OverlordQ/Rīga-Herson-Astrahan
        is now UTF8
URI: ... /R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20
Of course, what I should have done, had I not merely skimmed the Encode and perlunitut pages, was add:
binmode STDOUT, ':utf8';
It doesn't help that although my terminal handles UTF8, perl won't give it to me. So let's add that at the top and see what we get:
OverlordQ/Rīga-Herson-Astrahan
        is not UTF8
URI: ... /R%C4%ABga-Herson-Astrahan&action=query&rvlimit=20
OverlordQ/Rīga-Herson-Astrahan
        is now UTF8
URI: ... /R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20
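What a :utf8 output layer does to a byte string that was never decoded can be seen in isolation. A minimal sketch, writing to an in-memory buffer instead of STDOUT so the resulting bytes can be inspected; the variable names are illustrative.

```perl
use strict;
use warnings;

# The UTF-8 bytes of U+012B, not yet decoded into a character.
my $bytes = "\xC4\xAB";

# Print through a :utf8 layer into an in-memory buffer.
my $buf = '';
open my $fh, '>:utf8', \$buf or die "open: $!";
print {$fh} $bytes;
close $fh;

# Show the bytes that actually came out of the layer.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
print "$hex\n";    # C3 84 C2 AB
```

Each input byte is treated as one character (U+00C4 and U+00AB) and encoded separately, which matches the %C3%84%C2%AB seen in the mangled URI.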
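Ah hah, now we see why it's encoding to %C3%84%C2%AB instead of %C4%AB: the title is still a string of raw bytes, so the two bytes C4 AB get treated as the two characters U+00C4 and U+00AB, and each of those is then UTF8 encoded all over again. Yay for character encoding. So let's finally throw in a
$title = decode('utf8',$title);
underneath where it gets the title: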
OverlordQ/Rīga-Herson-Astrahan
        is UTF8
URI: ... /R%C4%ABga-Herson-Astrahan&action=query&rvlimit=20
        is already UTF8
URI: ... /R%C4%ABga-Herson-Astrahan&action=query&rvlimit=20
So remember kids, flipping the flag without decoding the bytes first is bad, mkay.
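The difference between upgrading and decoding can be made concrete. A minimal sketch using only the core utf8 functions and Encode, with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Two bytes: the UTF-8 encoding of U+012B, as fetched from the database.
my $bytes = "\xC4\xAB";

# utf8::upgrade changes only perl's internal storage; the string
# still holds the same two characters, U+00C4 and U+00AB.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# decode interprets the bytes as UTF-8 and yields the single character U+012B.
my $decoded = decode('UTF-8', $bytes);

printf "upgraded: %d chars, first is U+%04X\n", length($upgraded), ord($upgraded);
printf "decoded:  %d chars, first is U+%04X\n", length($decoded),  ord($decoded);
```

Both strings report true from utf8::is_utf8, so the flag alone says nothing about whether the contents were decoded.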

Revised source:

#!/usr/bin/perl
use strict;
use warnings;
use lib '/home/overlordq/lib';

use LWP::UserAgent;
use Data::Dumper;
use DBI;
use Encode;
use wikidb;

$|++;
binmode STDOUT, ":utf8";

my $ua  = LWP::UserAgent->new();
my $dbh = DBI->connect("DBI:mysql:database=enwiki_p;host=sql-s1",$user,$password);

my $query = <<SQL;
SELECT p1.page_namespace AS namespace, p1.page_title AS title,
       rd_namespace, rd_title
FROM redirect AS rd
JOIN page p1 ON (rd.rd_from=p1.page_id)
LEFT JOIN page AS p2 ON (rd_namespace=p2.page_namespace AND rd_title=p2.page_title)
WHERE rd_namespace >= 0
  AND p2.page_namespace IS NULL
  AND p1.page_title LIKE 'OverlordQ%'
SQL

my $sth = $dbh->prepare($query);
$sth->execute();

my $title;
while(my $ref = $sth->fetchrow_hashref() ) {
    $title = $ref->{'title'};
    $title = decode('utf8',$title);
    print "$title\n";

    my $prefix  = 'http://en.wikipedia.org/w/api.php?prop=revisions&format=json&titles=User:';
    my $postfix = '&action=query&rvlimit=20';

    if( isUTF($title) ) {
        print "\tis UTF8\n";
    } else {
        print "\tis not UTF8\n";
    }

    my $res = $ua->get($prefix.$title.$postfix);
    my $url = $res->request->uri->as_string;
    print "URI: $url\n";

    if( isUTF($title) ) {
        print "\tis already UTF8\n";
    } else {
        utf8::upgrade($title);
        if( isUTF($title) ) {
            print "$title\n\tis now UTF8\n";
        }
    }

    $res = $ua->get($prefix.$title.$postfix);
    $url = $res->request->uri->as_string;
    print "URI: $url\n";
}

sub isUTF {
    my $string = shift;
    return utf8::is_utf8($string);
}

Replies are listed 'Best First'.
Re: URIs and UTF8
by Zen (Deacon) on Apr 08, 2009 at 13:41 UTC
    No one can learn from the mistake because you have erased the question text. Don't do that.
      it hath been restored
        You don't understand. This site is one part help site and one part giant FAQ. That means the database of questions and answers, encompassing many years' worth of free help from professionals, is available to the world. It's no good if someone asks their question, gets their free help, and then covers their tracks so that nobody else can learn from it, too.

        It's true the docs help with a lot of questions, but it is possible to misunderstand the docs or get confused by them. A monk even took the time to answer you. That person's answer and time spent are now diminished because there's no context. The point is that this isn't about you and your issues with utf8 or reading docs; it's about the way this site works and how it benefits others.
Re: URIs and UTF8
by ikegami (Patriarch) on Apr 08, 2009 at 06:38 UTC

    You're feeding an invalid URL to LWP, so unexpected results are to be expected. I bet it works fine when you provide a valid URL.

    use Encode      qw( encode decode );
    use URI::Escape qw( uri_escape );

    # From DB
    my $title = decode('UTF-8', "OverlordQ/R\x{C4}\x{AB}ga-Herson-Astrahan");

    # Escape each URL component.
    my @uri_components =
        map { uri_escape(encode('UTF-8', $_)) }
        split qr{/}, $title;

    # Prints OverlordQ/R%C4%ABga-Herson-Astrahan
    print(join('/', @uri_components), "\n");

    uri_escape(encode('UTF-8', $_)) can be written as uri_escape_utf8($_)
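The equivalence of the two spellings can be checked side by side. A minimal sketch, assuming Encode and URI::Escape are installed; the variable names $long and $short are made up for the comparison.

```perl
use strict;
use warnings;
use Encode      qw( encode );
use URI::Escape qw( uri_escape uri_escape_utf8 );

# The already decoded title, as a character string.
my $title = "OverlordQ/R\x{12B}ga-Herson-Astrahan";

# Explicit form: encode to UTF-8 bytes, then percent-escape each path component.
my $long  = join '/', map { uri_escape(encode('UTF-8', $_)) } split qr{/}, $title;

# Shorthand: uri_escape_utf8 does the encoding step itself.
my $short = join '/', map { uri_escape_utf8($_) } split qr{/}, $title;

print "$long\n$short\n";    # both print OverlordQ/R%C4%ABga-Herson-Astrahan
```

Splitting on the slash first keeps it as a literal path separator; escaping the whole title in one call would turn it into %2F.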

    Original content of the parent

    Alright, in my Perl coding I've done some work with Wikipedia. One thing you'll find on Wikipedia is plenty of Unicode. Unfortunately, I've hit some snags along the way. Since I'm not conversant with all the Black Magic(tm) of character encodings, when I mention Unicode I most likely mean its UTF8 encoding.

    Let's establish some facts:

    1. Titles can be Unicode strings
    2. Example title is: Rīga-Herson-Astrahan
    3. This should be escaped to: R%C4%ABga-Herson-Astrahan
    4. When not marked as UTF8, it is escaped correctly
    5. When marked as UTF8, it is escaped incorrectly

    Stepping through the code I have provided below, you eventually get to URI at line 77:

      DB<18> x $str
    0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/Rīga-Herson-Astrahan&action=query&rvlimit=20'
    The first run through the regex, it eats a character:
      DB<20> p $1
    ▒
      DB<21> x unpack("U*",$1);
    0  196
    
    Odd. Oh well, let's let the regex finish until we get to line 78. Now let's see what the URL contains:
      DB<24> x $str
    0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20'
    Hurm, not fun. That's not what we should have got. Is this a bug, or should I not be telling perl that these strings may contain UTF8 characters? Example below. (It abuses the pre tag since the code tag eats the characters.)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use lib '/home/overlordq/lib';

    use LWP::UserAgent;
    use Data::Dumper;
    use DBI;
    use wikidb;

    $|++;

    my $ua  = LWP::UserAgent->new();
    my $dbh = DBI->connect("DBI:mysql:database=enwiki_p;host=sql-s1",$user,$password);

    my $query = "SELECT page_title FROM page WHERE page_title LIKE 'OverlordQ%' AND page_id = '22325873'";
    my $sth = $dbh->prepare($query);
    $sth->execute();

    my $title;
    while(my $ref = $sth->fetchrow_hashref() ) {
        $title = $ref->{'page_title'};
    }

    print "Title: $title\n";
    if( isUTF($title) ) {
        print "\tis UTF8\n";
    } else {
        print "\tis not UTF8\n";
    }

    my $res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');
    my $uriUsed = $res->request->uri->as_string;
    print "URI: $uriUsed\n";

    if( isUTF($title) ) {
        print "\tis already UTF8\n";
    } else {
        utf8::upgrade($title);
        if( isUTF($title) ) {
            print "$title\n\tis now UTF8\n";
        }
    }

    $res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');
    $uriUsed = $res->request->uri->as_string;
    print "URI: $uriUsed\n";
    print "Title: $title\n";

    sub isUTF {
        my $string = shift;
        return utf8::is_utf8($string);
    }
    Output:
    Title: OverlordQ/Rīga-Herson-Astrahan
            is not UTF8
    URI: http://...?...&titles=User:OverlordQ/R%C4%ABga-Herson-Astrahan&...
    OverlordQ/Rīga-Herson-Astrahan
            is now UTF8
    URI: http://...?...&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&...
    Title: OverlordQ/Rīga-Herson-Astrahan

    Update: Shortened URLs in PRE tags as per reply.

      Might want to nuke the long unneeded URLs like I did in the original post; all that matters is the Unicode text.