PerlMonks  

Unicode Woes

by BigLug (Chaplain)
on Oct 01, 2004 at 08:31 UTC ( [id://395570]=perlquestion )

BigLug has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting translations from BabelFish. Everything is fine for the most part. I can get the translation out of the page.

However, there are a number of translations that return Unicode (Japanese and Greek are the ones I need).

If you look at the page source you'll see that it contains actual Unicode characters rather than numeric entities.

Wise monks, please tell me: How do I get those characters out of LWP and into a database and then out into a website?

(My script caches the result in a database, but nevertheless it outputs immediately.)

I'm no Unicode hacker (yet!) and I need real simple help with this.


Cheers!
Rick

If this is a root node: Before responding, please ensure your clue bit is set.
If this is a reply: This is a discussion group, not a helpdesk ... If the discussion happens to answer a question you've asked, that's incidental.

Replies are listed 'Best First'.
Re: Unicode Woes
by gaal (Parson) on Oct 01, 2004 at 09:04 UTC
    You decide on one single encoding to store all your data in the database. If most of your data is English, and especially if you're in a unix environment, UTF-8 is the natural choice.

    Then you have to make sure that everything you send out is marked as UTF-8: The "Content-type" HTTP header should be set to "text/html; charset=utf-8".

    And, of course, you have to make sure everything you put in your database is in UTF-8, too. You can use Perl's Encode module to do this. If you know what encoding the input is in, it is easy. If you don't, it's less easy :)
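A minimal sketch of the "known input encoding" case, using Encode's decode and encode. The ISO-8859-7 input bytes are an assumption made up for illustration; they are not what BabelFish actually returns. The idea is to decode incoming bytes into Perl characters once, then encode to UTF-8 on the way out, with the header announcing the same charset.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for illustration: the input is known to be ISO-8859-7 (Greek).
my $greek_bytes = "\xE1\xE2\xE3";    # alpha, beta, gamma in ISO-8859-7

# Step 1: bytes in a known encoding -> Perl character string.
my $chars = decode('ISO-8859-7', $greek_bytes);

# Step 2: character string -> UTF-8 bytes, the single encoding chosen for storage.
my $utf8_bytes = encode('UTF-8', $chars);

# Step 3: announce the same encoding in the HTTP header.
my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";

binmode STDOUT;    # the data is already encoded, so write raw bytes
print $header, $utf8_bytes, "\n";
```

The same $utf8_bytes value is what would go into the database, so that storage and output agree on one encoding.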

      I've tried that ... I get the data from LWP, then send it through DBI to Postgres. However, it ends up as a string of � characters. More importantly, when I send the received string out to the browser (with the header as you say), I similarly get nothing appearing.

      Cheers!
      Rick
        There are many links in this chain, and if things don't work as a whole you have to go over them link by link to see where the problem(s) happen.

        View Unicode in hex offers a nice way of seeing what your actual data is. Adapt the code there to print what you get from LWP. Then make sure what gets fetched from the database is still UTF-8. Finally, don't trust your browser: download the page that the web server handed you and see what information is actually there. (It might be that your server is sending *two* Content-type headers, in which case only one of them (the wrong one, by Murphy's law) is honored by your browser.)
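For checking each link in the chain, something along these lines may help. This is only a sketch and not the code from the "View Unicode in hex" node; dump_bytes and dump_codepoints are hypothetical helper names. One shows the raw bytes of a byte string, the other the code points of a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: hex dump of the bytes in a byte string.
sub dump_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Hypothetical helper: code points of a (decoded) character string.
sub dump_codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Sample data: the UTF-8 encoding of GREEK SMALL LETTER ALPHA.
my $raw = "\xCE\xB1";

print 'bytes:       ', dump_bytes($raw), "\n";
print 'code points: ', dump_codepoints(decode('UTF-8', $raw)), "\n";
```

Run the same two dumps on what LWP returns, on what comes back out of the database, and on the downloaded page; the first place the bytes change is the broken link.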

        Oh, you also need to tell DBD::Pg that your data needs to be treated as UTF-8. Check out the pg_enable_utf8 attribute. (If you move to mysql one day, contact me for a patch giving similar functionality.)
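A sketch of passing that attribute at connect time. pg_enable_utf8 is the DBD::Pg attribute named above; the helper names, DSN, user and password are placeholders I made up, and how the attribute behaves by default depends on your DBD::Pg version, so check its documentation.

```perl
use strict;
use warnings;

# Connection attributes, including the DBD::Pg UTF-8 switch.
sub pg_utf8_attrs {
    return {
        RaiseError     => 1,    # die on database errors instead of failing silently
        AutoCommit     => 1,
        pg_enable_utf8 => 1,    # ask DBD::Pg to treat string data as UTF-8
    };
}

# Hypothetical helper; the arguments are placeholders for your own settings.
sub connect_utf8 {
    my ($dsn, $user, $password) = @_;
    require DBI;    # loaded lazily so this file compiles without a driver installed
    return DBI->connect($dsn, $user, $password, pg_utf8_attrs());
}

# Example call (not run here, it needs a reachable PostgreSQL server):
# my $dbh = connect_utf8('dbi:Pg:dbname=translations', 'someuser', 'secret');
```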

        I suggested many things above but I recommend you tackle them one at a time, not all at once. That way if the first link in the chain was the only bad one you don't waste your time with the others.

Re: Unicode Woes
by BigLug (Chaplain) on Oct 01, 2004 at 09:47 UTC
    For those who want some code to play with:
        #!/usr/bin/perl
        use URI::Escape;
        use Encode;
        require LWP::UserAgent;

        my $escape = uri_escape(join('. ', @ARGV));
        my $ua = LWP::UserAgent->new;
        my $response = $ua->get("http://babelfish.altavista.com/tr?trtext=$escape&lp=en_ja");

        if ($response->is_success) {
            $result = $response->content;  # or whatever
        }
        else {
            die $response->status_line;
        }

        Encode::_utf8_on( $result );
        my ($translation) = $result =~ /\Q<td bgcolor=white class=s><div style=padding:10px;>\E(.+?)\Q<\/div>\E/;
        $original = $translation;
        $translation=~s/([^[:ascii:]])/sprintf("\\x{%.4x}",ord $1)/ge;
        print $translation ."\n". length($original) ."\n". ord(substr($original,0,1));
    Run this (at least on my machine) and $translation has no visible contents, yet it has a length of 5!

    If your machine gives you something sensible, please let me know.

    (You can probably remove the Encode calls there .. that was just making sure that the resulting string *was* in utf8 according to perl)
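For comparison, here is roughly what "something sensible" would look like if the response really did contain UTF-8. This sketch skips the network entirely and uses a hard-coded byte string as a stand-in for $response->content (an assumption for illustration). It also uses Encode's decode rather than flipping the internal flag with _utf8_on, since decode checks the bytes as it converts them.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for $response->content: the UTF-8 bytes of U+6728 (a kanji for "tree").
my $bytes = "\xE6\x9C\xA8";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Same escaping as in the script: show non-ASCII characters as \x{....}.
(my $escaped = $chars) =~ s/([^[:ascii:]])/sprintf("\\x{%.4x}", ord $1)/ge;

print $escaped, "\n";                     # \x{6728}
print length($chars), "\n";               # 1
print ord(substr($chars, 0, 1)), "\n";    # 26408
```

If the real script prints an empty line and an ord of 0 instead, the damage has already happened before any decoding step.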


    Cheers!
    Rick
      $translation has no visible contents
      Using Data::Dumper and redirecting output into a file shows that it consists of ASCII NULs. Already $response as created by LWP is wrong that way. I tried with LWP::Simple, same thing. use open ':utf8'; does not help.
      Hrrm. I'm not well versed with LWP stuff. I went to that web site with a browser, typed in an English word and got back a Japanese word (in utf8) -- that's fine (the page source had nothing strange about it). I tried wget from the command line with the url string that you would post to get that same translation:
      $ wget -O /tmp/junk 'http://babelfish.altavista.com/tr?trtext=tree&lp=en_ja'
      and I think wget gave me the same output that went to the browser -- that's fine. (But when I tried again later, it gave me a null byte where the Japanese should have been. Having overwritten the original try, I can't be sure now.)

      When I run your test script, $translation ends up with a null byte. I tried printing $result to STDERR, and redirected that to a file. The file (i.e. the full web page content returned by LWP->get) had null bytes where the browser (and maybe wget) output had a Japanese character.

      So I'm guessing there is something wrong with how you are making or sending the request to the server, but I can't imagine what to try next in order to figure out the problem and fix it. Good luck.
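To confirm the NUL bytes without Data::Dumper or redirecting to a file, a small check like this could be run on the raw content. It is a sketch only: nul_offsets is a hypothetical helper and the sample string is invented, standing in for whatever $response->content holds. It says where the NULs are, not why the server sent them.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every NUL in the content.
sub nul_offsets {
    my ($content) = @_;
    my @offsets;
    while ($content =~ /\x00/g) {
        push @offsets, pos($content) - 1;    # pos() points just past the match
    }
    return @offsets;
}

# Invented sample standing in for the fetched page.
my $sample = "<div>\x00\x00</div>";
my @found  = nul_offsets($sample);

printf "%d NUL byte(s) at offsets: %s\n", scalar @found, join(', ', @found);
```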
