
De-googleizing translation scripts

by Aldebaran (Curate)
on Nov 05, 2022 at 03:12 UTC ( [id://11147980] : perlquestion )

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

I have to divide my life into things that I can change and those I can't, and I wasn't prepared for how quickly I would need to look for Google alternatives when they alleged I was a computer and locked me out of my account. I had to realize how dished out to them I truly am. I did not choose for this to be the year I would have to talk about Google, and would be embarrassed for others to know how much this feels like a divorce or jilted lover saga. Google and I: we had some good times, but we've grown apart. Having to switch from my familiar Android just takes my breath away, but I'll come out the other side. So, I'll go with Apple until they have some outrageous untenability, and so it goes with the corporate goliaths: pick your poison and exposure.

I rely on Google heavily in translation scripts and tried to work up a reasonably-sized SSCCE for it. This script worked better before it worked worse, but it should suffice as a starting point:

    #!/usr/bin/perl
    use v5.030;    # strictness implied
    use warnings;
    use Path::Tiny;

    my $file_in  = path("/home/fritz/Desktop/1.enchanto.txt");
    my $file_out = path('/home/fritz/Desktop/1.enc_trans.txt');
    my $lang     = 'es';
    my $command  = 'trans -b :$lang "$para">$buffer';
    my $guts     = $file_in->slurp_utf8;
    my @spl      = split('\n', $guts);
    say @spl;
    my $buffer;
    for my $para (@spl){
        say $para;
        my $trans = system("$command");
        say $buffer;
        #$file_out->spew_utf8( $para, $trans );
    }
    __END__
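One reason the script above cannot print a translation: `system` returns the exit status of the command, not its output, and the single-quoted `$command` string is never interpolated by Perl. A minimal sketch of capturing a command's output instead, assuming a Unix-like system; `run_capture` is a hypothetical helper name, not part of any module mentioned in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without going through a shell
# and return everything it wrote to standard output.
sub run_capture {
    my (@cmd) = @_;

    # List-form pipe open bypasses the shell, so arguments such as a
    # paragraph of text need no quoting or escaping.
    open( my $pipe, '-|', @cmd ) or die "Cannot run $cmd[0]: $!";
    my $out = do { local $/; <$pipe> };    # slurp all output (raw bytes)
    close($pipe) or die "$cmd[0] failed, wait status $?";
    return $out;
}

# Possible use with the trans tool from the post (assumes trans is on PATH):
#   my $translated = run_capture( 'trans', '-b', ":$lang", $para );
```

The returned string holds undecoded bytes, so it would still need decoding (for example with Encode) before being mixed with text read via `slurp_utf8`.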

What I typically do is use Perl to wrap the trans call from soimort's translate-shell on GitHub. I'm not wild about this dependency anymore, and the resulting calling code is hideous, but it has been reliable in at least this form:

    foreach (@matching2) {
        my $eng_path = path( $vars{eng_captions}, $_ );
        say $fh "##$_##";
        my $rus_path = path( $vars{rus_captions}, $_ )->touchpath;
        say "rus_path is $rus_path";
        my $content = path($eng_path)->slurp_utf8;
        $content =~ s/^\s+|\s+$//g;
        say $fh "$content";
        system("trans :$lang file://$eng_path >$rus_path");
    }

I'd like to get away both from this trans package and google, so I'm wondering what else is out there. I think this answer has changed over the last year or so with the balkanization of the internet. For example, in the following snippet, I think the bing one succeeds and the yandex has failed (from my perspective), and this, as I say, over the course of the last year:

    print "Get other translations(y/n)?: ";
    my $prompt = <STDIN>;
    chomp $prompt;
    if ( lc($prompt) eq 'y' ) {    # accept "y" or "Y"
        my @translators = qw /yandex bing/;
        for my $remote (@translators) {
            my $trans_munge = path( $vars{translations}, "$remote." . $munge );
            ## use trans shell
            say "getting translation from $remote";
            system("trans :$lang -e $remote file://$in_path >$trans_munge");
        }
    }

So let me ask the question like this: given that I want to get away from the trans package and Google, what options do I have for machine translation with Perl?

Thanks for your comment,

Re: De-googleizing translation scripts
by kikuchiyo (Hermit) on Nov 05, 2022 at 22:07 UTC

    Nowadays DeepL is considered to produce better results than Google Translate. It even has an HTTP API, for which you can register for free (for some value of free): https://www.deepl.com/pro-api?cta=header-pro-api

    I don't know about this "trans" command of yours, but coding a wrapper script for a HTTP API is trivial in Perl.

      Nowadays DeepL is considered to produce better results than Google Translate. It even has an HTTP API, for which you can register for free (for some value of free):

      Thx for your reply, kikuchiyo, I think this is gonna work out for me. They do make you put a credit card on record, but I'm willing to offer that kind of skin in this game. DeepL seems more trustworthy than Google. I'm excited to see what capabilities this service can provide.

      I don't know about this "trans" command of yours, but coding a wrapper script for a HTTP API is trivial in Perl.

      Well, I don't know about "trivial." Maybe for corion and bliako, but I'm a garden-variety human who fumbles the ball and needs to consult. I was proud that I remembered corion's curl converter, from which I got this:

          #!perl
          use strict;
          use warnings;
          use HTTP::Tiny;

          my $ua = HTTP::Tiny->new( 'verify_SSL' => '1' );
          my $res = $ua->request(
              'POST' => 'https://api-free.deepl.com/v2/translate',
              {
                  headers => {
                      'Authorization'  => 'DeepL-Auth-Key redacted',
                      'Content-Length' => '37',
                      'Accept'         => '*/*',
                      'Content-Type'   => 'application/x-www-form-urlencoded',
                      'User-Agent'     => 'curl/7.55.1'
                  },
                  content => "text=Hello\x252C\x2520world!&target_lang=DE"
              },
          );
          __END__
          Created from curl command line
          curl -X POST 'https://api-free.deepl.com/v2/translate' -H 'Authorization: DeepL-Auth-Key redacted' -d 'text=Hello%2C%20world!' -d 'target_lang=DE'

      But I run into trouble decoding the json:

          fritz@laptop:~/Documents$ ./3.trans.pl
          Hello neighbor on Watercress lane,
          {"translations":[{"detected_source_language":"EN","text":"Hola vecino de Watercress lane,"}]}content is {"translations":[{"detected_source_language":"EN","text":"Hola vecino de Watercress lane,"}]}
          data is HASH(0x55fbf7cf4140)
          ...
          Anyways, I start getting letters saying that I have not complied with this declaration, which had the bizarre predicate that we had to come to their residence to prove that we had complied. One thing I can promise you: I will never cross their threshold, because I don't want to know them at all based on what they stuffed into my mailbox.
          {"translations":[{"detected_source_language":"EN","text":"De todos modos, empiezo a recibir cartas diciendo que no he cumplido con esta declaración, que tenía el extraño predicado de que teníamos que ir a su residencia para demostrar que habíamos cumplido. Una cosa puedo prometer: Nunca cruzaré su umbral, porque no quiero conocerlos en absoluto basándome en lo que me metieron en el buzón."}]}content is {"translations":[{"detected_source_language":"EN","text":"De todos modos, empiezo a recibir cartas diciendo que no he cumplido con esta declaración, que tenía el extraño predicado de que teníamos que ir a su residencia para demostrar que habíamos cumplido. Una cosa puedo prometer: Nunca cruzaré su umbral, porque no quiero conocerlos en absoluto basándome en lo que me metieron en el buzón."}]}
          data is HASH(0x55fbf8768858)
          fritz@laptop:~/Documents$ ^C

      Source:

          #!/usr/bin/perl
          use v5.030;    # strictness implied
          use warnings;
          use Path::Tiny;
          use HTTP::Tiny;
          use JSON::MaybeXS;

          my $file_in  = path("/home/fritz/Desktop/1.enchanto.txt");
          my $file_out = path('/home/fritz/Desktop/1.enc_trans.txt');
          my $lang     = 'es';
          my $guts     = $file_in->slurp_utf8;
          my @spl      = split( '\n', $guts );
          my $ua       = HTTP::Tiny->new( 'verify_SSL' => '1' );

          for my $para (@spl) {
              say $para;
              my $payload    = "text=$para&target_lang=$lang";
              my $payloadlen = length($payload);
              my $response   = $ua->request(
                  'POST' => 'https://api-free.deepl.com/v2/translate',
                  {
                      headers => {
                          'Authorization'  => 'DeepL-Auth-Key redacted',
                          'Content-Length' => $payloadlen,
                          'Accept'         => '*/*',
                          'Content-Type'   => 'application/x-www-form-urlencoded',
                          'User-Agent'     => 'curl/7.55.1'
                      },
                      content => $payload,
                  },
              );
              die "Failed!\n" unless $response->{success};
              print $response->{content} if length $response->{content};
              my $content = $response->{content};
              say "content is $content";
              my $data = decode_json($content);
              say "data is $data";
              $file_out->spew_utf8( $para, $data );
          }
          __END__
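The script above pastes each paragraph into the form body as-is, so a paragraph containing `&`, `=`, or non-ASCII characters would not survive as a single `text` field. A minimal sketch of building the body with HTTP::Tiny's own `www_form_urlencode`, which percent-escapes each value. The helper names `deepl_form_body` and `deepl_translate_raw` are made up for illustration, and the endpoint and header are simply the ones already used in this thread:

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: URL-encoded body for one translate request.
sub deepl_form_body {
    my ( $text, $target_lang ) = @_;

    # An array ref keeps the fields in the order given; each key and
    # value is UTF-8 encoded and percent-escaped by HTTP::Tiny.
    return HTTP::Tiny->new->www_form_urlencode(
        [ text => $text, target_lang => $target_lang ] );
}

# Hypothetical helper: send one paragraph, return HTTP::Tiny's response hash.
# Not called here, since it needs network access and a real key.
sub deepl_translate_raw {
    my ( $ua, $auth_key, $text, $target_lang ) = @_;
    return $ua->post(
        'https://api-free.deepl.com/v2/translate',
        {
            headers => {
                'Authorization' => "DeepL-Auth-Key $auth_key",
                'Content-Type'  => 'application/x-www-form-urlencoded',
            },
            content => deepl_form_body( $text, $target_lang ),
        },
    );
}
```

As far as I know, HTTP::Tiny fills in Content-Length itself when `content` is a plain string, so the manual `length($payload)` bookkeeping should not be needed.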

      I typically use bliako's software for this, but I couldn't reconcile that with HTTP::Tiny:

          use LWP::UserAgent;
          use HTTP::Request;
          use Data::Roundtrip;
          ...
          my $req = HTTP::Request->new( ...
          $response = $ua->request($req);
          die "Error fetching: " . $response->status_line
              unless $response->is_success;
          my $content = $response->decoded_content;
          my $data    = Data::Roundtrip::json2perl($content);
          die "failed to parse received data:\n$content\n"
              unless exists $data->{'elevation'};
          return $data->{'elevation'};

      In particular I don't see how to do this without these modules:

          my $content = $response->decoded_content;
          my $data    = Data::Roundtrip::json2perl($content);

      Anyways, I'm elated that I already have Spanish that I don't understand, and hope that someone can help me over the finish line with the JSON.

      Cheers from the Rocky Mountains,

        data is HASH(0x55fbf8768858)

        You are receiving a JSON string from the remote server with your script (great!). With HTTP::Tiny that string is stored in $response->{content} (LWP::UserAgent would hand it to you via $response->decoded_content). Then you correctly convert that string, using decode_json(), into a Perl data structure and store it in the variable $data, which in this case is a reference to a hash. You can use this data structure ($data) as usual, e.g. my $text1 = $data->{'translations'}->[0]->{'text'}. The data structure is this, for my case:

            {
              'translations' => [
                {
                  'text' => 'vencino hola',
                  'detected_source_language' => 'ES'
                }
              ]
            };
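A minimal sketch of that access path, run against one of the response bodies printed earlier in the thread. It uses the core module JSON::PP in place of JSON::MaybeXS (an intentional swap so the example has no CPAN dependency; `decode_json` is called the same way in both). `translated_texts` is a hypothetical helper name:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: return every translated string in a response body.
sub translated_texts {
    my ($json_bytes) = @_;

    # decode_json expects UTF-8 encoded bytes and returns a hash reference.
    my $data = decode_json($json_bytes);

    # "translations" is an array of hashes; collect each "text" value.
    return map { $_->{text} } @{ $data->{translations} // [] };
}

my $sample =
  '{"translations":[{"detected_source_language":"EN","text":"Hola vecino de Watercress lane,"}]}';

print "$_\n" for translated_texts($sample);
```

In the original loop, the string returned here is what would be written to the output file instead of `$data` itself; Path::Tiny's `append_utf8` might suit that better than `spew_utf8`, which replaces the file on every pass.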

        If your question is how to print this data structure ($data) and get something meaningful instead of data is HASH(0x55fbf8768858), then there are lots of choices, I know of 2: Data::Dumper's Dumper() and Data::Roundtrip's perl2dump()*, which you mentioned already. Pick your poison.
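For the Data::Dumper route, a small sketch; the settings are only one reasonable choice, and `dump_structure` is a hypothetical wrapper name:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: render a structure as readable, repeatable text.
sub dump_structure {
    my ($data) = @_;
    local $Data::Dumper::Sortkeys = 1;    # sort hash keys for stable output
    local $Data::Dumper::Indent   = 1;    # fixed, compact indentation
    local $Data::Dumper::Terse    = 1;    # omit the leading "$VAR1 = "
    return Dumper($data);
}

my $data = {
    translations => [
        {
            detected_source_language => 'EN',
            text                     => 'Hola vecino de Watercress lane,',
        },
    ],
};

# Prints the nested hash/array layout instead of HASH(0x...).
print dump_structure($data);
```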

        Of course you can write your own "data dumper", and that would be a nice climb up Recursion Peak and the Monastery is right behind you.
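For anyone tempted by that climb, a minimal sketch of a recursive walker. It handles only hashes, arrays, and plain scalars (no objects, code refs, or cycle detection), and `walk` is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: return an indented text outline of a nested structure.
sub walk {
    my ( $node, $depth ) = @_;
    $depth //= 0;
    my $pad = '  ' x $depth;

    if ( ref $node eq 'HASH' ) {
        # One line per key, then recurse into the value one level deeper.
        return join '',
          map { $pad . $_ . ":\n" . walk( $node->{$_}, $depth + 1 ) }
          sort keys %$node;
    }
    if ( ref $node eq 'ARRAY' ) {
        # One line per index, then recurse into the element.
        return join '',
          map { $pad . "[$_]\n" . walk( $node->[$_], $depth + 1 ) }
          0 .. $#$node;
    }

    # Base case: a plain scalar (or undef) ends the recursion.
    return $pad . ( defined $node ? $node : 'undef' ) . "\n";
}

print walk( { translations => [ { text => 'hola' } ] } );
```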

        Note that you have included an auth-key in your SSCCE. You don't want that. *They* have now linked your CC, your translations and your monk handle and thus your comments. Brrrr (but hey the danger is not with "They" but with evil dictators outside Western Democracies /sic/ /sarcasm-off)

        bw, bliako

        Edit: *) Data::Roundtrip depends on Data::Dumper, so it would be simpler to use the latter, the former offers data converters and an easy way to "not-bloody-escape-unicode" which the latter does incessantly, to my eyeballs' irritation.

        See here for some more ideas for your client. See also WWW::Curl. If I didn’t mention this already.

        And remember from the DeepL API:

        "… You should not put the key in publicly-distributed code… If your authentication key becomes compromised, you can recreate a new key and discard the old one in your account settings."

        Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
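One common way to follow the DeepL advice quoted above is to keep the key out of the script and read it from the environment at run time. A minimal sketch; the variable name `DEEPL_AUTH_KEY` and the helper `deepl_auth_header` are assumptions for illustration, not something DeepL or this thread prescribes:

```perl
use strict;
use warnings;

# Hypothetical helper: build the Authorization header from an
# environment variable so the key never appears in posted code.
sub deepl_auth_header {
    my $key = $ENV{DEEPL_AUTH_KEY};    # assumed variable name
    die "Set DEEPL_AUTH_KEY in the environment first\n"
      unless defined $key && length $key;
    return ( 'Authorization' => "DeepL-Auth-Key $key" );
}

# Possible use inside the headers hash of an HTTP::Tiny request:
#   headers => { deepl_auth_header(), 'Accept' => '*/*' },
```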

      "…better results…"

      Out of curiosity, let's compare:

      «The Crux of the Biscuit is the Apostrophe»

        I'm really impressed by both of those. (Caveat: I don't really know German, I mostly understand it by analogy with Dutch - I'm interested to know what a native speaker thinks.)

        I'm not sure, but I suspect the DeepL transcript is slightly the more idiomatic. I'm slightly confused, however, by "Sie kann die Welt beeinflussen" - the switch from "er" used in the preceding sentences to "sie" seems odd (contrasted with the consistent "es" in the Google translation), but maybe there's some grammatical requirement.

        FWIW I've been struggling over the last few months with Google and Yandex translations of English <-> Russian, in correspondence with a group of mathematicians. Both of them seem to do a pretty terrible job translating those mathematical discussions - certainly much worse than the quality of these English-German translations might cause me to expect. Restricting myself to short, idiom-free sentences does not appear to have helped.

Re: De-googleizing translation scripts
by Bod (Parson) on Nov 05, 2022 at 11:55 UTC

    Some years ago I used Lingua::Translate to convert from English to French.
    It was not a complete success!

    As Google Translate improved, the need to go elsewhere pretty much disappeared and, typically, I wouldn't consider going anywhere else as they seem to have the best-in-class solution. But your case is an exception...

    Lingua::Translate used Babelfish from Yahoo! which I believe was shut down. But there is BabelFish which may have an API that is helpful to you.