PerlMonks  

Re: chunking up texts correctly for online translation

by haukex (Archbishop)
on Jun 14, 2019 at 22:02 UTC ( [id://11101368]=note )


in reply to chunking up texts correctly for online translation

Q1: How do I rewrite my script so that I get paragraph-sized chunks getting sent to google regardless of line feed encoding?

You're right that paragraph mode ($/ = "") doesn't work right when reading a CRLF file on *NIX. You could enable the :crlf PerlIO layer, which will leave plain LF as is but convert CRLF to LF, and paragraph mode will work: open my $fh, '<:crlf', $file
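
A minimal sketch of the :crlf suggestion, assuming a small CRLF sample written to a temporary file so the example is self-contained. The helper name read_paragraphs and the sample text are made up for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in paragraph mode, converting CRLF to LF
# on the way in so blank lines are recognised on *NIX too.
sub read_paragraphs {
    my ($file) = @_;
    open my $fh, '<:crlf', $file or die "Cannot open $file: $!";
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close $fh;
    return @paragraphs;
}

# Build a CRLF-terminated sample file (made-up content).
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;      # write the bytes exactly as given
print {$tmp_fh} "First line\r\nsecond line\r\n\r\nNext paragraph\r\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

my @paragraphs = read_paragraphs($tmp_name);
printf "%d paragraphs\n", scalar @paragraphs;
```

Without the :crlf layer, the "blank" line in this sample would still contain a carriage return on *NIX, so paragraph mode would not see a paragraph break there.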

By the way, to pick two nits: Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors.
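
As a sketch of what checking open for errors can look like: the helper name open_or_die and the nonexistent path below are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading, or die with the file name
# and the operating system's error message ($!).
sub open_or_die {
    my ($file) = @_;
    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!\n";
    return $fh;
}

# Exercise the failure path with a path that should not exist.
my $ok = eval { open_or_die('/no/such/dir/no-such-file.txt'); 1 };
print "open failed as expected: $@" unless $ok;
```

Another option is the core autodie pragma, which makes open and similar built-ins throw an exception on failure without an explicit "or die".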

Q2: Do I really need all of this to extract one paragraph of translation?

In general I'd say get it fully working first, and leave the simplification of the code for a little later :-)

Re^2: chunking up texts correctly for online translation
by Aldebaran (Curate) on Jun 17, 2019 at 23:46 UTC
    enable the :crlf PerlIO layer

    Thx haukex, that worked. Even so, the input to google exceeded their rate limit, so I had to slow it down. I added sleep time and a means to keep track of how long a file takes to translate.

    for my $file (@texts) {
        local $/ = "";
        open my $fh, '<:crlf', $file or die "Cannot open $file: $!";
        my $base_name = path("$file")->basename;
        my $out_file  = path( $out_dir, $base_name )->touchpath;
        say "out_file is $out_file";

        ## time it
        use Benchmark;
        my $t0 = Benchmark->new;
        while (<$fh>) {
            print "New Paragraph: $_";
            my $r = get_trans( $wgt, $_ );
            for my $trans_rh ( @{ $r->{data}->{translations} } ) {
                my $result = $trans_rh->{translatedText};
                say "result is $result ";
                my @lines = split /\n/, $result;
                push @lines, "\n";
                path("$out_file")->append_utf8(@lines);
                sleep(1);    # stay under the rate limit
            }
        }
        my $t1 = Benchmark->new;
        my $td = timediff( $t1, $t0 );
        print "$file took:", timestr($td), "\n";
        sleep(3);
        close $fh;
    }

    84-0.txt is Shelley's Frankenstein, which is about 450 k in length. Of the $300 credit they give anyone to sign up for their API, I used 7 cents of it, so I'm down to $297.22 left. It made for an interesting way to skim both the original and the translation. This ballparks 20 minutes as an outer limit:

    /home/bob/Documents/meditations/castaways/Translate1/data/84-0.txt took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)
    $

    Q3: What do the usr and sys numbers mean?

    Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors.

    I did fix both of these but went with Translate1. I did this because I know there is going to be a Translate2 that will not work with Translate1. I've heard such naming called "trampolining," and that it is something to be avoided. Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

    I found that I had to go back to make clean every time I made a change in the script, so I wrote a little helper bash script:

    $ cat 1.google.sh
    #!/bin/bash
    pwd
    make clean
    perl Makefile.PL
    make
    make test
    make install
    ls
    cd blib
    cd script
    ./3.my_script.pl
    $

    I offer this as a keystroke reduction mechanism, not wanting to be OT.

    The translations went well with the exception of certain characters. Let's look at a couple of paragraphs with differing tags. Here is the output with pre tags:

    New Paragraph: €œAre you mad, my friend?€ said he. €œOr whither does your
    senseless curiosity lead you? Would you also create for yourself and the
    world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek
    to increase your own.€
    
    result is - Ты злишься, друг мой? - спросил он. Или куда ты
    бессмысленное любопытство приведет тебя? Не могли бы вы также создать для себя и
    мир демонический враг? Мир, мир! Узнай мои страдания и не ищи
    увеличить свой собственный. 
    
     
    New Paragraph: Frankenstein discovered that I made notes concerning his history; he asked
    to see them and then himself corrected and augmented them in many places,
    but principally in giving the life and spirit to the conversations he held
    with his enemy. €œSince you have preserved my narration,€ said
    he, €œI would not that a mutilated one should go down to
    posterity.€
    
    result is Франкенштейн обнаружил, что я делал заметки, касающиеся его истории; он спросил
    чтобы увидеть их, а затем сам исправить и дополнить их во многих местах,
    но главным образом в том, чтобы дать жизнь и дух разговорам, которые он вел
    со своим врагом. "Так как вы сохранили мое повествование", сказал
    он, Я бы не хотел, чтобы изуродованный
    posterity.

    Here is what the 1st paragraph looks like in code tags:

    New Paragraph: &#128;&#156;Are you mad, my friend?&#128; said he. &#128;&#156;Or whither does your senseless curiosity lead you? Would you also create for yourself and the world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek to increase your own.&#128;

    For some reason, Shelley quotes paragraphs as a matter of course, and they are getting garbled as I read in under these conditions:

    #!/usr/bin/perl -w
    use 5.011;
    use WWW::Google::Translate;
    use Data::Dumper;
    use open OUT => ':utf8';
    use Path::Tiny;
    use lib ".";
    use translate;
    binmode STDOUT, 'utf8';
    use POSIX qw(strftime);

    Google sometimes gives the correct rendering of quotes in Russian. They do it somewhat like this: << >>.

    Q5: How do I change my script so that these characters are rendered correctly? They look right as I read them in gedit.

    Finally, as I look at the arguments in Makefile.PL:

    my %WriteMakefileArgs = (
        NAME               => 'Translate1',
        AUTHOR             => q{gilligan <gilligan@island.coconut>},
        VERSION_FROM       => 'lib/Translate1.pm',
        LICENSE            => 'artistic_2',
        MIN_PERL_VERSION   => '5.006',
        CONFIGURE_REQUIRES => {
            'ExtUtils::MakeMaker' => '0',
        },
        TEST_REQUIRES => {
            'Test::More' => '0',
        },
        PREREQ_PM => {
            #'ABC'              => '1.6',
            #'Foo::Bar::Module' => '5.0401',
        },
        EXE_FILES => ['lib/3.my_script.pl'],
        dist  => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', },
        clean => { FILES => 'Translate1-*' },
    );

    Q6: How would I determine which version of WWW::Google::Translate to require?

    Thank you for your comments,

      Try <:crlf:encoding(UTF-8), see PerlIO.
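
A small sketch of that layer combination, assuming the input file is UTF-8 with CRLF line endings. The script header shown earlier sets only output layers (use open OUT => ':utf8'), so it seems likely the input was being read as undecoded bytes and then encoded again on output. The sample text and temporary file here are made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for a line wrapped in curly quotes
# (U+201C and U+201D), terminated with CRLF.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "\xE2\x80\x9CAre you mad?\xE2\x80\x9D\r\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Read it back: :crlf turns CRLF into LF, :encoding(UTF-8) decodes the bytes
# into Perl characters.
open my $in, '<:crlf:encoding(UTF-8)', $tmp_name
    or die "Cannot open $tmp_name: $!";
my $line = <$in>;
close $in;
chomp $line;

# Encode on the way out so the terminal receives UTF-8 again.
binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters, first is U+%04X\n", length($line), ord($line);
print "$line\n";
```

Once decoded, each curly quote counts as a single character rather than three bytes, which is what the translation API and the output layer need in order to treat it correctly.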

        That did work, daxim. I incorporated jdkrahn's criticisms as well. I'm doing more with bliako's scripts employing Getopt::Long but will reply to his meditation. Progress is slow: lots of reading. Sometimes getting the time for an appropriate write-up is the pinch-point. Thank Markov for the rain....

        I did get a primitive, working module out of this thread, but as this thread is trailing off, I would like to focus on the last question I asked:

        Q6: How would I determine which version of WWW::Google::Translate to require?

        I have now gone in and read all I could find at module source. I wish I had done this initially, as I might have patterned my source off of dylan's. One thing I find is this:

        Copyright (C) 2017 by Dylan Doxey

        This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.4 or, at your option, any later version of Perl 5 you may have available.

        Another is:

        package WWW::Google::Translate;
        our $VERSION = '0.10';

        If I had to guess, I might think that this is the only version that has existed, but such an assessment is not made from the point of view of experience. In Makefile.PL, should I just go:

        PREREQ_PM => { WWW::Google::Translate => '.1', },

        and be done with it? Thanks for your comment,

      I was going through replies and I noticed there were some unanswered questions:

      took:1180 wallclock secs (23.34 usr +  1.36 sys = 24.70 CPU)

      Q3: What do the usr and sys numbers mean?

      "User time" is the amount of time spent in user-mode code (your code plus any libraries it's using), and "system time" is the amount of time spent in the kernel, such as system calls.

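To make the distinction concrete, here is a rough sketch that measures wallclock, user, and system time around some CPU-bound work followed by a sleep. The workload is invented; in the figures quoted above, 1180 wallclock seconds against 24.70 CPU seconds suggests most of the run was spent waiting (on sleep and, presumably, the network) rather than computing.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $wall0 = time;
my ( $user0, $sys0 ) = times;    # CPU seconds used so far by this process

# CPU-bound work: counts mostly towards user time.
my $sum = 0;
$sum += sqrt($_) for 1 .. 500_000;

# Waiting: adds to wallclock time but uses almost no CPU.
sleep 0.5;

my ( $user1, $sys1 ) = times;
my $wall = time - $wall0;
my $user = $user1 - $user0;
my $sys  = $sys1 - $sys0;
my $cpu  = $user + $sys;

printf "wallclock %.2fs, user %.2fs, sys %.2fs, CPU %.2fs\n",
    $wall, $user, $sys, $cpu;
```

The sleep shows up only in the wallclock figure, which is the same pattern as a translation loop that spends its time in sleep(1) and HTTP round trips.
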
      Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

      This depends very much on how you plan on using and releasing this module. If this is something you're going to release on CPAN, then it's definitely important to put some thought into naming and versioning. For example, it'd be best to work beneath a single namespace (just for example Lingua::Translate::*), and especially not to pollute the top level with multiple namespaces such as Translate1:: and Translate2:: - instead, it'd be best to use a naming scheme such as Translate::MyEngine::V1 and Translate::MyEngine::V2.

      On the other hand, if this is something for your personal use, then you are free to do whatever you like and what is practical for you - you can do version control with Git, or, if you think that you'll be using multiple versions in parallel, naming like Translate1 and Translate2 (or maybe better: Translate::V1 and Translate::V2) would probably work too. Of course, it's also possible to switch between these two development modes - I've done rapid prototyping in a repository that ended up being quite littered with experiments etc., and then when it came time to release, I set up a new, clean repository into which I just put the files that should be released, added proper versioning, better naming, etc.

      “Are you mad, my friend?” said he. ... Q5: How do I change my script so that these characters are rendered correctly?

      That's definitely an encoding problem, but you'd have to show us a Short, Self-Contained, Correct Example that reproduces the issue. I showed an example of what information to provide in the case of encoding issues here.
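
One possible way to gather that kind of information, offered as a sketch rather than as the approach from the linked node: print the code points of a string, and dump it with Data::Dumper's Useqq option so the result does not depend on the terminal. The helper name codepoints and the sample string are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Data::Dumper;
$Data::Dumper::Useqq = 1;    # escape non-ASCII characters in the dump
$Data::Dumper::Terse = 1;

# Hypothetical helper: list each character of a string as U+XXXX.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# UTF-8 bytes as they arrive when no decoding layer is used ...
my $raw = "\xE2\x80\x9CHi\xE2\x80\x9D";
# ... and the same data after decoding into characters.
my $decoded = decode( 'UTF-8', "$raw" );

print "raw:     ", codepoints($raw),     "\n";
print "decoded: ", codepoints($decoded), "\n";
print Dumper( $raw, $decoded );
```

If a string read from the file looks like the "raw" case (eight values below U+0100 instead of four characters), the input was never decoded, and printing it through a UTF-8 output layer will garble it.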

      Q6: How would I determine which version of WWW::Google::Translate to require?

      That depends on what features of the module you're using, or whether older versions had bugs that your code is having problems with. For example, the changelog shows that the format parameter was added in 0.06, headers in 0.08, and model in 0.10. Another thing to look for might be whether newer versions changed the dependencies. Usually, I'll require the lowest possible version of a module, unless there have been egregious bugs in older versions, in which case I'll require the version after those bugs were fixed.
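
The "lowest possible version" rule can be sketched as a lookup over the changelog figures quoted above. The helper name min_version_for is invented, and the table covers only the three parameters mentioned here, not the module's full history:

```perl
use strict;
use warnings;
use version;

# Version in which each optional parameter appeared, per the changelog
# figures quoted above (not an exhaustive list).
my %added_in = (
    format  => '0.06',
    headers => '0.08',
    model   => '0.10',
);

# Hypothetical helper: the lowest version supporting every listed feature.
sub min_version_for {
    my @features = @_;
    my $min = '0';
    for my $feature (@features) {
        die "Unknown feature: $feature\n" unless exists $added_in{$feature};
        my $candidate = $added_in{$feature};
        $min = $candidate
            if version->parse($candidate) > version->parse($min);
    }
    return $min;
}

print "format only:      ", min_version_for('format'), "\n";
print "format and model: ", min_version_for(qw(format model)), "\n";
```

The result would then go into Makefile.PL with the module name as a quoted key, for example PREREQ_PM => { 'WWW::Google::Translate' => '0.06' } for code that relies only on the format parameter.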
