Re: chunking up texts correctly for online translation

Re^2: chunking up texts correctly for online translation by Aldebaran (Curate) on Jun 17, 2019 at 23:46 UTC
"enable the :crlf PerlIO layer"

Thx haukex, that worked. Even so, the input to Google exceeded their rate limit, so I had to slow it down. I added sleep time and a means to keep track of how long a file takes to translate.

```perl
for my $file (@texts) {
    local $/ = "";    # paragraph mode
    open my $fh, '<:crlf', $file or die;
    my $base_name = path("$file")->basename;
    my $out_file  = path( $out_dir, $base_name )->touchpath;
    say "out_file is $out_file";
    ## time it
    use Benchmark;
    my $t0 = Benchmark->new;
    while (<$fh>) {
        print "New Paragraph: $_";
        my $r = get_trans( $wgt, $_ );
        for my $trans_rh ( @{ $r->{data}->{translations} } ) {
            my $result = $trans_rh->{translatedText};
            say "result is $result ";
            my @lines = split /\n/, $result;
            push @lines, "\n";
            path("$out_file")->append_utf8(@lines);
            sleep(1);    # stay under the rate limit
        }
    }
    my $t1 = Benchmark->new;
    my $td = timediff( $t1, $t0 );
    print "$file took:", timestr($td), "\n";
    sleep(3);
    close $fh;
}
```

84-0.txt is Shelley's Frankenstein, which is about 450 k in length. Of the $300 credit they give anyone to sign up for their API, I used 7 cents of it, so I'm down to $297.22 left. It made for an interesting way to skim both the original and the translation. This ballparks 20 minutes as an outer limit:

`/home/bob/Documents/meditations/castaways/Translate1/data/84-0.txt took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)`

Q3: What do the usr and sys numbers mean?

"Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors."

I did fix both of these but went with Translate1. The reason I did this is that I know there is going to be a Translate2 that will not work with Translate1. I've heard such naming called "trampolining," and something to be avoided.

Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.
I found that I had to go back to `make clean` every time I made a change in the script, so I wrote a little helper bash script:

```
$ cat 1.google.sh
#!/bin/bash
pwd
make clean
perl Makefile.PL
make
make test
make install
ls
cd blib
cd script
./3.my_script.pl
$
```

I offer this as a keystroke reduction mechanism, not wanting to be OT. The translations went well with the exception of certain characters. Let's look at a couple paragraphs with differing tags. Here is output with pre tags:

New Paragraph: ā€œAre you mad, my friend?ā€¯ said he. ā€œOr whither does your senseless curiosity lead you? Would you also create for yourself and the world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek to increase your own.ā€¯

result is - Ты злишься, друг мой? - спросил он. «Или куда ты бессмысленное любопытство приведет тебя? Не могли бы вы также создать для себя и мир демонический враг? Мир, мир! Узнай мои страдания и не ищи увеличить свой собственный. €

New Paragraph: Frankenstein discovered that I made notes concerning his history; he asked to see them and then himself corrected and augmented them in many places, but principally in giving the life and spirit to the conversations he held with his enemy. ā€œSince you have preserved my narration,ā€¯ said he, ā€œI would not that a mutilated one should go down to posterity.ā€¯

result is Франкенштейн обнаружил, что я делал заметки, касающиеся его истории; он спросил чтобы увидеть их, а затем сам исправить и дополнить их во многих местах, но главным образом в том, чтобы дать жизнь и дух разговорам, которые он вел со своим врагом. "Так как вы сохранили мое повествование", сказал он, «Я бы не хотел, чтобы изуродованный posterity.ā

Here is what the 1st paragraph looks like in code tags:

`New Paragraph: āAre you mad, my friend?ā¯ said he. āOr whither does your senseless curiosity lead you? Would you also create for yourself and the world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek to increase your own.ā¯`

For some reason, Shelley quotes paragraphs as a matter of course, and they are getting garbled as I read in under these conditions:

```perl
#!/usr/bin/perl -w
use 5.011;
use WWW::Google::Translate;
use Data::Dumper;
use open OUT => ':utf8';
use Path::Tiny;
use lib ".";
use translate;
binmode STDOUT, 'utf8';
use POSIX qw(strftime);
```

Google sometimes gives the correct rendering of quotes in Russian. They do it somewhat like this: << >> .

Q5: How do I change my script so that these characters are rendered correctly? They look right as I read them in gedit.

Finally, as I look at the arguments in Makefile.PL:

```perl
my %WriteMakefileArgs = (
    NAME               => 'Translate1',
    AUTHOR             => q{gilligan <gilligan@island.coconut>},
    VERSION_FROM       => 'lib/Translate1.pm',
    LICENSE            => 'artistic_2',
    MIN_PERL_VERSION   => '5.006',
    CONFIGURE_REQUIRES => {
        'ExtUtils::MakeMaker' => '0',
    },
    TEST_REQUIRES => {
        'Test::More' => '0',
    },
    PREREQ_PM => {
        #'ABC'              => '1.6',
        #'Foo::Bar::Module' => '5.0401',
    },
    EXE_FILES => ['lib/3.my_script.pl'],
    dist      => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', },
    clean     => { FILES => 'Translate1-*' },
);
```

Q6: How would I determine which version of WWW::Google::Translate to require?

Thank you for your comments,
Re^3: chunking up texts correctly for online translation by daxim (Curate) on Jun 19, 2019 at 11:15 UTC
Try `<:crlf:encoding(UTF-8)`, see PerlIO.
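For readers following along, here is a minimal sketch of what that suggestion might look like in context. It assumes the input file is UTF-8 with CRLF line endings; `read_paragraphs` is a hypothetical helper name, not something from the original script or from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a file in paragraph mode, converting CRLF
# to LF and decoding the bytes as UTF-8 on the way in.
sub read_paragraphs {
    my ($file) = @_;
    local $/ = "";    # paragraph mode: records end at blank lines
    open my $fh, '<:crlf:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    my @paragraphs = <$fh>;
    close $fh;
    return @paragraphs;
}

# Encode output explicitly so decoded characters such as curly quotes
# are written as UTF-8 rather than triggering "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';

if (@ARGV) {
    print "New Paragraph: $_" for read_paragraphs( $ARGV[0] );
}
```

Without a decoding layer on input, Perl sees each UTF-8 curly quote as three separate bytes, and encoding those bytes again on output would explain garbling like the one shown in the parent post.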
Re^4: chunking up texts correctly for online translation by Aldebaran (Curate) on Jun 27, 2019 at 18:15 UTC
That did work, daxim. I incorporated jdkrahn's criticisms as well. I'm doing more with bliako's scripts employing Getopt::Long but will reply to his meditation. Progress is slow: lots of reading. Sometimes getting the time for an appropriate write-up is the pinch-point. Thank Markov for the rain....

I did get a primitive, working module out of this thread, but as this thread is trailing off, I would like to focus on the last question I asked:

Q6: How would I determine which version of WWW::Google::Translate to require?

I have now gone in and read all I could find at module source. I wish I had done this initially, as I might have patterned my source off of dylan's. One thing I find is this:

`Copyright (C) 2017 by Dylan Doxey This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.4 or, at your option, any later version of Perl 5 you may have available.`

Another is:

`package WWW::Google::Translate; our $VERSION = '0.10';`

If I had to guess, I might think that this is the only version that has existed, but such an assessment is not made from the point of view of experience. In Makefile.PL, should I just go:

`PREREQ_PM => { WWW::Google::Translate => '.1', },`

and be done with it?

Thanks for your comment,
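A hedged sketch of how that prerequisite might be written. Only the `NAME` and `PREREQ_PM` keys are shown, the module name is quoted as a plain string, and `'0.10'` simply mirrors the `$VERSION` line quoted above; whether that is really the minimum needed is an assumption, not something the module's documentation states.

```perl
use strict;
use warnings;

# Sketch of the relevant Makefile.PL arguments only. A real Makefile.PL
# would pass this hash to ExtUtils::MakeMaker's WriteMakefile().
my %WriteMakefileArgs = (
    NAME      => 'Translate1',
    PREREQ_PM => {
        # '0.10' copies the $VERSION string from the module source;
        # treating it as the minimum required version is an assumption.
        'WWW::Google::Translate' => '0.10',
    },
);

printf "requires WWW::Google::Translate %s\n",
    $WriteMakefileArgs{PREREQ_PM}{'WWW::Google::Translate'};
```

As a number, `.1` and `0.10` are the same value, so either spelling should express the same requirement; copying the string as it appears in the module is just easier to read. The installed version can be checked with `perl -MWWW::Google::Translate -e 'print WWW::Google::Translate->VERSION'`.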
Re^5: chunking up texts correctly for online translation by hippo (Bishop) on Jun 27, 2019 at 21:20 UTC
Re^5: chunking up texts correctly for online translation by haukex (Archbishop) on Jun 30, 2019 at 09:27 UTC
Re^3: chunking up texts correctly for online translation by haukex (Archbishop) on Jul 07, 2019 at 08:45 UTC
I was going through replies and I noticed there were some unanswered questions:

`took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)`

Q3: What do the usr and sys numbers mean?

"User time" is the amount of time spent in user-mode code (your code plus any libraries it's using), and "system time" is the amount of time spent in the kernel, such as system calls.

Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

This depends very much on how you plan on using and releasing this module. If this is something you're going to release on CPAN, then it's definitely important to put some thought into naming and versioning. For example, it'd be best to work beneath a single namespace (just for example `Lingua::Translate::`), and especially not to pollute the top level with multiple namespaces such as `Translate1::` and `Translate2::` - instead, it'd be best to use a naming scheme such as `Translate::MyEngine::V1` and `Translate::MyEngine::V2`.

On the other hand, if this is something for your personal use, then you are free to do whatever you like and what is practical for you - you can do version control with Git, or, if you think that you'll be using multiple versions in parallel, naming like `Translate1` and `Translate2` (or maybe better: `Translate::V1` and `Translate::V2`) would probably work too.

Of course, it's also possible to switch between these two development modes - I've done rapid prototyping in a repository that ended up being quite littered with experiments etc., and then when it came time to release, I set up a new, clean repository into which I just put the files that should be released, added proper versioning, better naming, etc.

`ā€Are you mad, my friend?ā€¯ said he.` ...

Q5: How do I change my script so that these characters are rendered correctly?

That's definitely an encoding problem, but you'd have to show us a Short, Self-Contained, Correct Example that reproduces the issue. I showed an example of what information to provide in the case of encoding issues here.

Q6: How would I determine which version of WWW::Google::Translate to require?

That depends on what features of the module you're using, or whether older versions had bugs that your code is having problems with. For example, the changelog shows that the `format` parameter was added in 0.06, `headers` in 0.08, and `model` in 0.10. Another thing to look for might be whether newer versions changed the dependencies. Usually, I'll require the lowest possible version of a module, unless there have been egregious bugs in older versions, in which case I'll require the version after those bugs were fixed.
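To make the Q3 answer concrete, here is a small sketch using the same Benchmark calls as the script earlier in the thread. The loop size and the one-second sleep are arbitrary values chosen for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(timediff timestr);

my $t0 = Benchmark->new;

# CPU-bound work in our own code: this is counted as "usr" time.
my $sum = 0;
$sum += sqrt($_) for 1 .. 200_000;

# Waiting does not use the CPU: it adds to wallclock time only.
sleep 1;

my $t1     = Benchmark->new;
my $td     = timediff( $t1, $t0 );
my $report = timestr($td);
print "took: $report\n";
```

In the run quoted above, 1180 wallclock seconds against 24.70 CPU seconds indicates that nearly all of the elapsed time was spent waiting (on the network and in the `sleep` calls) rather than computing.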

