Translating source code from Japanese to English using WWW::Babelfish

by isotope (Deacon)
on Oct 15, 2001 at 21:14 UTC

A year or so ago, I started working on a project we inherited from our Japanese parent company. As part of that, we received a substantial amount of source code, all commented in Japanese. We tried several ways to translate that source code, but the translations were mostly nonsensical and usually ended up breaking the code.

Then someone in the company pointed out that we could translate comments one at a time: open the source in Microsoft Word, copy and paste into Internet Explorer (so the multi-byte character encoding is handled properly), and run the text through Babelfish. The light bulb turned on, and I started working on a program to translate entire files.

For any CodeWright users out there, I discovered that CW's built-in perl is too broken to mess with, so I abandoned the macro idea. Instead, I focused on translating the entire file from the command line.

In the last few weeks, I finally figured out how to translate without horribly disfiguring the code. Since Japanese phrases consist of a sequence of characters with no spaces, I focus on chunks of non-whitespace for the translation, and replace the translated text in-place, thus avoiding the whitespace-munging Babelfish unfortunately performs. This also made it really easy to build a translation dictionary to avoid repeat lookups on Babelfish.
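The chunk-and-replace idea can be tried offline with a much smaller sketch. This is not the full program: the `%fake_babelfish` hash and `fake_translate` are hypothetical stand-ins for the real Babelfish lookup, and a plain byte-range regex stands in for Jcode's `getcode` when deciding whether a chunk is pure ASCII. It assumes the script file itself is saved as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the remote service, so the sketch runs offline.
# A real run would call WWW::Babelfish's translate() instead.
my %fake_babelfish = (
    '初期化'   => 'initialization',
    'カウンタ' => 'counter',
);
my $lookups = 0;    # counts how often the "service" is actually asked

sub fake_translate {
    my ($chunk) = @_;
    $lookups++;
    return $fake_babelfish{$chunk};    # undef when nothing is known
}

my %dict;    # translation dictionary: chunk => translation (or undef)

sub translate_line {
    my ($text) = @_;

    # Work on runs of non-whitespace so the line's spacing is never touched.
    my @chunks = $text =~ m!(\S+)!g;
    for my $chunk (@chunks) {
        # Pure-ASCII chunks (code, comment markers) need no translation.
        next unless $chunk =~ m![^\x00-\x7F]!;

        # Strip comment markers glued to the Japanese text.
        $chunk =~ s!^(?://|#+|/\*+)!!;
        $chunk =~ s!\*/$!!;
        next unless length $chunk;

        # Ask the service only for chunks not seen before.
        $dict{$chunk} = fake_translate($chunk) unless exists $dict{$chunk};

        my $trans = $dict{$chunk};
        next unless defined $trans;    # failed lookup: leave the text alone

        # Replace in place; everything around the chunk stays as it was.
        $text =~ s!\Q$chunk\E!$trans!;
    }
    return $text;
}

my $line = "    int n = 0;\t\t// カウンタ 初期化\n";
print translate_line($line);
print translate_line($line);    # second pass is served from %dict
print "Lookups sent to the service: $lookups\n";
```

Both passes print the same line with its indentation and tabs intact, and the second pass does not increase `$lookups`, because every chunk is already in `%dict`.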

I developed the program with Jcode version 0.68 and a patched version of WWW::Babelfish 0.09. The Patch.

#!perl -w

use strict;

###########################################################################
# jtoeng.pl
#
# A Japanese to English file translator
# by Brett T. Warden
# NEC Eluminant Technologies, Inc.
# Created 21 February 2001
# Lastmod 15 October 2001
###########################################################################

use Jcode;
use WWW::Babelfish;
use Storable qw(nfreeze thaw);
use File::Basename;
#use Data::Dumper;

my $DEBUG = 0;

print "\nConnecting to translator... please wait.\n\n" if $DEBUG;
my $babel = new WWW::Babelfish();
die( "Babelfish server unavailable\n" ) unless defined($babel);

my %dict;

if(open(DICT, '< jtoeng.dict')) {
    binmode(DICT);
    local($/);
    my $frozen = <DICT>;
    if(my $ref = thaw($frozen)) {
        # Yeah this is inefficient.  Ideally I'd use a DB anyway.
        %dict = %{$ref};
    }
    close(DICT);
}

if(@ARGV) {
    ARG: for(@ARGV) {
        print "Trying to read $_\n";
        if(open(IFILE, "<" . $_)) {
            binmode(IFILE);

            my ($name, $path, $suffix) = fileparse($_, '\..*');
            my $outfile = $path . $name . '.english' . $suffix;

            print "Preparing $outfile\n";

            if(open(OFILE, ">" . $outfile)) {
                binmode(OFILE);
                my $fh = select(OFILE);
                $| = 1;
                select($fh);

                my $TRANSLATIONS = 0;
                my $BABELFISHINGS = 0;

                # Translate
                print "Translating $_\n";
                translate(\*IFILE, \*OFILE, \$TRANSLATIONS, \$BABELFISHINGS);

                close(OFILE);

                print "Performed $TRANSLATIONS translations, of which $BABELFISHINGS were directly requested from Babelfish.\n";
                print "Translation complete\n\n";
            } else {
                die "Unable to write $outfile: $!\n";
            }
            close(IFILE);
        } else {
            warn "Unable to read $_: $!\n";
            next ARG;
        }
    }
} else {
    my $TRANSLATIONS = 0;
    my $BABELFISHINGS = 0;
    translate(\*STDIN, \*STDOUT, \$TRANSLATIONS, \$BABELFISHINGS);
    print "Performed $TRANSLATIONS translations, of which $BABELFISHINGS were directly requested from Babelfish.\n" if $DEBUG;
}

sub translate {
    my $IFH = shift or return;
    my $OFH = shift or return;
    my $TRANSLATIONS = shift;
    my $BABELFISHINGS = shift;

    LINE: while(my $text = <$IFH>) {
        # If it's ascii, then it doesn't need to be translated?
        my $code = getcode($text) || '';
        print "Line coding: $code\n" if($code and $DEBUG);
        unless($code eq 'ascii') {
            if($code) {
                # Not ascii, run through Jcode.
                my $j = Jcode->new($text);
                $text = $j->utf8;
            }

            my @chunks = $text =~ m!(\S+)!g;

            CHUNK: for(@chunks) {
                my $chunk = $_;
                my $chunk_code = getcode($chunk) || '';
                next CHUNK if($chunk_code and ($chunk_code eq 'ascii'));

                $chunk =~ s!^//!!;
                $chunk =~ s!^#+!!;
                $chunk =~ s!^/\*+!!;
                $chunk =~ s!\*/$!!;

                print "Chunk: $chunk\n" if $DEBUG;

                my $trans;

                if(exists($dict{$chunk})) {
                    if(defined($dict{$chunk})) {
                        $trans = $dict{$chunk};
                        print "Dictionary: $chunk = $trans\n" if $DEBUG;
                        $text =~ s!\Q$chunk!$trans!;
                        $$TRANSLATIONS++ if $TRANSLATIONS;
                    } else {
                        print "Skipping $chunk -- translation failed previously.\n" if $DEBUG;
                    }
                } else {
                    print "\n" if $DEBUG;
                    print "Translating: $chunk\n" if $DEBUG;
                    $trans = $babel->translate(
                        source      => 'Japanese',
                        destination => 'English',
                        text        => $chunk,
                        delimiter   => "\n",
                    );

                    if(defined($trans)) {
                        # Replace those annoying &nbsp;s that Babelfish loves.
                        $trans =~ s!&nbsp;! !g;
                        chomp $trans;
                        if($trans =~ m!^\s*$!) {
                            # Babelfish returned nothing.
                            print "No useful translation returned.\n" if $DEBUG;
                            sleep 2 if $DEBUG;
                            # Make an entry in the dict in case somebody
                            # wants to try to translate it later.
                            $dict{$chunk} = undef;
                            $chunk = '';
                            $trans = '';
                        } else {
                            $$TRANSLATIONS++ if $TRANSLATIONS;
                            $$BABELFISHINGS++ if $BABELFISHINGS;
                            if($chunk ne $trans) {
                                # Answer looks useful.  Use it and keep it.
                                $text =~ s!\Q$chunk!$trans!;
                                $dict{$chunk} = $trans;
                                print "Translated\n\t$chunk\nto\n\t$trans\n" if $DEBUG;
                            } else {
                                # Store a placeholder in the dict
                                # so we don't waste time sending it to
                                # Babelfish again
                                $dict{$chunk} = undef;
                                print "Babelfish returned what we sent it.\n" if $DEBUG;
                            }
                            if((my $freeze = nfreeze(\%dict)) and
                                open(DICT, "> jtoeng.dict")) {
                                binmode(DICT);
                                print DICT $freeze;
                                close(DICT);
                            }
                        }
                    } else {
                        warn "Lookup on $chunk failed.\n";
                    }
                }
            }
        }
        print "\n\n<" . '-' x 79 . "\n" if $DEBUG;
        print $text if $DEBUG;
        print '-' x 79 . ">\n\n" if $DEBUG;
        print $OFH $text;
    }
    return;
}

Adding the dictionary is a bit of a hack, as I just used Storable. An enhancement would be to use a database instead, providing concurrency protection and speed improvements. The current approach, however, requires much less user setup.
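As one possible direction for that enhancement, here is a minimal sketch that keeps the dictionary in a tied DBM file, so each new entry is written as it is added rather than re-freezing the whole hash after every lookup. It is an assumption about how a database-backed version might look, not part of jtoeng.pl. `SDBM_File` is chosen only because it ships with Perl; the `open_dict` helper and the file name are hypothetical. Two caveats: DBM values are plain strings, so an empty string stands in for the `undef` "failed lookup" marker, and a tied DBM file does no locking by itself, so concurrent writers would still need something like `flock` or a real database server.

```perl
use strict;
use warnings;
use Fcntl;                     # O_RDWR, O_CREAT
use SDBM_File;                 # core module; DB_File would work similarly
use File::Temp qw(tempdir);

# Hypothetical location; a temp directory keeps the sketch self-cleaning.
my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/jtoeng_dict";

# Hypothetical helper: tie a hash to the DBM file and return a reference.
sub open_dict {
    my ($path) = @_;
    tie(my %dict, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $path: $!\n";
    return \%dict;
}

# First run: entries go to disk as they are assigned.
my $dict = open_dict($base);
$dict->{'初期化'} = 'initialization';
$dict->{'謎'}     = '';    # empty string marks a lookup that failed
untie %$dict;

# Later run: reopen the file and read the entries back.
my $again = open_dict($base);
for my $chunk ('初期化', '謎', '未登録') {
    if (!exists $again->{$chunk}) {
        print "$chunk: not in dictionary, would ask Babelfish\n";
    } elsif ($again->{$chunk} eq '') {
        print "$chunk: failed previously, skipping\n";
    } else {
        print "$chunk: $again->{$chunk}\n";
    }
}
untie %$again;
```

SDBM also limits the size of each key/value pair, which is fine for short comment chunks but is another reason to treat this as a starting point only.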



--isotope
http://www.skylab.org/~isotope/

Edit - Petruchio Sun Oct 21 10:30:26 UTC 2001: Added READMORE tag.
