Orthography Translation using Regex

Baz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Orthography Translation using Regex by Roger (Parson) on Feb 29, 2004 at 20:40 UTC
`my $translation_table = {
    'A' => 'A',   # translate from => to
    'a' => 'a',
    'r' => 'r',
    # etc
};
my $str = "Barra O Gri obhtha";
# build search pattern as 'pattern1|pattern2|...'
my $patterns = join '|', keys %$translation_table;
$str =~ s/($patterns)/$translation_table->{$1}/ge;`
Re: Re: Orthography Translation using Regex by ysth (Canon) on Feb 29, 2004 at 20:57 UTC
When doing `join '|',` for a regex, always sort the alternatives by string length (descending), to ensure you match e.g. 'sh' before 's'. And I suspect you want a map quotemeta in there as well. (Update: quotemeta isn't needed here; I misremembered the question as having special characters.)
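A minimal sketch of the advice above: sort the keys longest-first, escape them with quotemeta, and join them into one alternation. The table `%trans` and its entries are invented for illustration (the real mapping from the question is not shown in this thread), and the target for 's' is only a placeholder to create the s/sh conflict.

```perl
use strict;
use warnings;

# Hypothetical sample table, not the original poster's mapping.
my %trans = (
    'sh' => "\x{1E61}",   # s with dot above
    's'  => 'S',          # placeholder target, only here to clash with 'sh'
    'A/' => "\x{C1}",     # A with acute
);

# Longest keys first, so 'sh' is tried before 's'; ties are broken
# alphabetically to keep the pattern stable. quotemeta protects keys
# such as 'A/' that contain non-word characters.
my $pattern = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) || $a cmp $b }
              keys %trans;

my $str = 'A/ sheas';
(my $out = $str) =~ s/($pattern)/$trans{$1}/g;   # single pass over the string
```

Because the whole table is applied in one pass, text produced by one replacement is never scanned again by another.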
Re: Re: Re: Orthography Translation using Regex by Roger (Parson) on Feb 29, 2004 at 21:04 UTC
I am just giving an idea of how this could be done; I'm not vouching for the complete correctness of the regex. :-)
Re: Orthography Translation using Regex by matija (Priest) on Feb 29, 2004 at 20:46 UTC
The simplest way to do it would be to just put in a replace expression for each replacement you have to do, like this:
`$string =~ s/r/‰/g;`
You need to be careful to replace the longer original strings first: I see you have a conflict between s and sh. You need to convert sh first, otherwise you will wind up with šh, and that's not what you want.
Note that you could pack all the original strings and their translations into a hash, and then use a loop to translate all of them. I'm not sure that would buy you much, readability-wise, except that it could sort by the length of the original string for you.
`sub bylen { length($b) <=> length($a) }   # sorts by decreasing length of string
foreach my $from (sort bylen keys %trans) {
    $string =~ s/\Q$from\E/$trans{$from}/g;
}`
and the definitions would look like this:
`$trans{Mh}='´';
$trans{gh}='ł';`
etc.
Re: Orthography Translation using Regex by Anonymous Monk on Mar 01, 2004 at 06:10 UTC
I needed to do a similar task for pre-Unicode Mongolian to Unicode. The hash provided is only a sample. You can drop your hash in and it should run with little modification, if any. The full script is at http://students.washington.edu/blanch/downloads/encodingConverter.pl
`#!/usr/local/bin/perl
# encodingConverter.pl
# Duane L. Blanchard
# http://students.washington.edu/blanch/downloads/
# blanch@iname.com
use strict;
use warnings;
#use utf8;
use charnames ':full';
#hash tables for each encoding must be at the top
#Hash table Keys: Cyrillic Chars, Values: Unicode Char Names
my %name = (
    # Lowercase
    "ŕ" => "\N{CYRILLIC SMALL LETTER A}",
    "á" => "\N{CYRILLIC SMALL LETTER BE}",
    "â" => "\N{CYRILLIC SMALL LETTER VE}",
    "ă" => "\N{CYRILLIC SMALL LETTER GHE}",
    # Uppercase
    "A" => "\N{CYRILLIC CAPITAL LETTER A}",
    "Á" => "\N{CYRILLIC CAPITAL LETTER BE}",
    "Â" => "\N{CYRILLIC CAPITAL LETTER VE}",
    "Ă" => "\N{CYRILLIC CAPITAL LETTER GHE}",
);
# Open the input file
my $inFile;
until(open(OUTFILE, ">outFile.txt")) {
    print("\n$inFile could not be found.");
}
print("What file would you like to convert? \n");
$inFile = <stdin>;    #query user for input file
chomp $inFile;
until(open(inFile, "$inFile")) {
    print("\n$inFile could not be found.", " Please provide the absolute path. \n");
    $inFile = <stdin>;
}
while (<inFile>) {
    my $line = $_;                      # $_ is a line of text
    my @array = split ("", $line);      # $_ is now a character
    for (@array) {
        if (exists $name{$_})           # check the hash for $_
        {
            print OUTFILE $name{$_};    # print the Unicode value of $_
        }
        else
        {
            print OUTFILE "$_";         # preserves English
        }
    }
}
close OUTFILE;
print "\nYour converted text is in:\n", ">> outFile.txt.\n\n";`
Re: Re: Orthography Translation using Regex by graff (Chancellor) on Mar 01, 2004 at 09:15 UTC
A couple of minor points about this script (unrelated to the main theme of the thread):
First, @ARGV is your friend -- use it to get input and output file names from the command line. Here's one way to do it:
`my $Usage = "Usage: $0 infile outfile\n";
# open input and output files
die $Usage unless ( @ARGV == 2 );
open( IN, $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n$Usage";
open( OUT, ">$ARGV[1]" ) or die "Unable to write $ARGV[1]: $!\n$Usage";
...`
You have problems in both of your "until (open(...))" loops, which would be avoided if you used @ARGV (because you wouldn't need those loops at all). In your first "until" loop, if there ever really is a failure to open the output file, there's no exit from that loop -- not good. As for the second one (for getting an input file name), you forgot to "chomp" the user input that you read inside the loop, which means the loop will never succeed (unless a file name happens to contain a final newline character) -- also not good.
For that matter, you could do without open statements altogether -- just use `while (<>)` to read input (from a named file or from stdin), and just print to STDOUT. Let the users decide if/when to redirect these to or from a disk file (e.g. as opposed to piping data to/from other processes):
`converter.pl < some.input > some.output
# or
some_process | converter.pl | another_process
# or any combination of the above...`
As for the main "while()" loop, it can be expressed more compactly without loss of clarity:
`while (<IN>) {
    my @chars = split //;
    for (@chars) {   # $_ now holds one char per iteration
        my $out = ( exists $name{$_} ) ? $name{$_} : $_;
        print OUT $out;
    }
}`
Finally, you may want to look at "perldoc enc2xs", which gives a nice clear explanation about how to roll your own encoding modules that can be used in combination with Encode (i.e. on a par with "iso-8859-1" or "koi8-r"), to convert back and forth between Unicode and your own particular non-Unicode character set. It's actually pretty simple, provided that your mapping involves just one character of output for each character of input (which is not true for the OP that started this thread, unfortunately).
If you're the same Anonymous Monk who posted the first reply to the script, I don't expect this will help with the problem you mentioned (only handling small files) -- maybe you need to start your own SoPW thread on that...
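For comparison, here is a small sketch of what conversion through Encode looks like with one of the stock encodings mentioned above, koi8-r. The sample bytes are invented for illustration. A custom encoding built with enc2xs would presumably be used the same way under its own name, but that is an assumption, not something shown in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two KOI8-R bytes (Cyrillic small a, Cyrillic small be) followed by ASCII.
my $bytes = "\xC1\xC2 abc";

my $text = decode('koi8-r', $bytes);   # bytes -> Perl character string
my $utf8 = encode('UTF-8', $text);     # characters -> UTF-8 bytes for output
```

The ASCII part passes through unchanged, which corresponds to the "preserves English" branch of the hash-based loop.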
Re: Re: Orthography Translation using Regex by Anonymous Monk on Mar 01, 2004 at 06:25 UTC
I just found that my script, which I am finishing just now, only handles short input files. I can't yet determine why.
Re: Orthography Translation using Regex by John M. Dlugosz (Monsignor) on Mar 01, 2004 at 23:25 UTC
Well, the hard part is that it's not one char to one char. The simplest (and portable) way would be to just have a list of all the translations and apply each one to the string. BUT, as others have pointed out, that might lead to problems if the output of one matches the input of another. So careful ordering of the individual replacements might fix that, and if there is a circular one somewhere, introduce a dummy code as an intermediate.
You can also try running the whole set at one position at a time, rather than running each translation over all positions. Without using fancy stuff like \G and setting the string's current scan position, you can use a dummy char. For example, use * but in real life use something that is not a legal char. Start by prepending * to the string. Then your chain of replacements will be something like "*A/" to "Á*", that is, each one moves the star past the replaced text when a match is found. The last one in the chain moves the star forward without changing the one character. You stop at the first replacement that works and start over, repeating until the * is at the end.
I would suggest, regardless, that you use numeric codes instead of visible chars in the wrong character set in the source file.
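The marker technique described above can be sketched roughly as follows. The rule list `@rules` is hypothetical (it is not the original poster's table), and the marker is assumed to be a control character that never occurs in the input or in any replacement.

```perl
use strict;
use warnings;

# Hypothetical rules, tried in order, so longer sources come first.
my @rules = (
    [ 'sh' => "\x{1E61}" ],   # s with dot above
    [ 'A/' => "\x{C1}"   ],   # A with acute
    [ 's'  => 'S'        ],   # placeholder target
);

my $marker = "\x01";                    # assumed absent from the data
my $str    = $marker . 'A/ sheas';      # marker starts at the front

until ( $str =~ /\Q$marker\E\z/ ) {     # stop once the marker reaches the end
    my $done = 0;
    for my $rule (@rules) {
        my ( $from, $to ) = @$rule;
        # Replace $from only directly after the marker, then move the
        # marker past the output so it is never scanned again.
        if ( $str =~ s/\Q$marker$from\E/$to$marker/ ) { $done = 1; last }
    }
    # No rule applied here: step the marker over one unchanged character.
    $str =~ s/\Q$marker\E(.)/$1$marker/s unless $done;
}
$str =~ s/\Q$marker\E\z//;              # drop the marker
```

Each pass moves the marker forward by at least one character, so the loop terminates as long as no rule has an empty source string.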

