PerlMonks  

Orthography Translation using Regex

by Baz (Friar)
on Feb 29, 2004 at 20:18 UTC ( [id://332690]=perlquestion )

Baz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I wish to convert strings from New Irish Gaelic Orthography to the older format. First of all, I wish to convert vowels (AEIOU) followed by a forward slash, as follows -

A/ -> Á
a/ -> á

etc.

Second, I need to convert lowercase s and r, as follows -
r -> ‰
s -> Š

And finally, the following -

Bh -> ¡
bh -> ¢
Ch -> ¤
ch -> ¥
Dh -> ¦
dh -> «
Fh -> °
fh -> ±
Gh -> ²
gh -> ³
Mh -> ´
mh -> µ
Ph -> ∙
ph -> ¹
Sh -> »
sh -> š
Th -> ×
th -> ÷

Don't worry about the funny characters above; I'll be working with a different font and charset.

As an example, the following should convert as -

Barra O/ Gri/obhtha
Ba‰‰a Ó G‰ío¢÷a
I'll also need to implement this in JavaScript; luckily it supports regex too.
Anyway, how might I go about doing this with regex replacements?

Cheers,

Barry.

Replies are listed 'Best First'.
Re: Orthography Translation using Regex
by Roger (Parson) on Feb 29, 2004 at 20:40 UTC
    my $translation_table = {
        'A' => 'A',    # translate from => to
        'a' => 'a',
        'r' => 'r',
        # etc
    };

    my $str = "Barra O Gri obhtha";

    # build search pattern as 'pattern1|pattern2|...'
    my $patterns = join '|', keys %$translation_table;
    $str =~ s/($patterns)/$translation_table->{$1}/ge;
      When doing join '|' for a regex, always sort (descending) by string length, to ensure you match e.g. 'sh' before 's'.

      And I suspect you want a map quotemeta in there as well. (Update: quotemeta isn't needed here; I misremembered the question as having special characters.)

        I am just giving an idea on how this could be done, not on the complete correctness of the regex though. :-)
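Putting the suggestions in this sub-thread together, a minimal sketch might look like the following. It assumes the source file is saved as UTF-8 and uses only part of the table from the question; the name to_old_orthography is made up for illustration, and the placeholder characters stand in for whatever the target font really uses.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal non-ASCII characters

# Partial table copied from the question; the remaining pairs go in the same way.
my %table = (
    'A/' => 'Á', 'a/' => 'á', 'O/' => 'Ó', 'i/' => 'í',
    'r'  => '‰', 's'  => 'Š',
    'bh' => '¢', 'sh' => 'š', 'th' => '÷',
);

# Longest keys first, so 'sh' is tried before 's'. quotemeta is harmless
# here and protects the pattern if a key ever contains a metacharacter.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %table;

# Hypothetical helper: one pass over the string, one hash lookup per match.
sub to_old_orthography {
    my ($text) = @_;
    $text =~ s/($alternation)/$table{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_old_orthography('Barra O/ Gri/obhtha'), "\n";
```

Because the whole string is scanned once, text that has already been replaced is never looked at again, so the output of one rule cannot be picked up by another.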

Re: Orthography Translation using Regex
by matija (Priest) on Feb 29, 2004 at 20:46 UTC
    The simplest way to do it would be to just put in a replace expression for each replacement you had to do, like this:
    $string =~ s/r/‰/g;

    You need to be careful to replace the longer source strings first: I see you have a conflict between s and sh - you need to convert sh first, otherwise the s rule turns sh into Šh, and that's not what you want.

    Note that you could pack all the original strings and their translations into a hash, and then use a loop to translate all of them. I'm not sure that would buy you much, readability-wise - except that it could sort by the length of original string for you.

    sub bylen { length($b) <=> length($a) }    # sorts by decreasing length of string
    foreach my $key (sort bylen keys %trans) {
        $string =~ s/\Q$key\E/$trans{$key}/g;   # apply each rule to the text, longest key first
    }
    and the definitions would look like this:
    $trans{Mh}='´';
    $trans{gh}='³';
    etc....
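The ordering pitfall described in this reply is easy to reproduce. The sketch below runs the same two substitutions in both orders on a sample string; it assumes the file is saved as UTF-8 and reuses the placeholder characters from the question.

```perl
use strict;
use warnings;
use utf8;    # literal non-ASCII characters below

my $wrong = 'sheas';
$wrong =~ s/s/Š/g;     # 's' first: this also consumes the 's' of 'sh'
$wrong =~ s/sh/š/g;    # too late, no 'sh' is left to match

my $right = 'sheas';
$right =~ s/sh/š/g;    # digraph first
$right =~ s/s/Š/g;     # then the remaining single 's'

binmode STDOUT, ':encoding(UTF-8)';
print "$wrong\n$right\n";
```

In the first ordering the h survives next to the converted s; in the second the digraph is handled as a unit and only the final s is converted on its own.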
Re: Orthography Translation using Regex
by Anonymous Monk on Mar 01, 2004 at 06:10 UTC
    I needed to do a similar task for pre-Unicode Mongolian to Unicode. The hash provided is only a sample. The full script is at http://students.washington.edu/blanch/downloads/encodingConverter.pl You can drop your hash in and it should run with little modification if any.
    #!/usr/local/bin/perl
    # encodingConverter.pl
    # Duane L. Blanchard
    # http://students.washington.edu/blanch/downloads/
    # blanch@iname.com

    use strict;
    use warnings;
    #use utf8;
    use charnames ':full';

    #hash tables for each encoding must be at the top
    #Hash table Keys: Cyrillic Chars, Values: Unicode Char Names
    my %name = (
        # Lowercase
        "à" => "\N{CYRILLIC SMALL LETTER A}",
        "á" => "\N{CYRILLIC SMALL LETTER BE}",
        "â" => "\N{CYRILLIC SMALL LETTER VE}",
        "ã" => "\N{CYRILLIC SMALL LETTER GHE}",
        # Uppercase
        "A" => "\N{CYRILLIC CAPITAL LETTER A}",
        "Á" => "\N{CYRILLIC CAPITAL LETTER BE}",
        "Â" => "\N{CYRILLIC CAPITAL LETTER VE}",
        "Ã" => "\N{CYRILLIC CAPITAL LETTER GHE}",
    );

    # Open the input file
    my $inFile;
    until(open(OUTFILE, ">outFile.txt"))
    {
        print("\n$inFile could not be found.");
    }
    print("What file would you like to convert? \n");
    $inFile = <stdin>;    #query user for input file
    chomp $inFile;
    until(open(inFile, "$inFile"))
    {
        print("\n$inFile could not be found.",
              " Please provide the absolute path. \n");
        $inFile = <stdin>;
    }

    while (<inFile>)
    {
        my $line = $_;                    # $_ is a line of text
        my @array = split ("", $line);    # $_ is now a character
        for (@array)
        {
            if (exists $name{$_})         # check the hash for $_
            {
                print OUTFILE $name{$_};  # print the Unicode value of $_
            }
            else
            {
                print OUTFILE "$_";       # preserves English
            }
        }
    }
    close OUTFILE;
    print "\nYour converted text is in:\n", ">> outFile.txt.\n\n";
      A couple minor points about this script (unrelated to the main theme of the thread):

      First, @ARGV is your friend -- use it to get input and output file names from the command line. Here's one way to do it:

      my $Usage = "Usage: $0 infile outfile\n";

      # open input and output files
      die $Usage unless ( @ARGV == 2 );
      open( IN, $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n$Usage";
      open( OUT, ">$ARGV[1]" ) or die "Unable to write $ARGV[1]: $!\n$Usage";
      ...
      You have problems in both of your "until (open(...))" loops, which would be avoided if you use @ARGV (because you don't need those loops at all). In your first "until" loop, if there ever really is a failure to open the output file, there's no exit from that loop -- not good. As for the second one (for getting an input file name), you forgot to "chomp" the user input that you read inside the loop, which means the loop will never succeed (unless a file name happens to contain a final newline character) -- also not good.

      For that matter, you could do without open statements altogether -- just use while (<>) to read input (from a named file or from stdin), and just print to STDOUT. Let the users decide if/when to redirect these to or from a disk file (e.g. as opposed to piping data to/from other processes):

      converter.pl < some.input > some.output
      # or
      some_process | converter.pl | another_process
      # or any combination of the above...

      As for the main "while()" loop, it can be expressed more compactly without loss of clarity:

      while (<IN>) {
          my @chars = split //;
          for (@chars) {    # $_ now holds one char per iteration
              my $out = ( exists $name{$_} ) ? $name{$_} : $_;
              print $out;
          }
      }
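As a rough, self-contained sketch of the per-character lookup, the conversion can be wrapped in a small sub and tried on an in-memory filehandle. The sub name convert_line and the two byte values in the table are assumptions for illustration only (0xE0 and 0xE1 are the Latin-1 codes of "à" and "á"); the real legacy encoding may differ.

```perl
use strict;
use warnings;
use charnames ':full';

# Sample table only: legacy byte => Unicode character.
# The byte values are illustrative stand-ins, not a verified encoding.
my %name = (
    "\xE0" => "\N{CYRILLIC SMALL LETTER A}",
    "\xE1" => "\N{CYRILLIC SMALL LETTER BE}",
);

# Convert one line, passing unknown characters through unchanged.
sub convert_line {
    my ($line) = @_;
    return join '', map { exists $name{$_} ? $name{$_} : $_ } split //, $line;
}

# Demo on an in-memory handle, so the example needs no files on disk.
my $sample = "\xE0\xE1 ok\n";
open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
my @converted;
while ( my $line = <$in> ) {
    push @converted, convert_line($line);
}
close $in;
```

In a real filter the loop would shrink to something like `print convert_line($_) while <>;`, with a suitable encoding layer set on STDOUT so the Unicode characters are written out as UTF-8.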

      Finally, you may want to look at "perldoc enc2xs", which gives a nice clear explanation about how to roll your own encoding modules that can be used in combination with Encode (i.e. on a par with "iso-8859-1" or "koi8-r"), to convert back and forth between Unicode and your own particular non-Unicode character set. It's actually pretty simple, provided that your mapping involves just one character of output for each character of input (which is not true for the OP that started this thread, unfortunately).

      If you're the same Anonymous Monk who posted the first reply to the script, I don't expect this will help with the problem you mentioned (only handling small files) -- maybe you need to start your own SoPW thread on that...

      I just found that my script, which I am finishing just now, only handles short input files. I can't determine yet why.
Re: Orthography Translation using Regex
by John M. Dlugosz (Monsignor) on Mar 01, 2004 at 23:25 UTC
    Well, the hard part is that it's not one char to one char. The simplest (and portable) way would be to just have a list of all the translations and do each one to the string. BUT, as others have pointed out, that might lead to problems if the output of one matches the input of another. So careful ordering of the individual replacements might fix that, and if there is a circular one somewhere then introduce a dummy code as an intermediate.

    You can also try running the whole set at one position at a time, rather than running each translation over all positions. Without using fancy stuff like \G and setting the string's current scan position, you can use a dummy char. For example, use * but in real life use something that is not a legal char. Start by prepending * to the string. Then your chain of replacements will be something like "*A/" to "Á*", that is, it moves the star to the next position when one is found. The last one moves it without changing the one character, and you stop when you find one that works and start over, repeating until the * is at the end.
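One possible reading of the dummy-character technique, as a sketch: the marker is a control character assumed never to occur in the input, the sub name translate_with_marker is made up, and the rule list is only part of the table from the question. It rescans the string on every step, so it is meant to show the idea rather than to be fast.

```perl
use strict;
use warnings;
use utf8;    # literal non-ASCII characters below

my $MARK = "\x01";    # assumption: this control character never occurs in the input

# Apply the first matching rule at the marker, or step the marker over one
# character; repeat until the marker reaches the end of the string.
sub translate_with_marker {
    my ( $text, $rules ) = @_;    # $rules: array ref of [from, to] pairs
    $text = $MARK . $text;
    POSITION:
    while ( $text !~ /\Q$MARK\E\z/ ) {
        for my $rule (@$rules) {
            my ( $from, $to ) = @$rule;
            next POSITION if $text =~ s/\Q$MARK$from\E/$to$MARK/;
        }
        $text =~ s/\Q$MARK\E(.)/$1$MARK/s;    # no rule applied: skip one character
    }
    $text =~ s/\Q$MARK\E\z//;                 # drop the marker again
    return $text;
}

# Longer 'from' strings are listed first so 'sh' is tried before 's'.
my @rules = (
    [ 'sh', 'š' ], [ 'bh', '¢' ], [ 'th', '÷' ],
    [ 'O/', 'Ó' ], [ 'i/', 'í' ],
    [ 'r',  '‰' ], [ 's',  'Š' ],
);
my $old     = translate_with_marker( 'Barra O/ Gri/obhtha', \@rules );
my $swapped = translate_with_marker( 'abba', [ [ 'a', 'b' ], [ 'b', 'a' ] ] );
```

The second call shows why this helps with circular rules: swapping a and b works directly, because text to the left of the marker is never matched again.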

    I would suggest, regardless, that you use numeric codes instead of visible chars in the wrong character set in the source file.
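A small sketch of that suggestion: the table is written with numeric escapes, so the source file stays pure ASCII no matter which font or charset is used for display. The codes below are simply the Latin-1 positions of the placeholder characters shown in the question (for example 0xA2 for the character shown next to bh); the real target font may assign different codes.

```perl
use strict;
use warnings;

# Numeric codes instead of literal characters. These values are assumptions
# taken from the Latin-1 positions of the placeholders in the question.
my %trans = (
    'Bh' => "\xA1", 'bh' => "\xA2",
    'Ch' => "\xA4", 'ch' => "\xA5",
    'Th' => "\xD7", 'th' => "\xF7",
    'A/' => "\xC1", 'a/' => "\xE1",
);

my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %trans;

my $text = 'Bheith';
( my $out = $text ) =~ s/($alternation)/$trans{$1}/g;

# Show the result as hex codes so it can be checked without the target font.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $out;
print "$hex\n";    # A1 65 69 F7
```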
