Current Perl documentation can be found at perldoc.perl.org.
Here is our local, out-dated (pre-5.6) version:
This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter.
Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.
So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character /GX/
. Perl doesn't know about Martian, so it'll find the two bytes
``GX'' in the
``I am
CVSGXX!'' string, even though that character isn't there: it just looks like it is because
``SG'' is next to
``XX'', but there's no real
``GX''. This is a big problem.
Here are a few ways, all painful, to deal with it:
$martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes # are no longer adjacent. print "found GX!\n" if $martian =~ /GX/;
Or like this:
@chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g; # above is conceptually similar to: @chars = $text =~ m/(.)/g; # foreach $char (@chars) { print "found GX!\n", last if $char eq 'GX'; }
Or like this:
while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded print "found GX!\n", last if $1 eq 'GX'; }
Or like this:
die "sorry, Perl doesn't (yet) have Martian support )-:\n";
In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or EUC encoding) is available from CPAN as
There are many double- (and multi-) byte encodings commonly used these days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.