![]() |
|
Just another Perl shrine | |
PerlMonks |
How can I match strings with multibyte characters?by faq_monk (Initiate) |
on Oct 08, 1999 at 00:25 UTC ( [id://677]=perlfaq nodetype: print w/replies, xml ) | Need Help?? |
Current Perl documentation can be found at perldoc.perl.org. Here is our local, out-dated (pre-5.6) version: This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter. Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII. So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character Here are a few ways, all painful, to deal with it:
$martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes # are no longer adjacent. print "found GX!\n" if $martian =~ /GX/; Or like this:
@chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g; # above is conceptually similar to: @chars = $text =~ m/(.)/g; # foreach $char (@chars) { print "found GX!\n", last if $char eq 'GX'; } Or like this:
while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded print "found GX!\n", last if $1 eq 'GX'; } Or like this:
die "sorry, Perl doesn't (yet) have Martian support )-:\n"; In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or EUC encoding) is available from CPAN as There are many double- (and multi-) byte encodings commonly used these days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.
|
|