Re: Unicode and Regexps: convert or am I missing something?

Are you sure the text is UTF-16? If so, your best bet is probably to convert it to UTF-8. Perl 5.8 and above handle UTF-8 natively, and UTF-8 and UTF-16 encode exactly the same set of characters, so nothing is lost in the conversion. The Encode module makes this easy.

use Encode;

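my $string;    # the raw UTF-16 bytes go here

# Converts in place: $string holds UTF-8 encoded bytes afterwards.
# Note that plain 'utf-16' expects a BOM at the start of the data.
Encode::from_to($string, 'utf-16', 'utf-8');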

The ASCII digits 0-9 are encoded as the same bytes in UTF-8 and ISO-8859-1, so the same regex should match either.

If you want to be extra sure to find Unicode digits, you can use a named property such as \p{Digit}, which makes the match use Unicode semantics automatically.

my @digits = $string =~ /\p{Digit}+/g;

Just be aware that this regex will also match digits from other character blocks (Arabic-Indic or Devanagari digits, for example) if the string contains any.
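
To illustrate that caveat, here is a minimal sketch comparing \p{Digit} with the ASCII-only class [0-9]. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample: ASCII "2005" followed by two Arabic-Indic digits.
my $sample = "2005 \x{0664}\x{0662}";

# \p{Digit} matches any Unicode decimal digit, whatever the script.
my @unicode_runs = $sample =~ /\p{Digit}+/g;

# [0-9] only ever matches the ten ASCII digits.
my @ascii_runs = $sample =~ /[0-9]+/g;

printf "Unicode digit runs: %d\n", scalar @unicode_runs;   # both runs
printf "ASCII digit runs:   %d\n", scalar @ascii_runs;     # only "2005"
```

If you only ever want 0-9, the explicit class is the safer choice.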

Replies are listed 'Best First'.
Re^2: Unicode and Regexps: convert or am I missing something? by newrisedesigns (Curate) on Jun 02, 2005 at 01:02 UTC
Thanks for your reply. According to the header, the returned data is UTF-16LE, which I assume stands for little-endian. I am on a Mac, so I guess I'm big-endian, which would explain why I was getting Asian glyphs instead of my AdSense results. I tried the Encode method you suggested, including the variation with 'LE' after the 16 (why not, I've tried everything else, it seems), but it didn't work. The \p{Digit} does match, but it fails when used in conjunction with the date field separator (/), like so: `\p{Digit}\\/`. I guess the problem comes down to the endianness of the data returned. How do I flip the data so that the methods available to me (Encode:: and /usr/bin/iconv) will work?
Re^3: Unicode and Regexps: convert or am I missing something? by thundergnat (Deacon) on Jun 02, 2005 at 01:29 UTC
UTF-16LE is supported by the Encode module, so it should work... Did you try down-converting it to Latin-1? The less commonly used encodings don't have as many aliases, so you may need to be more careful about how the encoding is specified.

Encode::from_to($string, 'UTF-16LE', 'utf8');

should be OK, as should

Encode::from_to($string, 'UTF-16LE', 'iso-8859-1');

You only need to escape the forward slash once in the regex. (Or use alternate delimiters.)

my $string = '5/18/05 184 7 3.8% 6.14 1.13';
if ($string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#) {
    print $1;
}
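
On the point about specifying the encoding carefully: the byte order belongs to the encoding label, not to the machine, so nothing has to be flipped by hand. The rough sketch below uses a made-up sample and a hypothetical helper, codepoints(), to show what happens when UTF-16LE bytes are decoded with the wrong byte order.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: list the code points of a character string.
sub codepoints {
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $_[0];
}

# Made-up sample standing in for the returned data: "5/18" as UTF-16LE bytes.
my $bytes = encode('UTF-16LE', '5/18');

# Decoding with the matching label recovers the original characters.
my $right = decode('UTF-16LE', $bytes);

# Decoding with the wrong byte order swaps each byte pair, so '5' (0x0035)
# becomes U+3500, which lies in a CJK block.
my $wrong = decode('UTF-16BE', $bytes);

print "as LE: ", codepoints($right), "\n";   # U+0035 U+002F U+0031 U+0038
print "as BE: ", codepoints($wrong), "\n";   # U+3500 U+2F00 U+3100 U+3800
```

That would account for Asian glyphs showing up where digits and slashes were expected.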
Re^3: Unicode and Regexps: convert or am I missing something? by dakkar (Hermit) on Jun 02, 2005 at 10:13 UTC
use Encode;
my $string = Encode::decode('UTF-16LE', $data_from_google);
$string =~ /what you want/;

`from_to` is the wrong function to use. It converts between byte strings, but to work correctly with regexps you need character strings, so you need to use `decode`.

-- dakkar - Mobilis in mobile
Most of my code is tested...
Perl is strongly typed, it just has very few types (Dan)
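
A minimal end-to-end sketch of the decode approach, assuming the data really is UTF-16LE without a BOM. The sample line is borrowed from the reply above, and $data is a hypothetical stand-in for the fetched bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the fetched data: one report line as UTF-16LE bytes.
my $data = encode('UTF-16LE', '5/18/05 184 7 3.8% 6.14 1.13');

my $date_re = qr#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#;

# On the raw bytes, a NUL byte sits between each digit and the slash,
# so the date pattern cannot match.
my $raw_matches = ($data =~ $date_re) ? 1 : 0;

# decode() turns the bytes into a character string that regexes can work on.
my $string = decode('UTF-16LE', $data);
my ($date) = $string =~ $date_re;

print "raw bytes match: $raw_matches\n";
print "date: $date\n";
```

The raw-bytes case is presumably why \p{Digit} matched on its own but failed as soon as the slash was added.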

