PerlMonks  

Re: Unicode and Regexps: convert or am I missing something?

by thundergnat (Deacon)
on Jun 02, 2005 at 00:24 UTC ( [id://462710]=note )


in reply to Unicode and Regexps: convert or am I missing something?

Are you sure the text is UTF-16? If so, your best bet is probably to convert it to UTF-8. Perl 5.8 and above handle UTF-8 natively, and UTF-8 and UTF-16 encode exactly the same set of Unicode characters, so nothing is lost in the conversion. The Encode module makes this easy.

use Encode;
my $string;
Encode::from_to($string, 'utf-16', 'utf-8');
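As a rough illustration of what from_to does, the sketch below builds a small UTF-16 sample (the sample text is made up) and converts it in place. It assumes the input carries a byte order mark, which the plain 'UTF-16' name relies on to work out the byte order when decoding.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical sample; encode() with 'UTF-16' prepends a BOM and writes big-endian.
my $bytes = Encode::encode('UTF-16', "Clicks: 184");

# from_to() rewrites $bytes in place and returns the new length in octets.
my $length = Encode::from_to($bytes, 'UTF-16', 'UTF-8');

print "$bytes ($length octets)\n";
```

Note that the result is still a byte string (UTF-8 octets), not a string of decoded characters.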

In UTF-8, the ASCII digits 0-9 are encoded as the same single bytes as in ISO-8859-1, so the same regex should match either.
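A quick way to check this, as a minimal sketch: encode the same digits into both encodings and compare the raw bytes. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode;

my $digits = "0123456789";

# Encode the same characters into both encodings and compare the raw bytes.
my $as_utf8   = Encode::encode('UTF-8',      $digits);
my $as_latin1 = Encode::encode('iso-8859-1', $digits);

print $as_utf8 eq $as_latin1 ? "identical bytes\n" : "different bytes\n";

# A plain ASCII character class therefore matches the digits in either one.
my @utf8_runs   = $as_utf8   =~ /[0-9]+/g;
my @latin1_runs = $as_latin1 =~ /[0-9]+/g;
```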

If you want to be extra sure to find Unicode digits, you can use a named property such as \p{Digit}, which always matches with Unicode semantics.

my @digits = $string =~ /\p{Digit}+/g;

Just be aware that this regex will also match digits from other scripts (Arabic-Indic or Devanagari digits, for example) if there are any in the string.
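To make the difference concrete, here is a small sketch with a made-up string that mixes ASCII digits with two Arabic-Indic digits. If only 0-9 should count, an explicit [0-9] class is one way to restrict the match.

```perl
use strict;
use warnings;

# U+0663 and U+0664 are ARABIC-INDIC DIGIT THREE and FOUR (made-up sample).
my $string = "12 \x{0663}\x{0664} 56";

my @any_digits   = $string =~ /\p{Digit}+/g;   # three runs, Arabic-Indic included
my @ascii_digits = $string =~ /[0-9]+/g;       # two runs: "12" and "56"

printf "%d vs %d\n", scalar @any_digits, scalar @ascii_digits;
```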

Replies are listed 'Best First'.
Re^2: Unicode and Regexps: convert or am I missing something?
by newrisedesigns (Curate) on Jun 02, 2005 at 01:02 UTC

    Thanks for your reply.

    According to the header, the returned data is UTF-16LE, which I assume stands for little-endian. I am on a Mac, so I guess I'm big-endian, which would explain why I was getting Asian glyphs instead of my Adsense results.

    I tried the Encode method you suggested, also with the variation 'LE' after the 16 (why not, I've tried everything else, it seems) but it didn't work. The \p{Digit} does match, but fails when used in conjunction with the date field separator (/) like so: \p{Digit}\\/.

    I guess the problem comes down to endian-ness of the data returned. How do I flip flop the data so that the methods available to me (Encode:: and /usr/bin/iconv) will work for me?

      UTF-16LE is supported by the Encode module, so it should work... Did you try down-converting it to Latin-1? The less often used encodings don't have as many aliases, so you may need to be more careful about how the encoding is specified.

      Encode::from_to($string, 'UTF-16LE', 'utf8');

      should be ok, as should

      Encode::from_to($string, 'UTF-16LE', 'iso-8859-1');
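A rough sketch of why the byte order matters, using a made-up four-character sample in place of the real response: read with the wrong byte order, the bytes land in East Asian blocks, which fits the glyphs described above. With the byte order named explicitly, the conversion does not depend on the endianness of the machine running Perl.

```perl
use strict;
use warnings;
use Encode;

# Simulated response bytes: "5/18" stored as UTF-16LE (hypothetical sample).
my $bytes = Encode::encode('UTF-16LE', "5/18");

# Reading the same bytes with the wrong byte order gives unrelated code points.
my $wrong = Encode::decode('UTF-16BE', $bytes);
printf "U+%04X ", ord($_) for split //, $wrong;
print "\n";

# Naming the byte order explicitly works the same on any host architecture.
my $copy = $bytes;
Encode::from_to($copy, 'UTF-16LE', 'iso-8859-1');
print "$copy\n";
```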

      You only need a single backslash to escape the forward slash in the regex. (Or use alternate delimiters.)

      my $string = '5/18/05 184 7 3.8% 6.14 1.13';
      if ($string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#) {
          print $1;
      }

      use Encode;
      my $string = Encode::decode('UTF-16LE', $data_from_google);
      $string =~ /what you want/;

      from_to is the wrong function to use here. It converts one byte string into another, but regexps only work correctly on character strings, so you need to use decode.
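A minimal sketch of the difference, with a made-up report line standing in for $data_from_google. On the raw UTF-16LE bytes every character is followed by a NUL byte, so a digit is never directly followed by a slash; after decode the same pattern matches.

```perl
use strict;
use warnings;
use Encode;

# Stand-in for $data_from_google: one made-up report line as UTF-16LE bytes.
my $data = Encode::encode('UTF-16LE', "5/18/05 184 7 3.8% 6.14 1.13");

# On the raw bytes a NUL sits between the digit and the slash, so this fails.
my $raw_hit = $data =~ m#\p{Digit}/\p{Digit}# ? 1 : 0;

# decode() returns a character string, and the date pattern now matches.
my $string = Encode::decode('UTF-16LE', $data);
my ($date) = $string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#;

print "raw match: $raw_hit, decoded date: $date\n";
```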

      -- 
              dakkar - Mobilis in mobile
      

      Most of my code is tested...

      Perl is strongly typed, it just has very few types (Dan)
