
Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?

by ikegami (Patriarch)
on Nov 12, 2010 at 16:45 UTC


in reply to Unicode: Perl5 equivalent to Perl6's @string.graphemes?

The one thing I did that produced different results was to change the open statement to use '<:encoding(UTF-8)' for the mode instead of '<', but that transforms everything (not just the chars, but also the words) into stuff like "\x{ff11}", which does not seem useful to me.

That is the correct fix. Dumper produces Perl code, primarily for debugging purposes. When it comes to characters where encoding is likely to matter, it uses escapes to avoid mixups. As a debugging tool, it would rather produce somewhat harder-to-read output than output that looks wrong because the caller didn't properly encode it.
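To illustrate the point about escapes, here is a minimal sketch (the sample string and variable names are made up for illustration) showing that Dumper's output is plain ASCII Perl source that evaluates back to the original decoded string:

```perl
use strict;
use warnings;
use Data::Dumper qw( Dumper );

# Drop the "$VAR1 = " prefix and the indentation so the dump is one expression.
local $Data::Dumper::Terse  = 1;
local $Data::Dumper::Indent = 0;

my $str  = "\x{6301}\x{3063}\x{3066}";   # three decoded characters
my $dump = Dumper($str);                  # Perl source text describing $str

# Evaluating the dumped source rebuilds the same three-character string.
my $back = eval $dump;
die $@ if $@;

print "dump is ASCII-only: ", ($dump =~ /[^\x00-\x7f]/ ? "no" : "yes"), "\n";
print "round trip matches: ", ($back eq $str ? "yes" : "no"), "\n";
```

So the escapes are only how Dumper spells the characters; the data itself is unchanged.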

If you hadn't used Dumper (just printed the string) and if you encoded your output (use open ':std', ':locale';), then you would get the actual characters.
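A small sketch of what the encoding layers do in each direction, using in-memory filehandles instead of a real file or terminal (that substitution, and the variable names, are assumptions made so the example is self-contained):

```perl
use strict;
use warnings;

my $text = "\x{6301}\x{3063}\x{3066}";   # three decoded characters

# Encode on output: the layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \my $buf) or die $!;
print {$out} $text;
close($out) or die $!;

# Decode on input: the layer turns UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$buf) or die $!;
my $decoded = do { local $/; <$in> };
close($in);

printf "characters: %d, encoded bytes: %d, decoded characters: %d\n",
    length($text), length($buf), length($decoded);
```

The same idea applies to STDOUT: once it has an encoding layer, printing the string yields the actual characters rather than raw wide-character output.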

(Each of the graphemes you posted was represented by a single character, so I didn't bother using \P{M}.)

use strict;
use warnings;
use open ':std', ':locale';
use Data::Dumper qw( Dumper );

my $file = do {
    open(my $fh, '<:encoding(UTF-8)', 'jap') or die $!;
    local $/;
    <$fh>
};

print(Dumper($file));
print("[$_]") for $file =~ /(.)/sg;
print("\n");

$VAR1 = "\x{6301}\x{3063}\x{3066}\x{884c}\x{304f}
";
[持][っ][て][行][く][
]
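For input where a grapheme does span several code points, matching on `.` is not enough. The following is a sketch with a made-up sample string, not part of the original test data; it compares `.` with `\X` (extended grapheme cluster) and with the `\P{M}\p{M}*` approximation alluded to above:

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
my $str = "e\x{301}x";

my @code_points = $str =~ /(.)/sg;            # one element per code point
my @graphemes   = $str =~ /(\X)/g;            # one element per grapheme cluster
my @approx      = $str =~ /(\P{M}\p{M}*)/g;   # base character plus its marks

printf "code points: %d, graphemes: %d, approximation: %d\n",
    scalar(@code_points), scalar(@graphemes), scalar(@approx);
```

Here the base letter and its combining accent stay together under `\X` and under the `\P{M}\p{M}*` pattern, while `.` splits them apart.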

Replies are listed 'Best First'.
Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by jonadab (Parson) on Nov 12, 2010 at 23:29 UTC
    Dumper ... uses escapes to avoid mixups.

    Ah, that's what I was misunderstanding. I saw that stuff and thought the encoding handling was doing it and that that's what my data were actually looking like, which would be bad. If that's just Dumper's way of escaping non-ASCII characters, I can deal with that. Thanks a million. I thought I was going insane.

Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 12, 2010 at 23:37 UTC

    Can you please explain why you used use open ':std', ':locale' instead of, say, use open qw( :encoding(UTF-8) :std )? What do :std and :locale together do?

    (Learning how to process Unicode text using Perl: One step forward, two steps back. Every time I think I've learned something, I haven't.)

      What do :std and :locale together do?

      The same as :std and :encoding together, just without having to specify the encoding.

      Can you please explain why you used use open ':std', ':locale'

      Because I don't know the encoding of the terminals in which your program will run. I don't even know that they all use the same encoding.
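To see what `:std` adds, here is a minimal sketch that uses the explicit-encoding form from the question (chosen here only so the result does not depend on the locale of whoever runs it) and lists the layers that end up on STDOUT. The exact layer names printed may vary between Perl versions:

```perl
use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );

# PerlIO::get_layers is built in; it reports the layers on a handle.
my @layers = PerlIO::get_layers(STDOUT);

# With :std, the standard handles get the encoding layer too, not just
# handles opened later in this lexical scope.
my $has_encoding = grep { /^encoding/ } @layers;

print "STDOUT layers: @layers\n";
print "encoding layer present: ", ($has_encoding ? "yes" : "no"), "\n";
```

With `:locale` in place of `:encoding(UTF-8)`, the layer is chosen from the environment at run time instead of being hard-coded.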
