Hi there! (running perl 5.8.7)

I am going through the tedious work of making a script Unicode- and UTF-8-aware. I finally understood the difference between Unicode and UTF-8, and concluded that, to make a script truly multi-language aware, all regexes etc. have to be run on strings in Perl's "internal format". How wrong I was!

This is my procedure pipeline:
1. read a string from variously encoded sources --> decode it properly to get "perl's internal format"
2. do various things with the textual data
3. re-encode it to utf8 (effectively a transport/storage format) and write it to disk (in binmode).
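
For reference, the three steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual script: the Latin-1 sample bytes, the variable names and the in-memory filehandle standing in for the file on disk are all assumptions made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "Grüße\n" as a Latin-1 source would deliver them.
my $raw_bytes = "Gr\xFC\xDFe\n";

# Step 1: decode the bytes into a character string ("internal format").
my $text = decode('ISO-8859-1', $raw_bytes);

# Step 2: work on characters, here simply removing line feeds.
$text =~ s/\n//g;
my $char_count = length $text;          # counts characters

# Step 3: encode to UTF-8 bytes for transport/storage.
my $utf8_bytes = encode('UTF-8', $text);
my $byte_count = length $utf8_bytes;    # counts bytes

# An in-memory filehandle stands in for the real file (assumption for the sketch).
open my $out, '>', \my $stored or die "cannot open in-memory handle: $!";
binmode $out;
print {$out} $utf8_bytes;
close $out;
```

In this sketch the character count and the byte count differ, because ü and ß each take two bytes in UTF-8.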

But then, surprise surprise on step 2!
I had the following regex:

$internal_format_string =~ s/\n//g;

and it removed some letters, spaces and a lot more! Then my thought was it has to do with the string being in "internal format". So I tried:

require Encode;
my $string_in_utf8 = Encode::encode_utf8($internal_format_string);
$string_in_utf8 =~ s/\n//g;

and it worked again! So it seems Perl requires my string to be UTF-8 encoded, at least to recognize the special \n newline character. But doesn't working on encoded bytes prevent me from properly handling the broad range of Unicode characters in other regexes, beyond just removing \n? So I tried to get back to full Unicode processing in my regexes:

$internal_format_string =~ s/\x{0A}//g;

This failed as well. It might be because I am using the wrong syntax for the hex escape, or perhaps the string needs a Unicode escape rather than a hex one? \u{000A} failed too.
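
To narrow this down, a small self-contained check like the one below might help. It is a sketch under assumptions: the sample bytes and variable names are invented, and it presumes the decode step really produced the characters I expect. It lists the code points held by the decoded string and applies both forms of the substitution to a copy.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the UTF-8 bytes for "naïve\ncafé\n", decoded into characters.
my $internal = decode('UTF-8', "na\xC3\xAFve\ncaf\xC3\xA9\n");

# List the code points the string actually contains, one entry per character.
my @code_points = map { sprintf('U+%04X', ord $_) } split //, $internal;

# Apply both spellings of the line-feed removal to separate copies.
(my $via_newline = $internal) =~ s/\n//g;
(my $via_hex     = $internal) =~ s/\x{0A}//g;

# Compare the two results and keep the remaining length in characters.
my $same_result = $via_newline eq $via_hex ? 1 : 0;
my $remaining   = length $via_newline;
```

If the code point list for the real data shows something other than the expected characters, the problem would lie in the decode step rather than in the regex.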

So what should I do?
Should I use regexes on scalars containing decoded Unicode ("internal format") data, or on scalars containing UTF-8 encoded data?
Should my "script-internal standard" be decoded Unicode or UTF-8 encoded Unicode?
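
To make the two options concrete, here is a minimal comparison of the same text held once as decoded characters and once as UTF-8 encoded bytes. It is only an illustration with a made-up one-character sample, not a claim about what my script does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same text in both forms: "é" (U+00E9).
my $chars = decode('UTF-8', "\xC3\xA9");   # decoded: one character
my $bytes = encode('UTF-8', $chars);       # encoded: two bytes, 0xC3 0xA9

my $chars_len = length $chars;
my $bytes_len = length $bytes;

# Does the whole string match exactly one "."?
my $chars_single = $chars =~ /\A.\z/ ? 1 : 0;
my $bytes_single = $bytes =~ /\A.\z/ ? 1 : 0;
```

On the decoded string the regex sees one character, while on the encoded string it sees two separate bytes, so a single "." no longer covers the whole letter.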

(To make it all worse, the perlfaq says that the "internal format" is UTF-8 encoded Unicode, but that I should forget about that. Now SHOULD I?)

In reply to The unicode / utf8 struggle, part 2: regexes by isync
