http://qs321.pair.com?node_id=615968

isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi there! (running perl 5.8.7)

I am going through the tedious work of making a script Unicode- and UTF-8-aware. I finally understood the difference between Unicode and UTF-8, and I assumed that to make a script truly multi-language aware, all regexes etc. have to run on strings in Perl's "internal format". It seems I was wrong!

This is my procedure pipeline:
1. read a string from variously encoded sources --> decode it properly to get "perl's internal format"
2. do various things with the textual data
3. re-encode it to utf8 (effectively a transport/storage format) and write it to disk (in binmode).
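
A minimal sketch of the three steps above, to make the intended pipeline concrete. The sample bytes, the assumed source encoding (ISO-8859-1) and the file name out.txt are illustrative assumptions, not taken from the real script:

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical input: the bytes of "Grüße\n" as delivered by a Latin-1 source.
my $raw_bytes = "Gr\xFC\xDFe\n";

# Step 1: decode the bytes into a character string (Perl's "internal format").
my $text = decode('iso-8859-1', $raw_bytes);

# Step 2: work on characters, here by stripping newlines.
$text =~ s/\n//g;

# Step 3: encode the characters as UTF-8 bytes for storage.
my $utf8_bytes = encode_utf8($text);

# Write the encoded bytes untouched (out.txt is a placeholder name).
open my $out, '>', 'out.txt' or die "Cannot open out.txt: $!";
binmode $out;
print {$out} $utf8_bytes;
close $out or die "Cannot close out.txt: $!";

# Character count versus byte count: ü and ß each take two bytes in UTF-8.
print length($text), " characters, ", length($utf8_bytes), " bytes\n";
```

The point of the sketch is that length() counts characters on the decoded string and bytes on the encoded one, so the two values differ as soon as the text contains non-ASCII characters.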

But then, surprise surprise on step 2!
I had the following regex:
$internal_format_string =~ s/\n//g;
and it removed some letters, spaces and a lot more! My first thought was that this has to do with the string being in "internal format". So I tried:
require Encode; my $string_in_utf8 = Encode::encode_utf8($internal_format_string); $string_in_utf8 =~ s/\n//g;
and it worked again! So it seems Perl requires my string to be in UTF-8, at least to recognize the special \n newline character. But doesn't this prevent me from properly handling the broad range of Unicode characters in other regexes, beyond the one that removes \n? So I tried to get back to full Unicode processing in my regexes:
$internal_format_string =~ s/\x{0A}//g;
This failed as well (maybe because I am using the wrong syntax for the hex escape, or because the string has to be addressed by Unicode code point rather than in hex? \u{000A} failed too).
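
A minimal sketch for checking the escape syntax in isolation, away from the rest of the script. It assumes an ASCII-based platform, where "\n" is the character with code point 0x0A, and it uses a made-up sample string; it does not reproduce the failure described above:

```perl
use strict;
use warnings;

# Hypothetical sample: a character string containing U+00E9 and two newlines.
my $sample = "caf\x{E9}\nau lait\n";

# On ASCII-based platforms "\n" and "\x{0A}" denote the same character.
my $same = ("\n" eq "\x{0A}") ? 1 : 0;

# Apply both substitutions to separate copies of the sample.
(my $via_n   = $sample) =~ s/\n//g;
(my $via_hex = $sample) =~ s/\x{0A}//g;

print "escapes identical: $same\n";
print "results identical: ", ($via_n eq $via_hex ? 1 : 0), "\n";
```

If both lines print 1, the hex escape syntax itself is fine and the problem lies somewhere else in the processing.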

So what should I do?
Should I use regexes on scalars containing decoded Unicode ("internal format") data, or on scalars containing UTF-8-encoded data?
Should my "script-internal standard" be decoded Unicode, or UTF-8-encoded Unicode?

(To make matters worse, the perlfaq says the "internal format" is UTF-8-encoded Unicode, but that I should forget about that. Now, SHOULD I?)