
Somehow the RÃ©my turned into RÃÂ©my ??

The original file name (as read from the directory entry) has a single "é" character, small letter e with acute, encoded in utf8 (which makes it a two-byte sequence: 0xc3 0xa9).

The way it turns into the apparent three character sequence you see can be demonstrated like this:

perl -CS -e 'print chr(0xc3),chr(0xa9)' | od -txC

# outputs:

0000000    c3  83  c2  a9
0000004

(The one-liner is taking the two single-byte "characters", and converting them to utf8 byte sequences on output.)

Now, if you look at that four-byte sequence on an iso-8859-1 type of display window, you'll see only the three characters you mentioned (Ã Â ©), because those are the letters associated in 8859-1 with the first, third and fourth byte values. The second byte of the four (0x83) falls in the range that 8859-1 reserves for control characters, so it won't be visible at all (the display window just ignores it).

So the problem is that the two bytes of a utf8 character read from a directory entry are being treated as if they were two iso-8859-1 characters (or cp1252, which would be equivalent for the original two-byte sequence), and each of them is then being converted to utf8.
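The same double encoding can be reproduced inside a script with the Encode module. This is only a sketch of the mechanism, not necessarily what the program in question does internally; the variable names and the hex_bytes helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative helper: render a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02x', ord($_) ) } split //, $bytes;
}

# The raw bytes a directory entry would hold for "é" encoded in utf8.
my $raw = "\xc3\xa9";

# The mistake: read each byte as an iso-8859-1 character, then encode to utf8.
my $double     = encode( 'UTF-8', decode( 'iso-8859-1', $raw ) );
my $double_hex = hex_bytes($double);
print "$double_hex\n";    # c3 83 c2 a9

# Running the two steps in reverse recovers the original two bytes.
my $repaired = encode( 'iso-8859-1', decode( 'UTF-8', $double ) );
print hex_bytes($repaired), "\n";    # c3 a9
```

The reverse step is shown only to confirm the diagnosis; it is safer to stop the double encoding from happening than to undo it afterwards.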

You may be able to keep that from happening, by flagging the file name as being a utf8 string yourself, as soon as you read it from the directory -- e.g.:

use Encode;

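opendir( D, $path ) or die "Cannot open $path: $!";
# readdir returns bare names, so test each one relative to $path
@datafiles = grep { -f "$path/$_" } readdir( D );
closedir( D );
$_ = decode( 'utf8', $_ ) for ( @datafiles );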

The decode function won't really alter the file name strings at all (unless there happen to be bytes that are neither ASCII nor part of a valid utf8 character); it simply sets the utf8 flag on the scalars holding the strings.
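A small sketch of what changes after decode, assuming a sample name "Rémy" given as raw utf8 bytes in place of a real readdir result (the variable names are illustrative). The stored bytes are the same, but perl now counts characters rather than bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a file name as readdir would return it: raw utf8 bytes.
my $name_bytes = "R\xc3\xa9my";
my $name_chars = decode( 'utf8', $name_bytes );

# Before decoding, perl sees five one-byte characters; afterwards, four.
print length($name_bytes), "\n";    # 5
print length($name_chars), "\n";    # 4

# utf8::is_utf8 reports whether the internal utf8 flag is set on a scalar.
print utf8::is_utf8($name_bytes) ? "flagged\n" : "not flagged\n";    # not flagged
print utf8::is_utf8($name_chars) ? "flagged\n" : "not flagged\n";    # flagged
```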

Once perl knows the strings are utf8 (because they are flagged as such), nothing else downstream is likely to convert them to utf8 again and thereby screw them up (which is what is happening now).
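To see the difference downstream, here is a sketch that writes both versions of a name through a utf8 output layer. An in-memory file handle stands in for whatever utf8-enabled output the real program uses, and the written_bytes and hex_bytes helpers are hypothetical names for this example only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative helper: render a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02x', ord($_) ) } split //, $bytes;
}

# Illustrative helper: print a string through a utf8 layer into an
# in-memory file and return the bytes that were actually written.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open( my $fh, '>:encoding(UTF-8)', \$buffer )
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return $buffer;
}

my $name_bytes = "R\xc3\xa9my";                     # as read from the directory
my $name_chars = decode( 'utf8', $name_bytes );     # flagged as utf8

# Unflagged bytes get encoded a second time on output.
my $undecoded_hex = hex_bytes( written_bytes($name_bytes) );
print "$undecoded_hex\n";    # 52 c3 83 c2 a9 6d 79

# The decoded string comes out as the original five bytes.
my $decoded_hex = hex_bytes( written_bytes($name_chars) );
print "$decoded_hex\n";      # 52 c3 a9 6d 79
```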

(updated to fix grammar)
(also updated last code snippet, to fix array name)

In reply to Re: problems with extended ascii characters in filenames by graff
in thread problems with extended ascii characters in filenames by zentara
