problems with extended ascii characters in filenames

zentara has asked for the wisdom of the Perl Monks concerning the following question:

Hi, while testing Tk-thumbnail-viewer, I came across a problem that I can't resolve. If you descend into the image subdir art/Paintings/Van_Gogh, there is a file named

Van_Gogh__Landscape_at_Saint_RÃ©my .png

as seen in Midnight Commander. If I do a dir, I see

Van_Gogh__Landscape_at_Saint_R\303\251my\ .png

Needless to say, when I try to open the file in my Tk program, it complains that "no such file exists". There are a few other files named this way, in the clipart package, and I want to know how to deal with them, so they will load in Perl.

My C-based apps seem to find it.

I tried it from a gtk2 version of my viewer, and I get

***   Failed to open file './wpclipart_src-1.7/art/Paintings/Van_Gogh/Van_Gogh__Landscape_at_Saint_RÃÂ©my .png': No such file or directory at ./gtk2_thumbnail_viewer1 line 272.

Can anyone explain this for me?

I'm not really a human, but I play one on earth. flash japh

Re: problems with extended ascii characters in filenames by graff (Chancellor) on Mar 16, 2006 at 02:11 UTC
Somehow the RÃ©my turned into RÃÂ©my?

The original file name (as read from the directory entry) has a single "é" character, small letter e with acute, encoded in utf8 (which makes it a two-byte sequence: 0xc3 0xa9). The way it turns into the apparent three-character sequence you see can be demonstrated like this:

`perl -CS -e 'print chr(0xc3),chr(0xa9)' | od -txC`

which outputs `0000000 c3 83 c2 a9` followed by `0000004`. (The one-liner takes the two single-byte "characters" and converts each of them to a utf8 byte sequence on output.)

Now, if you look at that four-byte sequence in an iso-8859-1 type of display window, you'll see only the three characters you mentioned (Ã Â ©), because those are the letters associated in 8859-1 with the first, third and fourth byte values. The second byte of the four (0x83) is a control character in 8859-1 and won't be visible at all (the display window just ignores it).

So the problem is that a utf8 wide character contained in a directory entry is being treated as if it were iso-8859-1 (or cp1252, which would be equivalent for the original two-byte sequence), and is being converted to utf8 a second time. You may be able to keep that from happening by flagging the file name as a utf8 string yourself, as soon as you read it from the directory, e.g.:

`use Encode; opendir( D, $path ); @datafiles = grep { -f "$path/$_" } readdir( D ); $_ = decode( 'utf8', $_ ) for ( @datafiles );`

The decode function won't really alter the file name strings at all (unless there happen to be bytes that are neither ASCII nor part of a valid utf8 character); it simply sets the utf8 flag on the scalars holding the strings. Once perl knows the strings are utf8 (because they are flagged as such), nothing else downstream is likely to convert them to utf8 again and thereby screw them up (which is what is happening now).

(updated to fix grammar) (also updated last code snippet, to fix array name)
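The double encoding described above, and the effect of decoding first, can be reproduced without touching the file system. This is only a minimal sketch: the byte string below stands in for what `readdir` would return, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the raw bytes readdir returns for "Rémy" (assumption:
# the file system stores the name as UTF-8).
my $bytes = "R\xc3\xa9my";

# What goes wrong: each byte is taken as a Latin-1 character and encoded
# to utf8 again, so 0xc3 0xa9 becomes 0xc3 0x83 0xc2 0xa9.
my $double = encode( 'utf8', $bytes );
print "double-encoded: ",
    join( ' ', map { sprintf( '%02x', ord $_ ) } split //, $double ), "\n";

# The fix: decode the bytes first, giving a 4-character string in which
# "é" is the single character U+00E9.
my $chars = decode( 'utf8', $bytes );
print "decoded length: ", length($chars), "\n";

# Encoding the decoded string once gives back the original bytes.
my $round = encode( 'utf8', $chars );
print "round trip ok\n" if $round eq $bytes;
```

The first line of output shows the same `c3 83 c2 a9` sequence that `od` printed for the one-liner.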
Re^2: problems with extended ascii characters in filenames by wfsp (Abbot) on Mar 16, 2006 at 07:12 UTC
graff++ I have similar fights with HTML. While I know the source of the problem is utf8 related I've never really been able to get to the bottom of it. I still come across both RÃ©my and RÃÂ©my! This is the first time I've come across such a clear, straightforward explanation of what is actually happening. Hopefully, armed with your insights, I now have at least half a chance of avoiding these "screw ups" in future. Many thanks! wfsp
Re^3: problems with extended ascii characters in filenames by fraktalisman (Hermit) on Mar 16, 2006 at 13:14 UTC
As for HTML and Perl source code: once you start using UTF-8 here, you must not re-save the same files from text editors which do not yet support UTF-8, otherwise the extended characters in the source text get messed up. There are unfortunately still quite a lot of programs which only support Latin-1 (iso-8859-1) encoding. In HTML, you could get around the problem with the classic solution of the nineties: writing HTML entities, like `&eacute;` for é etc. For the same backward compatibility reason, I usually avoid any non-ASCII character (i.e. ord($char)>127) in filenames. fraktalisman keeps rolling
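A rough sketch of the entity workaround, assuming the text is already a decoded character string. The helper name `to_ascii_html` is hypothetical, and it emits numeric character references rather than named entities such as `&eacute;`; it does not escape `&`, `<` or `>`.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above 127 with a numeric
# character reference, so the result is plain ASCII.
sub to_ascii_html {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/sprintf( '&#%d;', ord $1 )/ge;
    return $text;
}

print to_ascii_html("R\x{e9}my"), "\n";    # prints R&#233;my
```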
Re^2: problems with extended ascii characters in filenames by saberworks (Curate) on Mar 16, 2006 at 17:31 UTC
There are some good articles about character encoding here: http://www.joelonsoftware.com/articles/Unicode.html http://www.phpwact.org/php/i18n/charsets
Re^2: problems with extended ascii characters in filenames by zentara (Archbishop) on Mar 16, 2006 at 15:57 UTC
Thanks for the excellent lesson, graff. It works. I was wondering why you used the syntax `$_ = decode( 'utf8', $_ ) for ( @datafiles );` instead of `@files = map { decode( 'utf8', $_ ) } @files;`. Does it matter?
Re^3: problems with extended ascii characters in filenames by graff (Chancellor) on Mar 16, 2006 at 22:57 UTC
It doesn't matter.
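A quick way to check that the two forms give the same result (a sketch only; the sample names are made up):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample of raw file names, one with a UTF-8 encoded "é".
my @raw = ( "R\xc3\xa9my.png", "plain.png" );

# Statement-modifier form: $_ aliases each element, so the array is
# changed in place.
my @in_place = @raw;
$_ = decode( 'utf8', $_ ) for @in_place;

# map form: builds a new list, which is then assigned to an array.
my @mapped = map { decode( 'utf8', $_ ) } @raw;

print "same result\n" if "@in_place" eq "@mapped";
```

The practical difference is only that the `for` form modifies the existing array, while `map` builds a fresh list before assignment.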
Re: problems with extended ascii characters in filenames by duckyd (Hermit) on Mar 17, 2006 at 00:42 UTC
there's a pretty good explanation on some guy's blog too
Re^2: problems with extended ascii characters in filenames by zentara (Archbishop) on Mar 17, 2006 at 11:35 UTC
Yeah, there are 3 informative Perl-related entries there.
