PerlMonks
Japanese filenames and USING_WIDE in win32.h
by almut (Canon) on Nov 14, 2006 at 22:28 UTC ( [id://584078]=perlquestion )
almut has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

sorry to bother you once again with my "win32 japanese filenames" problem -- I'm still struggling with the same old issue...

Short recap: for various reasons I need to upgrade a Japanese site with a rather large codebase to a current Perl. At the moment they're still using jperl, based on v5.005. Ideally, they wouldn't have to modify any existing scripts, so the idea is to provide a compatibility module which makes Perl-5.8.8 emulate jperl behaviour as closely as possible. jperl apparently does not use unicode internally, so a number of the encoding-related issues I'm trying to cope with now simply didn't exist there.

The most prominent problem is dealing with filenames. After having asked here for better ideas, I decided to try wrapping all built-ins that take or return filenames (example here). However, this turned out to be more difficult than I'd hoped, mainly because
So, looking for alternative approaches, I started digging through the 5.8.8 sources and actually did find a #define which at first looked like an almost ideal solution to my problem: In win32/win32.h:477 there's
Setting this to '1' enables code which calls MultiByteToWideChar() under the hood (and WideCharToMultiByte() for the other direction) in all the relevant places of the win32 specific code. As far as I can tell, this function is being fed UTF-8 strings, so calling MultiByteToWideChar(CP_UTF8, ... ) (as is being done) would seem to do the Right Thing -- at least in my case, with use encoding 'cp932' in the scripts. In fact, it works pretty well (though I might not yet have discovered some obscure pitfalls).

Apparently, Windows can take both a wide-char unicode (UCS-2) string and a legacy encoded (CP932) byte sequence as a filename. It seems some internal conversions are going on depending on whether the API function is being fed a wide-char string or not. It can't handle UTF-8 strings, though -- which is why the filename ends up as garbage when USING_WIDE is not enabled. In that case I'd have to do explicit manual conversions to CP932 (the unsuccessful wrapper approach mentioned above).

In short, the situation is as follows:
(similarly the other way round, of course, when filenames are being received from the OS)

Not having to convert the filenames back to CP932 every time surely looks like the better solution, because then no wrappers are necessary (except for system() and the like, which are unaffected by USING_WIDE [3]).

Unfortunately, USING_WIDE is deprecated, and that "dead code" has apparently already been removed from the development branch. I understand that this code is a leftover from previous Perl releases, where a different approach to unicode support was being tried, etc. But why remove it entirely, without replacing it with something more appropriate? Judging from my current difficulties in the Japanese Windows environment, it doesn't quite look like we could say "we don't need that any longer now"...

In a related thread it was suggested to use Win32API::File instead. However, although the module does provide some wide-character functionality, all in all it doesn't seem to be applicable to my specific requirements (if you know better, please show me how).

To sum up, even if I could still make use of USING_WIDE in 5.8.8, it doesn't seem like a good idea, due to foreseeable maintainability issues in the long run. So, I'm kinda back at square one... :)

Essentially, my Japanese folks would just like to be able to do basic things like
And, as someone generally advocating Perl, I'd rather not have to admit "this cannot be done in Perl", by telling them to resort to writing
i.e. calling a conversion routine in each and every place where some Perl built-in involving filenames is being used. [4] For one, this would mean that all existing jperl scripts would have to be modified (and tested again). Secondly, this doesn't exactly look like the most elegant abstraction you could think of... ;)

Anyway, what I'm dreaming of is something like being able to say use filenames "cp932" or use filenames "utf8" and then having Perl automagically take care of all necessary conversions behind the scenes whenever filenames are being passed to/from the OS. Somewhat like you can say use encoding "cp932" to have Perl parse the script source correctly. I so far haven't found a way to achieve something similar. But hopefully it's just me not getting it... If so, please enlighten me!

The arguments I've found are typically along the lines of filenames being external to Perl, and thus not being subject to what Perl could or should take care of. However, I don't see in what way filenames are any more "external to Perl" than the contents of files (for which there is the very neat and flexible PerlIO layer). I don't think we need a fully automatic approach to handling filenames (i.e. autodetecting what encodings are being used and such), just a moderately convenient way to configure it...

Does anyone know what the future plans in Perl development are in this regard?

Sorry about the length, and thanks for reading this far :)

________

[1] for example, the syntax of the system() built-in cannot be expressed as a perl prototype (in particular the "indirect object" syntax without a comma after the first argument)

[2] actually, a related patch had been posted to p5p, but apparently it didn't get accepted (due to yet unresolved prototyping issues, it seems).

[3] system(), exec() and qx() belong to a somewhat different category. Here, it's not clear which argument (or part thereof) could possibly contain a filename. So, doing automatic conversions might not necessarily be what you'd want to happen by default...

[4] of course, this could typically be simplified somewhat:
but then you'd have to carefully think about when exactly to convert the strings, because from that point onwards you can no longer work with them in a character-based fashion, as needed in regex matching, etc. Additionally, you'd have to be careful not to inadvertently upgrade the strings back to utf8 when concatenating them with other strings.
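To make those two pitfalls a bit more concrete, here's a minimal sketch, assuming the core Encode module and its 'cp932' encoding. The helper fs_name() and the sample names are made up for illustration only; they're not part of Perl, jperl or any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: character string -> CP932 bytes, i.e. the form the
# non-wide Win32 API functions expect for a filename.
sub fs_name {
    my ($chars) = @_;
    return encode('cp932', $chars);
}

# A Japanese filename and directory name as character strings, written with
# \x{...} escapes so the sketch doesn't depend on the source encoding.
my $name = "\x{65E5}\x{672C}\x{8A9E}.txt";
my $dir  = "\x{30C7}\x{30FC}\x{30BF}";

my $bytes = fs_name($name);

# Character-based matching works before the conversion, but not after it.
my $chars_match = $name  =~ /^\x{65E5}/ ? 1 : 0;
my $bytes_match = $bytes =~ /^\x{65E5}/ ? 1 : 0;

# Wrong: joining the converted bytes with a character string upgrades them,
# so every CP932 byte is now treated as a Latin-1 character.
my $path_bad = $dir . "\\" . $bytes;

# Safer: do all string work on characters and convert once, at the very end.
my $path_ok = fs_name($dir . "\\" . $name);

printf "name: %d chars, %d bytes in CP932\n", length($name), length($bytes);
printf "regex on chars: %d, on bytes: %d\n", $chars_match, $bytes_match;
printf "bad path is utf8-flagged: %s\n", utf8::is_utf8($path_bad) ? "yes" : "no";
printf "ok path is utf8-flagged:  %s\n", utf8::is_utf8($path_ok)  ? "yes" : "no";

# The converted path would then be handed to the built-in, e.g.:
#   open(my $fh, '<', $path_ok) or die "Can't open: $!";
```

The point of the sketch is only the ordering: keep everything as characters for as long as possible, convert exactly once right before the built-in is called, and never mix the converted bytes with character strings afterwards.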