
Japanese filenames and USING_WIDE in win32.h

by almut (Canon)
on Nov 14, 2006 at 22:28 UTC ( [id://584078]=perlquestion )

almut has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

sorry to bother you once again with my "win32 japanese filenames" problem -- I'm still struggling with the same old issue...

Short recap: for various reasons I need to upgrade a Japanese site with a rather large codebase to a current Perl. At the moment they're still using jperl, based on v5.005. Ideally, they wouldn't have to modify any existing scripts, so the idea is to provide a compatibility module which makes Perl 5.8.8 emulate jperl behaviour as closely as possible. jperl apparently does not use Unicode internally, so a number of the encoding-related issues I'm trying to cope with now simply didn't exist there. The most prominent problem is dealing with filenames.

After having asked here for better ideas, I decided to try wrapping all built-ins that take or return filenames (example here). However, this turned out to be more difficult than I'd hoped, mainly because

  • not all filename related built-ins can be overridden in a way that would make the wrapper behave identically to the built-in 1
  • the filetest built-ins (-d, -f, -r, etc.) are not overridable at all 2
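For readers who haven't seen the wrapper approach, here is a minimal sketch of what overriding one filename built-in might look like. It assumes filenames inside the script are character strings and that the OS expects CP932 bytes; to_os() is a hypothetical helper name, and only unlink is wrapped. It does nothing for the filetest operators, which is exactly the limitation described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: convert a character string to the legacy encoding
# the OS is assumed to expect (CP932 here).
sub to_os { return encode('cp932', $_[0]) }

BEGIN {
    # Overrides must be installed at compile time, before the code that
    # calls the built-in is compiled.
    no warnings 'once';
    *CORE::GLOBAL::unlink = sub {
        my @names = @_ ? @_ : ($_);    # the built-in falls back to $_
        return CORE::unlink(map { to_os($_) } @names);
    };
}

# Demo with an ASCII name so the sketch runs on any platform.
my $name = "wrapper_demo_$$.tmp";
open my $fh, '>', $name or die "cannot create $name: $!";
close $fh or die "cannot close $name: $!";

my $removed = unlink $name;            # dispatched to the wrapper above
print "removed: $removed\n";
```

The same pattern would have to be repeated for every overridable built-in that takes a filename, and the reverse conversion applied to those that return one (readdir, glob, and so on).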

So, looking for alternative approaches, I started digging through the 5.8.8 sources and actually did find a #define which at first looked like an almost ideal solution to my problem: In win32/win32.h:477 there's

#define USING_WIDE() (0)

Setting this to '1' enables code which calls MultiByteToWideChar() under the hood (and WideCharToMultiByte() for the other direction) in all the relevant places of the win32 specific code. As far as I can tell, this function is being fed UTF-8 strings, so calling MultiByteToWideChar(CP_UTF8, ... ) (as is being done) would seem to do the Right Thing -- at least in my case, with use encoding 'cp932' in the scripts. In fact, it works pretty well (though I might not yet have discovered some obscure pitfalls).

Apparently, Windows can take both a wide-char unicode (UCS-2) string and a legacy encoded (CP932) byte sequence as filename. It seems some internal conversions are going on depending on whether the API function is being fed a wide-char string or not. It can't handle UTF-8 strings, though -- which is why the filename ends up as garbage when USING_WIDE is not enabled. In that case I'd have to do explicit manual conversions to CP932 (the unsuccessful wrapper approach mentioned above).

In short, the situation is as follows:

  • literal filenames in the script are specified in legacy encoding CP932
  • upon reading the script, Perl converts them to unicode (due to use encoding 'cp932')
  • when Perl needs to pass those filenames to the OS, we currently have three options:
    • default behaviour: Perl just passes through its UTF-8 strings without any conversion (doesn't work)
    • explicitly coded conversions to legacy encoding
    • automatic conversion to wide-char with USING_WIDE enabled

(similarly the other way round, of course, when filenames are being received from the OS)
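To make the three options concrete, here is a small sketch using the Encode module. It only shows which byte sequences each option would produce for a two-character name; it does not call any Windows API, and the hex_bytes() helper is a name made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# CP932 bytes for the characters U+65E5 U+672C, as they would sit in a
# script file saved in the legacy encoding.
my $cp932_bytes = "\x93\xFA\x96\x7B";

# Roughly what "use encoding 'cp932'" does to a string literal.
my $chars = decode('cp932', $cp932_bytes);

# The three candidate byte sequences for the OS call:
my $utf8 = encode('UTF-8',    $chars);   # default pass-through of UTF-8
my $ansi = encode('cp932',    $chars);   # explicit conversion to legacy encoding
my $wide = encode('UTF-16LE', $chars);   # what a wide-char API expects

print 'UTF-8:    ', hex_bytes($utf8), "\n";   # E6 97 A5 E6 9C AC
print 'CP932:    ', hex_bytes($ansi), "\n";   # 93 FA 96 7B
print 'UTF-16LE: ', hex_bytes($wide), "\n";   # E5 65 2C 67
```

Only the last two match something Windows understands as a filename; the first is what arrives when no conversion takes place.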

Not having to convert the filenames back to CP932 every time surely looks like the better solution, because then, no wrappers are necessary (except for system() and the like, which are unaffected by USING_WIDE 3).

Unfortunately, USING_WIDE is deprecated, and that "dead code" has apparently already been removed from the development branch. I understand that this code is a leftover from previous Perl releases, where a different approach to Unicode support was being tried. But why remove it entirely, without replacing it with something more appropriate? Judging from my current difficulties in the Japanese Windows environment, it doesn't quite look like we could say "we don't need that any longer now"...

In a related thread it was suggested to use Win32API::File instead. However, although the module does provide some wide-character functionality, all in all it doesn't seem to be applicable to my specific requirements (if you know better, please show me how).

To sum up, even if I could still make use of USING_WIDE in 5.8.8, it doesn't seem like a good idea, due to foreseeable maintainability issues in the long run. So, I'm kinda back at square one... :)

Essentially, my Japanese folks would just like to be able to do basic things like

my $path = "some-path-in-japanese-CP932";
if (-d $path) {
    chdir $path;
    my $tmpfile = "some-name-in-CP932";
    system("some_cmd \"other-filename-in-CP932\" $tmpfile");
    unlink $tmpfile if !-s $tmpfile;
    # etc...
}

And, as someone generally advocating Perl, I'd rather not have to admit "this cannot be done in Perl", by telling them to resort to writing

use Encode;
sub U2A { return encode('cp932', shift) }

my $path = "some-path-in-japanese-CP932";
if (-d U2A($path)) {
    chdir U2A($path);
    my $tmpfile = "some-name-in-CP932";
    system(U2A("some_cmd \"other-filename-in-CP932\" $tmpfile"));
    unlink U2A($tmpfile) if !-s U2A($tmpfile);
    # etc...
}

i.e. calling a conversion routine in each and every place where some Perl built-in involving filenames is being used.4

For one, this would mean that all existing jperl scripts would have to be modified (and tested again). Secondly, this doesn't exactly look like the most elegant abstraction you could think of... ;)

Anyway, what I'm dreaming of is something like being able to say use filenames "cp932" or use filenames "utf8" and then having Perl automagically take care of all necessary conversions behind the scenes whenever filenames are being passed to/from the OS. Somewhat like you can say use encoding "cp932" to have Perl parse the script source correctly.

I so far haven't found a way to achieve something similar. But hopefully it's just me not getting it... If so, please enlighten me!

The arguments I've found are typically along the lines of filenames being external to Perl, and thus not being subject to what Perl could or should take care of. However, I don't see in what way filenames are any more "external to Perl" than the contents of files (for which there is the very neat and flexible PerlIO layer).
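For comparison, this is the kind of convenience PerlIO already gives for file contents. The sketch below writes two characters through an :encoding(cp932) layer and reads the raw bytes back; the filename is deliberately ASCII, since filenames are precisely what the layer does not cover.

```perl
use strict;
use warnings;

my $file = "perlio_demo_$$.txt";       # ASCII filename on purpose

# Write characters; the layer converts them to CP932 on the way out.
open my $out, '>:encoding(cp932)', $file or die "open for write: $!";
print {$out} "\x{65E5}\x{672C}\n";
close $out or die "close: $!";

# Read the same file without any decoding to see what landed on disk.
open my $raw, '<:raw', $file or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it again through the layer to get characters back.
open my $in, '<:encoding(cp932)', $file or die "open for read: $!";
my $line = <$in>;
close $in;
unlink $file;

printf "on disk: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "decoded: %d characters plus newline\n", length($line) - 1;
```

One declaration on the handle takes care of every read and write; nothing comparable exists for the name passed to open itself.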

I don't think we need a fully automatic approach to handling filenames (i.e. autodetecting what encodings are being used and such), just a moderately convenient way to configure it... Does anyone know what the future plans in Perl development are in this regard?

Sorry about the length, and thanks for reading this far :)
Almut

________

1  for example, the syntax of the system() built-in cannot be expressed as a perl prototype (in particular the "indirect object" syntax without a comma after the first argument)

2  actually, a related patch had been posted to p5p, but apparently it didn't get accepted (due to yet unresolved prototyping issues, it seems).

3  system(), exec() and qx() belong to a somewhat different category. Here, it's not clear what argument (or part thereof) could possibly contain a filename. So, doing automatic conversions might not necessarily be what you'd want to happen by default...

4  of course, this could typically be simplified somewhat:

...
my $path = U2A("some-path-in-japanese-CP932");
if (-d $path) {
    chdir $path;
    my $tmpfile = U2A("some-name-in-CP932");
    system(U2A("some_cmd \"other-filename-in-CP932\" ").$tmpfile);
    unlink $tmpfile if !-s $tmpfile;
}

but then you'd have to think carefully about when exactly to convert the strings, because from that point onwards you can no longer work with them in a character-based fashion, as needed in regex matching, etc. Additionally, you'd have to be careful not to inadvertently upgrade strings back to utf8 when concatenating them with other strings.
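The upgrade pitfall can be shown in a few lines. This is a sketch under the assumption that the OS call receives the string's internal buffer, as described above for the default behaviour; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode('cp932', "\x{65E5}\x{672C}");   # 4 CP932 bytes

# Concatenating with a plain byte string keeps the result a byte string.
my $still_bytes = $bytes . '.txt';

# Concatenating with a character string (here U+8A9E) upgrades the whole
# result: each CP932 byte is now treated as a Latin-1 character.
my $upgraded = $bytes . "\x{8A9E}";

# Copy and encode to count the bytes of the internal UTF-8 buffer.
my $internal = $upgraded;
utf8::encode($internal);

printf "still_bytes flagged: %s\n", utf8::is_utf8($still_bytes) ? 'yes' : 'no';
printf "upgraded flagged:    %s\n", utf8::is_utf8($upgraded)    ? 'yes' : 'no';
printf "upgraded: %d characters, %d internal bytes\n",
    length($upgraded), length($internal);
```

After the upgrade the original CP932 byte sequence no longer exists in the buffer, so a filename built this way would not match what is on disk.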

Replies are listed 'Best First'.
Re: Japanese filenames and USING_WIDE in win32.h
by demerphq (Chancellor) on Nov 15, 2006 at 00:14 UTC

    The entire Win32 abstraction layer for perl is implemented in win32/win32.c and win32/win32io.c. Ideally you would put together a set of patches against those files that would allow a pragma to control the behaviour of widechars. For instance

    use widechar;

    would do the trick. The problem you are going to face is that there are only about three or four active Perl developers who are on Win32, and they are unlikely to undertake this stuff without help from an interested party who can do things like test. So for instance if I saw you post a set of patches aimed at this objective, but not quite polished enough to be applied, I'd probably run with the ball and help you get them bedded down. The fact that USING_WIDE was removed probably indicates that whoever was most knowledgeable about the feature felt it was dangerous to leave it in. Which IMO suggests that there is an alternate approach that would be fine.

    From what I can see, the best way to get your clients what they want is to get it done in perl itself. Which means interacting with the perl5porters list. Then when 5.10 comes out (this Christmas hopefully) they can use the nice new shiny stuff that you helped put in it. :-)

    ---
    $world=~s/war/peace/g

      Thank you very much for your encouragement, demerphq.

      I'll think about it. And if I should actually decide to go for writing a patch, I'll of course try to submit it -- though I'm not too optimistic about getting it accepted...

      I mean, realistically, why would any of you busy gurus want to spend time investigating a patch from a lil' girl without any credits in the perl community? In particular, as similar code has just recently been thrown out... and things overall don't exactly look like the whole world's been desperately waiting for this patch ;)

      Also, although I'm not entirely new to C coding, I've never seriously attempted hacking Perl's internals (closest I ever got was writing a few XS modules), and I have due respect for all the concepts and conventions that have evolved over time. At least it'll take me a while to catch up... Anyway, I'll still look into it.

      Well, I guess I should start lurking on p5p, to get a better feel for how things are being handled over there...

        I mean, realistically, why would any of you busy gurus want to spend time investigating a patch from a lil' girl without any credits in the perl community?

        Because the patch does things that are of value to the community, and because we don't judge the merit of a patch based on your reputation in the perl community. Now if you go on perl5porters and write a bunch of replies to bug reports you don't understand then you will quickly end up in the community killfile, but if you step up with a patch that is sufficient for the perl core developers to polish up and bed down, that's a totally different story.

        Seriously, if the community sees you trying to do the right thing but with some rough edges they will step up to help you. We don't care who you are, we care what you contribute and whether your ideas are sound.

        And frankly we need people like you. Just based on your original post it sounds like you have some worthy contributions to make.

        Lastly: there aren't that many Win32 developers active on p5p; since you apparently are one, I reckon you'd find yourself a lot more welcome than you realize.

        ---
        $world=~s/war/peace/g

Re: Japanese filenames and USING_WIDE in win32.h
by Anonymous Monk on Nov 15, 2006 at 00:39 UTC
    perl581delta
    (Win32) The -C Switch Has Been Repurposed
    The -C switch has changed in an incompatible way. The old semantics of this switch only made sense in Win32 and only in the ``use utf8'' universe in 5.6.x releases, and do not make sense for the Unicode implementation in 5.8.0. Since this switch could not have been used by anyone, it has been repurposed. The behavior that this switch enabled in 5.6.x releases may be supported in a transparent, data-dependent fashion in a future release.

    For the new life of this switch, see UTF-8 no longer default under UTF-8 locales, and -C in the perlrun manpage.

Re: Japanese filenames and USING_WIDE in win32.h
by mattr (Curate) on Nov 15, 2006 at 14:12 UTC
    Hello,

    I do not know the exact answer to your question since, although I have used Perl in Japanese environments, it was not jperl. However I can point you in a couple of directions.

    I just got an email announcement (in Japanese, tell me if you want to see it) from Masayuki Moriyama (moriyama at miraclelinux.com) that he has published Encode::ISO2022JPMS, apparently part of a bounty to convert legacy encodings for perl and some other systems, which allows ISO-2022-JP-MS to be used on perl 5.8.x. See http://sourceforge.jp/projects/legacy-encoding (and click the Project Homepage link to see the Wiki, in Japanese). The email specifically mentions conversion of CP932 and some popular double byte Japanese symbols like the wide tilde mark. It may be needed for your file names.

    Also I think you may not be alone in this and you should probably send an email (English is okay) to one of the best known encoding guys, Dan Kogai (maintainer of the Encode module). Incidentally I was reading the Encode::PerlIO pod and while it looks like you can do wild encoding on a filehandle I can't tell what happens to filenames. Encode claims to provide a layer that can alter encodings, which if true might be good for you.

    Anyway, I hope this is not worse than useless but as someone mentioned altering the course of perl itself, I think you ought to talk to the people like Dan who work hard on making the world safe for Japanese perl programmers, or vice versa. Anyway I presume you know all this stuff but before diving into the guts of it you might like to ask (if you haven't already) the Japanese guys what they think. They may also throw their hands up in the air though... good luck. My guess is you should just buckle down and rewrite all those programs. It is easier I think than changing perl and maybe having to check if your deep mods continue to work in future versions of perl.

    Matt

Re: Japanese filenames and USING_WIDE in win32.h
by Anonymous Monk on Nov 15, 2006 at 12:39 UTC
    Indirect syntax for system: Declare UNIVERSAL::system and use it for that syntax only. The other syntax is caught with a prototyped system. Yes, there are two functions. So what?
