evaling unicode perl source

gildir has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I want to write a simple embedded Perl processor. No problem here: I'll write a pattern, separate the Perl code from the rest, and call eval() on it. Now the problem: the file I evaluate is encoded in Unicode, either UTF-8 or UTF-16.

How do I evaluate UTF-16 Perl source? In the 'normal' case, "print 'foo'" will be encoded as "\0p\0r\0i\0n\0t\0 \0'\0f\0o\0o\0'", and that won't eval because every "\0" character will effectively end a string. Another problem is how to run a pattern on a UTF-8 or UTF-16 string.

Recoding the source to an 8-bit charset before evaling it will not work, because some national characters could be lost in the conversion.


Re: evaling unicode perl source by John M. Dlugosz (Monsignor) on Oct 08, 2001 at 20:20 UTC
If you are running NT/2000, there is a Win32 API that will do the UTF-16 to UTF-8 conversion. It's not present in Win9x, though, and it has problems with its handling of illegal codes, so I have my own C++ function, UCS2_to_UTF8, written in assembly language. UTF-8 is Perl's native mode. Put "use utf8" in effect before the RE is parsed, and it will work just fine. --John The Win32 Saint
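For readers on a current Perl, the decode-then-eval approach might look something like the sketch below. It is not the code from this reply: it assumes Perl 5.16 or later (for the unicode_eval feature) and the core Encode module, and the helper name eval_utf16_source is made up for illustration. It also assumes the UTF-16 input starts with a byte order mark.

```perl
use strict;
use warnings;
use feature 'unicode_eval';    # Perl 5.16+: string eval treats its argument as characters
use Encode qw(decode encode);

# Hypothetical helper (name is an assumption): decode UTF-16 bytes into
# Perl characters, then eval the resulting string as Perl source.
sub eval_utf16_source {
    my ($bytes) = @_;
    # The 'UTF-16' encoding relies on the byte order mark to pick the byte order.
    my $code = decode('UTF-16', $bytes);
    my $result = eval $code;
    die "eval failed: $@" if $@;
    return $result;
}

# Simulated input: a snippet containing non-ASCII characters, encoded as
# UTF-16 with a BOM, standing in for the contents of a UTF-16 file.
my $bytes = encode('UTF-16', "my \$s = 'na\x{ef}ve \x{263A}'; length \$s");
print eval_utf16_source($bytes), "\n";
```

Because the bytes are decoded into characters first, no "\0" padding ever reaches eval, and no characters are lost to an 8-bit charset.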
Re: Re: evaling unicode perl source by mamut (Sexton) on Oct 09, 2001 at 18:03 UTC
-=- MamuT -=- Is it the same on Unix-like systems such as Solaris or Linux?
Re: Re: Re: evaling unicode perl source by John M. Dlugosz (Monsignor) on Oct 09, 2001 at 19:44 UTC
Is "it" the same? If you mean will Perl swallow UTF-8 and handle UTF-8 sequences as single characters in REs, then yes. Is there a function in the OS to convert UCS-2 or UTF-16 into UTF-8? I don't know. Will my function work? Only on x86 machines. However, the reference implementation in the Unicode book is written in portable C and runs on anything. --John
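As a portable alternative to an OS-specific or x86-only converter, a later Perl with the core Encode module can do the conversion itself on any platform. The sketch below is only an illustration under that assumption; the helper names utf16_to_utf8 and count_chars are hypothetical, and the input is assumed to carry a byte order mark. It also shows a pattern treating each decoded character as a single unit.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode UTF-16 bytes (with BOM) as UTF-8 bytes.
sub utf16_to_utf8 {
    my ($utf16_bytes) = @_;
    my $chars = decode('UTF-16', $utf16_bytes);   # bytes -> characters
    return encode('UTF-8', $chars);               # characters -> UTF-8 bytes
}

# Hypothetical helper: count how many times "." matches in a decoded string.
# On a character string, each match is one character, not one byte.
sub count_chars {
    my ($chars) = @_;
    my $n = () = $chars =~ /./gs;
    return $n;
}

my $utf16 = encode('UTF-16', "\x{263A}!");        # simulated UTF-16 input
my $utf8  = utf16_to_utf8($utf16);
printf "%d bytes as UTF-8, %d characters\n",
    length($utf8), count_chars(decode('UTF-8', $utf8));
```

The point of the design is to keep bytes and characters apart: decode on the way in, run patterns on characters, and encode again only when writing out.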

