“ and ” (and friends) in sourcecode

by skazat (Chaplain)
on Dec 28, 2005 at 21:00 UTC (#519633)

skazat has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

What's a good practice for putting non-ASCII characters in source code? Things like "smart" quotes (“ ”) and friends? Is it safe to simply type them into the code itself?

I'm basically using them in regexes to replace them with ASCII stand-ins and HTML entities.



-justin simoni
skazat me

Replies are listed 'Best First'.
Re: “ and ” (and friends) in sourcecode
by japhy (Canon) on Dec 28, 2005 at 21:41 UTC
    There's always
    use charnames ':full';
    print "He said, \N{LEFT DOUBLE QUOTATION MARK}Never!\N{RIGHT DOUBLE QUOTATION MARK}";

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: “ and ” (and friends) in sourcecode (Windows-1252)
by tye (Sage) on Dec 28, 2005 at 22:16 UTC

    Note that those characters aren't in Latin-1 but are in Windows-1252. So, when using an 8-bit character encoding, you might see those characters displayed correctly in any number of places (including many things not at all related to Microsoft, and things that you'd expect to comply strictly with standards and that claim to be using Latin-1) and yet still find plenty of other things that don't agree that you have those characters correctly encoded.

    Since Perl source code is still most often in an 8-bit character encoding (not UTF-8), putting such characters directly into your code certainly has a chance of not always working. You might be better off hard-coding both the Windows-1252 byte values and the UTF-8 byte sequences for such characters so that you'll catch them either way.
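
A minimal sketch of the "catch them either way" idea, assuming the input is a raw, undecoded byte string. The helper name ascii_quotes is made up for illustration. It is a heuristic: the bytes 0x91 to 0x94 can also occur as continuation bytes of other UTF-8 characters, so it is only safe when curly quotes are the only non-ASCII content you expect.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace curly quotes in a raw,
# undecoded byte string with their ASCII stand-ins.
sub ascii_quotes {
    my ($bytes) = @_;

    # UTF-8 byte sequences for U+2018/U+2019 and U+201C/U+201D.
    $bytes =~ s/\xE2\x80[\x98\x99]/'/g;
    $bytes =~ s/\xE2\x80[\x9C\x9D]/"/g;

    # Windows-1252 single bytes 0x91-0x94.
    $bytes =~ s/[\x91\x92]/'/g;
    $bytes =~ s/[\x93\x94]/"/g;

    return $bytes;
}

# One pair of quotes in each encoding; both come out as plain ASCII quotes.
print ascii_quotes("\x93Never!\x94 and \xE2\x80\x9CNever!\xE2\x80\x9D"), "\n";
```

The source stays pure ASCII because every non-ASCII byte is written as a \x escape.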

    - tye        

Re: “ and ” (and friends) in sourcecode
by TedPride (Priest) on Dec 28, 2005 at 21:09 UTC
    Works for me. I suppose you could always output all the characters from 0 to 255 with their hex codes and then use those instead, but why bother if the other method works? You just have to be careful of characters that mean something in a regex.

    Granted, when I did this, it was for personal use.

    EDIT: A module for this would be helpful, if anyone can suggest one. Something that translates useful but non-standard characters to entities, and removes a user-defined set of characters entirely.
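
For ready-made entity encoding, HTML::Entities on CPAN provides encode_entities. The hand-rolled sketch below uses only core Perl and shows the two behaviours asked for: mapping chosen characters to entities and deleting a user-defined set. The name clean_text and the contents of %entity_for are illustrative assumptions, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;

# Illustrative mapping only; extend it to whatever characters matter to you.
my %entity_for = (
    "\x{2018}" => '&lsquo;',
    "\x{2019}" => '&rsquo;',
    "\x{201C}" => '&ldquo;',
    "\x{201D}" => '&rdquo;',
    "\x{2014}" => '&mdash;',
);

# Hypothetical helper: $text is a decoded character string, $strip is a
# string holding every character that should be deleted outright.
sub clean_text {
    my ($text, $strip) = @_;

    # Delete the user-defined characters first.
    if (defined $strip && length $strip) {
        my $class = quotemeta $strip;
        $text =~ s/[$class]//g;
    }

    # Then swap each mapped character for its entity.
    my $mapped = join '', keys %entity_for;
    $text =~ s/([$mapped])/$entity_for{$1}/g;

    return $text;
}

# Curly quotes become entities; the trademark sign (U+2122) is removed.
print clean_text("\x{201C}Hi\x{201D}\x{2122}", "\x{2122}"), "\n";
```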

Re: “ and ” (and friends) in sourcecode
by pboin (Deacon) on Dec 29, 2005 at 00:03 UTC

    Putting "non-ASCII" and "source code" together sounds like you're just asking for trouble. Consider that not only does your code have to execute correctly, but it has to be documented, emailed, printed, and put into version control.

    Have you considered using different notation (like hex) to pick up those values? If you did, you'd still have a correct regex and at the same time have ASCII source.

    If it were me, I'd try *real* hard to make that happen.
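
As a sketch of the hex-notation approach, the regexes below name the curly quotes by code point, so the source file itself contains nothing but ASCII. The helper name straighten_quotes is hypothetical, and it assumes the text has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical helper: expects a decoded character string.
sub straighten_quotes {
    my ($text) = @_;
    $text =~ s/[\x{2018}\x{2019}]/'/g;   # single curly quotes
    $text =~ s/[\x{201C}\x{201D}]/"/g;   # double curly quotes
    return $text;
}

print straighten_quotes("\x{201C}It\x{2019}s fine,\x{201D} she said."), "\n";
```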

Re: “ and ” (and friends) in sourcecode
by dragonchild (Archbishop) on Dec 29, 2005 at 01:21 UTC
    There are two scenarios in which you might have smart-quotes in your source code:
    1. Your editor put them there because its programmers thought they were smarter than you.
    2. You put them there within some string in your source code.

    In scenario 1, you need to find the settings that will affect that and fix them. The best way to do that is to change editors. Both emacs and vim are available for Windows, as well as many other excellent editors written for programmers. Pick one, learn it, and use it. I like vim, but that's just me. YMMV.

    In scenario 2, if you're having issues, just change your quoting operator. Instead of ', use q{}. Instead of ", use qq{}.
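
A short illustration of the two quoting operators, with made-up variable names. Neither string needs a backslash, even though each contains both kinds of ASCII quote.

```perl
use strict;
use warnings;

my $name = 'skazat';

# q{} behaves like single quotes: no interpolation, and neither quote
# character needs escaping.
my $literal = q{It's a "literal" $name};

# qq{} behaves like double quotes: $name is interpolated.
my $greeting = qq{Hello, "$name", it's you};

print "$literal\n$greeting\n";
```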

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: “ and ” (and friends) in sourcecode
by Errto (Vicar) on Dec 29, 2005 at 04:44 UTC
    As mentioned, non-ASCII characters are allowed in source files, but they will be interpreted according to your current locale settings (in Windows terms, the "ANSI codepage"). But copy that file to another machine and who knows what might happen. One safe way to deal with this, as others have mentioned, is to use Unicode escapes for non-ASCII characters. Another is to explicitly set the character encoding for your source file with the encoding pragma. But be aware that this pragma has some caveats so read its documentation carefully.
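
The same "who knows what might happen" problem applies to any bytes whose encoding is implied rather than stated. Later Perl releases deprecated the encoding pragma, so this sketch uses the core Encode module instead to show what naming the encoding explicitly looks like. The sample bytes and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a Windows-1252 source: 0x93 ... 0x94.
my $bytes = "\x93Never!\x94";

# Name the encoding explicitly instead of trusting the local default.
my $chars = decode('cp1252', $bytes);

# The first character is now U+201C, LEFT DOUBLE QUOTATION MARK.
printf "U+%04X\n", ord $chars;

# Re-encode explicitly when the text leaves the program.
my $utf8_bytes = encode('UTF-8', $chars);
print length($bytes), " bytes in, ", length($utf8_bytes), " bytes out\n";
```

Each curly quote is one byte in Windows-1252 and three bytes in UTF-8, which is why the two lengths differ.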
Re: “ and ” (and friends) in sourcecode
by planetscape (Chancellor) on Dec 29, 2005 at 09:21 UTC
Re: “ and ” (and friends) in sourcecode
by Celada (Monk) on Dec 29, 2005 at 20:58 UTC

    Don't be afraid to put non-ASCII characters in your source code. Use UTF-8 (which you are probably already doing anyway, since the curly quotes aren't available in many other encodings) and put this pragma in your source code:

    use utf8;

    This is encouraged! From the utf8 manpage:

    The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). The "no utf8" pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope.

    This pragma is primarily a compatibility device. Perl versions earlier than 5.6 allowed arbitrary bytes in source code, whereas in future we would like to standardize on the UTF-8 encoding for source text.
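
A small sketch of what the pragma buys you, assuming the file is saved as UTF-8. The helper name plain_quotes is made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # tells perl this source file is encoded as UTF-8

# Hypothetical helper: map literal curly quotes to ASCII quotes.
sub plain_quotes {
    my ($text) = @_;
    $text =~ tr/“”‘’/""''/;
    return $text;
}

my $quote = "He said, “Never!”";

# With "use utf8" each curly quote is a single character, so this prints 17.
# Without the pragma each quote would be counted as three separate bytes.
print length($quote), "\n";
print plain_quotes($quote), "\n";
```

If the result still contains non-ASCII characters when you print it, add an output layer such as binmode(STDOUT, ':encoding(UTF-8)') first.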
