http://qs321.pair.com?node_id=44722

{from an alt.perl post I just made, reposted here to solicit feedback from fellow monks...}

>>>>> "Makhno" == Makhno <mak@imakhno.freeserve.co.uk> writes:

Makhno> I'm thinking of writing a GUI Perl-syntax-aware editor, and
Makhno> wondering what's the best way to parse perl? Highlighting
Makhno> reserved words is easy (using, eg, index()) but identifying
Makhno> things like comments is a bit more difficult.
Makhno> A regex like /#.*\n/ will catch comments when they are used
Makhno> simply, ie:
Makhno> print "hello\n"; #print hello
Makhno> but will get it wrong when the '#' is used as part of a regex
Makhno> (or in a string)
Makhno> s#hello#goodbye#;
Makhno> print "will behave like a #comment";
Makhno> Does anybody have any ideas on how I go about parsing perl
Makhno> syntax in such a way, before I go to a lot of potentially
Makhno> unnecessary work?

Perl is extremely difficult to parse. In fact, some would say impossible.

One thing that makes it difficult is the dual nature of a half dozen characters like "/". If that / is being used in a place that's expecting an operator, it's divide. If it's being used in a place that's expecting an operand, it's the beginning of a regular expression. So you have to keep track at all times of whether you're looking for an operator or an operand.
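
To make that operator/operand bookkeeping concrete, here is a minimal sketch of a toy scanner. It is not how perl's own tokenizer works; the sub name classify_slashes and the single $expect_term flag are made up for illustration, and it ignores strings, comments, and every other quoting construct.

```perl
use strict;
use warnings;

# Toy scanner: label each '/' as 'divide' or 'regex' from one bit of state.
# $expect_term is true when the next token should be an operand.
# (Hypothetical helper for illustration; deliberately incomplete.)
sub classify_slashes {
    my ($src) = @_;
    my $expect_term = 1;               # a statement starts with an operand
    my @kinds;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if ($src =~ /\G\s+/gc) {
            next;                      # whitespace changes nothing
        }
        elsif ($src =~ /\G(?:\$\w+|\d+|\))/gc) {
            $expect_term = 0;          # variable, number or ')' ends an operand
        }
        elsif ($src =~ /\G[A-Za-z_]\w*/gc) {
            $expect_term = 1;          # bareword: GUESS that an argument follows
        }
        elsif ($src =~ m{\G/}gc) {
            if ($expect_term) {
                push @kinds, 'regex';
                $src =~ m{\G(?:\\.|[^/\\])*/\w*}gc;   # skip body, '/', flags
                $expect_term = 0;
            }
            else {
                push @kinds, 'divide';
                $expect_term = 1;
            }
        }
        else {
            $src =~ /\G./gcs;          # any other character: treat as an operator
            $expect_term = 1;
        }
    }
    return @kinds;
}

print join(',', classify_slashes('$x / 2')), "\n";      # divide
print join(',', classify_slashes('$y = /foo/')), "\n";  # regex
print join(',', classify_slashes('time / 25')), "\n";   # regex (perl itself divides here)
```

The last call is the interesting one: the scanner has to guess what follows a bareword, and for time it guesses wrong. Getting it right needs knowledge about each name that the source text alone does not carry.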

"No problem", you say? Quick... for the following, play the game of "regex or divide?"

sin / ...
time / ...
localtime / ...
caller / ...
eof / ...
Got those right? How about these?
use constant FOO => 35;
FOO / ...

use Fcntl qw(LOCK_SH);
LOCK_SH / ...
OK, and now some of your own:
sub no_args ();
sub one_arg ($);
sub normal (@);

no_args / ...
one_arg / ...
normal / ...
Got those too? How about these (same problem, different file):
use Random::Module qw(aaa bbb ccc);
aaa / ...
bbb / ...
ccc / ...
A little harder, eh? So now you have to parse OUTSIDE the file to get your answer. And as if that wasn't enough, let's get weird:
BEGIN {
  eval (time % 2 ? 'sub zany ();' : 'sub zany (@);');
}
zany / ...
Quick, was that last one a divide or a regex start?

Why does it matter? Look at this:

sin / 25 ; # / ; die "this dies!";
time / 25 ; # / ; die "this doesn't die";
The first one computes the sine of the true/false value obtained by matching " 25 ; # " against $_, and then it dies. The second one computes the time of day divided by 25, then ignores the comment.
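
A quick way to check this is to hand both lines to perl and see which one reaches its die. This is only a sketch: the helper name slash_starts_regex and the "REGEX" marker string are made up for illustration, and it answers the question by running perl, not by parsing.

```perl
use strict;
use warnings;

# Returns 1 if perl parsed the '/' after $word as the start of a regex,
# 0 if it parsed it as division. (Hypothetical helper name.)
sub slash_starts_regex {
    my ($word) = @_;
    local $_ = '';                     # target for the match, if there is one
    local $@;
    no warnings;                       # silence the "ambiguous" parser warning
    # Regex case: the die is live code. Divide case: it sits inside a comment.
    eval qq{$word / 25 ; # / ; die "REGEX\\n";};
    return $@ eq "REGEX\n" ? 1 : 0;
}

for my $word (qw(sin time)) {
    printf "%-4s => %s\n", $word, slash_starts_regex($word) ? 'regex' : 'divide';
}
# sin  => regex
# time => divide
```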

Starting to see the trouble?

This leads people to say "the only thing which can parse Perl (the language) is perl (the binary)". Maybe not for Perl6. But for the Perl we know and can use today, certainly so.

-- Randal L. Schwartz, Perl hacker

Replies are listed 'Best First'.
Re: On Parsing Perl
by quidity (Pilgrim) on Dec 04, 2000 at 04:40 UTC

    I do most of my Perl coding using CPerl mode for XEmacs, and although it is very good at spotting syntax it is often horribly wrong, especially when odd quoting characters or POD are brought into the equation. eval is even worse. I'd advise anyone even thinking of trying to parse Perl to look at what can be achieved, and then either improve on that (to the benefit of everyone) or just give up.

    I do sometimes find myself choosing a particular way of coding over another (possibly better) way because the second breaks the pretty printing, and I want others using the same editor to be able to read the code I write.

      I hate to say this, but even as an XEmacs fan, cperl+gemacs is far superior. Recent GNU Emacs has some extra stuff that XEmacs doesn't have, that allows cperl to do some really amazing things. (I have vague memories of the "extra stuff" being multiple syntax transition tables for each character, so you can gracefully handle things like m!!x and other non-standard delimiters. But I could be totally wrong.)
Re: On Parsing Perl
by repson (Chaplain) on Dec 04, 2000 at 07:28 UTC
    You could use B::Deparse for some of it, which eliminates some of what you don't want, but even that fails on many other things. The best bet is to code for the majority of Perl and leave programmers to use their heads for the rest. This is what I do with syntax highlighting in vim: I use it generally but don't believe it for a moment. It is still sometimes helpful anyway. This is the way it will have to stay for now, at least until Perl6...
Re: On Parsing Perl
by toadi (Chaplain) on Dec 04, 2000 at 13:49 UTC
    I'm with you, merlyn. I use vim (*nix) and TextPad (Windows), and both make some mistakes. With some regex syntax, for example, they do weird things...


    --
    My opinions may have changed,
    but not the fact that I am right

Re: On Parsing Perl
by nop (Hermit) on Dec 04, 2000 at 21:21 UTC
    I use the perl mode on emacs, and resort to small tricks to keep everything ok. For example,
    s/'"/;
    upsets the syntax colorization badly (as emacs thinks following code is in the string), so I use idioms like
    s/'"/; #"'
    to "close" my "open" strings....
Eight years later...
by samwyse (Scribe) on Jan 13, 2009 at 19:08 UTC
    I decided to run this test script under various versions of Perl.
    @examples = split /\n/, <<'EXAMPLES';
    sin / ...
    time / ...
    localtime / ...
    caller / ...
    eof / ...
    use constant FOO => 35; FOO / ...
    use Fcntl qw(LOCK_SH); LOCK_SH / ...
    sub no_args (); sub no_args{1}; no_args / ...
    sub one_arg ($); sub one_arg{1}; one_arg / ...
    sub normal (@); sub normal{1}; normal / ...
    EXAMPLES
    for (@examples) {
        s=\.\.\.=25 ; # / ; die "this dies!";=;
        local($a) = eval;
        $a = $@ if $@;
        print "$_\n\t$a\n";
    }
    I don't know what the results would be for earlier versions, but from Perl 5.6 onwards it's pretty consistent.
    Example    5.006        5.008        5.010
    sin        dies         dies         dies
    time       49274891.72  49274891.74  49274891.76
    localtime  dies         dies         dies
    caller     dies         dies         dies
    eof        dies         dies         dies
    FOO        1.4          1.4          1.4
    LOCK_SH    dies         dies         0.04
    no_args    dies         dies         dies
    one_arg    dies         dies         dies
    normal     dies         dies         dies
    Most of the "dies" instances also produced this message: Warning: Use of "XXX" without parentheses is ambiguous at (eval N) line 1. However, the LOCK_SH example never generated errors, while the last three generated "Prototype mismatch" messages. I must also note that almost all of the examples generated warnings, despite using neither the '-w' option nor 'use strict;'.
Re: On Parsing Perl (Once upon a time)
by Anonymous Monk on Jul 11, 2022 at 11:27 UTC
    I'm currently working on something (basically a Perl parser), and apart from the BEGIN block, everything seems parseable (is that the right word?) using some simple LL grammars. Or am I just too uneducated?
      everything seems parseable (is that the right word?) using some simple LL grammars.

      No, only perl (the interpreter) can parse all of Perl (the language). See my node here for details.

      Edit: added emphasis.

      Static parsing is only reliable, if you rule out or control all imported subs, because prototypes change the way Perl is parsed. See HaukeX's other reply.

      Basically changes at compile time ( see BEGIN blocks ) can change the parser.
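
As a rough illustration of the prototype point, here is a sketch that asks the running perl for each prototype. The sub names (no_args, normal, slash_is_divide_after) are hypothetical, and the rule it encodes (empty prototype means a following / is division) is only a rule of thumb, not the parser's full logic. It works only because the subs have already been compiled, which is exactly the information a static parser lacks.

```perl
use strict;
use warnings;

sub no_args () { 1 }    # empty prototype: takes no arguments
sub normal  (@) { 1 }   # list prototype: takes arguments

# Rule of thumb (an assumption, not perl's real algorithm): after a name
# with an empty prototype the parser expects an operator, so '/' divides.
sub slash_is_divide_after {
    my ($name) = @_;
    my $proto = prototype($name);      # undef if there is no prototype
    return defined($proto) && $proto eq '' ? 1 : 0;
}

for my $name (qw(CORE::time CORE::sin main::no_args main::normal)) {
    printf "%-14s %s\n", $name,
        slash_is_divide_after($name) ? 'divide' : 'regex';
}
```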

      Dynamic parsing is possible, though, if you inspect the op-tree after compilation; that's the basic idea of some newer tools, like the perlnavigator.

      See also perl -c in perlrun or B::Xref

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Static parsing is only reliable, if you rule out or control all imported subs, because prototypes change the way Perl is parsed.

        I think, though, that prototypes aren't the only reason Perl isn't statically parseable. There are quite a few heuristics that the parser uses that aren't all too well documented, and I'm not sure if a static parser would be able to reimplement all of them. And then there is no strict code, which I think gets even trickier. At some point I was considering researching and making a list of all of the reasons, but I unfortunately never got around to it.

Re: On Parsing Perl
by gaggio (Friar) on Dec 04, 2000 at 04:35 UTC
    I don't know if I am with you there. What is the Perl executable doing when it executes a script? Isn't that called parsing also?
      Stunning reading comprehension there. I wonder if such clueless folk ever come back and read their comments years later and feel a twinge of embarrassment.