Perl is extremely difficult to parse. In fact, some would say impossible.
>>>>> "Makhno" == Makhno <mak@imakhno.freeserve.co.uk> writes:
Makhno> I'm thinking of writing a GUI Perl-syntax-aware editor, and
Makhno> wondering what's the best way to parse perl? Highlighting
Makhno> reserved words is easy (using, eg, index()) but indentifying
Makhno> things like comments is a bit more difficult.
Makhno> A regex like /#.*\n/ will catch comments when they are used
Makhno> simply, ie:
Makhno> print "hello\n"; #print hello
Makhno> but will get it wrong when the '#' is used as part of a regex
Makhno> (or in a string)
Makhno> s#hello#goodbye#;
Makhno> print "will behave like a #comment";
Makhno> Does anybody have any ideas on how I go about parsing perl
Makhno> syntax in such a way, before I go to a lot of potentially
Makhno> unnecessary work?
One thing that makes it difficult is the dual nature of a half dozen characters like "/". If that / is being used in a place that's expecting an operator, it's divide. If it's being used in a place that's expecting an operand, it's the beginning of a regular expression. So you have to keep track at all times of whether you're looking for an operator or an operand.
"No problem", you say? Quick... for the following, play the game of "regex or divide?"
    sin / ...
    time / ...
    localtime / ...
    caller / ...
    eof / ...
Got those right? How about these?
    use constant FOO => 35;
    FOO / ...
    use Fcntl qw(LOCK_SH);
    LOCK_SH / ...
OK, and now some of your own:
    sub no_args ();
    sub one_arg ($);
    sub normal (@);
    no_args / ...
    one_arg / ...
    normal / ...
Got those too? How about these (same problem, different file):
    use Random::Module qw(aaa bbb ccc);
    aaa / ...
    bbb / ...
    ccc / ...
A little harder, eh? So now you have to parse OUTSIDE the file to get your answer. And as if that wasn't enough, let's get weird:
    BEGIN { eval (time % 2 ? 'sub zany ();' : 'sub zany (@);'); }
    zany / ...
Quick, was that last one a divide or a regex start?
Why does it matter? Look at this:
    sin / 25 ; # / ; die "this dies!";
    time / 25 ; # / ; die "this doesn't die";
The first one is computing the sin of the true/false value gotten by matching " 25 ; # " against $_. Then it dies. The second one is computing the time of day divided by 25, then ignoring the comment.
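The difference can be checked by handing the snippets to perl itself. The following is only a sketch: `try_snippet` is a hypothetical helper, the `die` message is made up for the test, and the two subs reuse the `no_args` and `normal` names from the prototype example with bodies invented for illustration. Each snippet goes through a string eval; if the `die` runs, the `/` started a regex, and if it does not, the `/` was a divide and the `die` was inside a comment.

```perl
use strict;
use warnings;

# Two user-defined subs that differ only in prototype. The bodies are
# invented for this sketch; only the prototypes matter to the parser.
sub no_args ()  { 50 }
sub normal  (@) { scalar @_ }

# Hypothetical helper: evaluate a snippet, return (value, exception text).
sub try_snippet {
    my ($code) = @_;
    no warnings 'void';    # the snippets deliberately discard values
    local $_ = '';         # something defined for a stray match to test
    my $value = eval $code;
    my $error = $@;
    return ($value, $error);
}

my %result;
for my $name (qw(sin time no_args normal)) {
    my (undef, $error) =
        try_snippet(qq{$name / 25 ; # / ; die "reached die";});
    $result{$name} =
          $error =~ /^reached die/ ? 'regex start (the die ran)'
        : $error eq ''             ? 'divide (the die was a comment)'
        :                            "unexpected error: $error";
    print "$name: $result{$name}\n";
}
```

Note that the answer for `no_args` and `normal` comes from their declarations, not from anything visible at the point of the `/`.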
Starting to see the trouble?
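To make the trouble concrete, here is a toy sketch of the operator-or-operand bookkeeping described earlier. It is not how perl's tokenizer is written; `classify_slash` and both lookup tables are assumptions made up for this example, and the tables are deliberately tiny. The classifier looks only at the token before the `/`, which is enough for a few builtins and not enough for anything whose meaning is declared elsewhere.

```perl
use strict;
use warnings;

# Hand-written tables (assumptions for this sketch, far from complete).
my %NO_ARG_BUILTIN = map { $_ => 1 } qw(time wantarray);  # a term is complete
my %NAMED_OP       = map { $_ => 1 } qw(sin cos split);   # an operand may follow

# Classify a "/" using only the token that precedes it.
sub classify_slash {
    my ($prev) = @_;
    return 'regex' if !defined $prev;    # start of input: operand expected

    # A number, a scalar variable, or a closing bracket just ended a term.
    return 'divide' if $prev =~ /^(?:\d+(?:\.\d+)?|\$\w+|[)\]])$/;
    return 'divide' if $NO_ARG_BUILTIN{$prev};

    # After an operator, an opening bracket, or a named operator, perl is
    # looking for an operand, so "/" begins a regex.
    return 'regex' if $NAMED_OP{$prev};
    return 'regex' if $prev =~ /^(?:=~|!~|&&|\|\||[-+*=,;(\{!])$/;

    # Any other bareword (constant, imported sub, user sub) depends on
    # declarations that this function cannot see.
    return 'unknown';
}

for my $prev ('$x', ')', '42', 'time', 'sin', '=~', 'FOO', 'zany') {
    printf "%-5s / ...  => %s\n", $prev, classify_slash($prev);
}
```

The `unknown` branch is the whole point: `FOO`, `LOCK_SH`, `aaa` and `zany` all land there, and resolving them means reading other files or running code.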
This leads people to say "the only thing which can parse Perl (the language) is perl (the binary)". Maybe not for Perl6. But for the Perl we know and can use today, certainly so.
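One way to act on that, sketched here under the assumption that compiling the code is acceptable for your tool, is to let perl parse a snippet and then print back what it understood, using the core B::Deparse module. `how_perl_read_it` is a hypothetical helper name, and the exact text Deparse prints varies between perl versions, so treat the output as a diagnostic rather than a stable format. Be aware that compiling Perl can execute code (BEGIN blocks, `use` statements), so this is not safe for untrusted input.

```perl
use strict;
use warnings;
use B::Deparse;

my $deparse = B::Deparse->new;

# Hypothetical helper: compile a snippet as an anonymous sub (without
# calling it) and return perl's own rendering of the parsed code.
sub how_perl_read_it {
    my ($snippet) = @_;
    no warnings 'void';    # the snippets deliberately discard values
    # The newline keeps a trailing "#" comment from swallowing the brace.
    my $sub = eval "sub { $snippet\n}"
        or die "snippet did not compile: $@";
    return $deparse->coderef2text($sub);
}

my $sin_text  = how_perl_read_it(q{sin / 25 ; # / ; die "this dies!";});
my $time_text = how_perl_read_it(q{time / 25 ; # / ; die "this doesn't die";});

print "sin snippet as perl parsed it:\n$sin_text\n\n";
print "time snippet as perl parsed it:\n$time_text\n";
```

In the first rendering the `die` survives as a real statement; in the second it is gone, because perl treated it as part of a comment.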
-- Randal L. Schwartz, Perl hacker