Positions of certain tokens in syntax tree

rubystallion has asked for the wisdom of the Perl Monks concerning the following question:

I know how to do basic regexp-based parsing with (?(DEFINE)...) blocks and named groups that recurse into those definitions, like in the example in the perlre man page. I think after such a parsing match only information about the top-level tokens is available, but I would like to be able to modify the lower-level tokens in the text I'm matching against. So let's say I have a parsing regexp that at the top level can contain statements or expressions, and these can contain identifiers, and I want to change all identifiers to upper case in the text (not only in the parse tree), how would I do that? Would it be better to use a module for that?

Edit as suggested by Rolf:
I got this idea when my colleague complained that Golang forces casing conventions on users and I noticed that I don't know enough about regexes to, say, change just the case of identifiers. And I thought, if I can get the Go grammar into Perl, then I might sometimes have some scripting fun at my day job. I've seen the regexp JSON parser and I like how close this looks to an ANTLR grammar and I guess if using a module for tree data structures, then even the $^R manipulations might become readable, though I've never seen an example of that? If I went down that route, then I could probably use some trickery with using \substr instead of $^N to save substring references to the original string into the $^R syntax tree, though I can see that becoming an unmaintainable mess. Using m/\G.../gc to do the parsing seems like a lot of typing for a grammar that is already several hundred lines in ANTLR form. Here's a simple code example of a parsing regexp. And here is the Go ANTLR grammar that I would like to translate.

#!/usr/bin/perl -w
use v5.20;
use re 'strict';

$_ = 'my$a=5;';
my $re = qr/
    (?<statements>(?&statement)*)
    (?(DEFINE)
      (?<statement> my \$ (?&identifier) = \d ; )
      (?<identifier> [a-z] )
    )/x;
$_ =~ $re;
say $+{statements};
# Assuming I do the $^R and $^N stuff to create a parse tree, then it 
+would be nice, if I could write something like:
# ${$r->{statements}[0]->{identifier}} = 'b';
# in order to change $_ to "my$b=5;"
# similar in spirit to substring references:
# my $s = \substr $_, 3, 1;
# $$s = 'b';
[download]

Comment on Positions of certain tokens in syntax tree Download Code

Replies are listed 'Best First'.
Re: Positions of certain tokens in syntax tree by LanX (Saint) on Dec 14, 2019 at 00:50 UTC
I'm guessing now ... Here a "naive" approach by embedding Perl code. When parsing a recursive syntax you are able to put a Perl code after each match pattern using `(?{...})` and `$^N` will give you the content of this last match. Compare this example from `perlre` `$_ = "The brown fox jumps over the lazy dog"; /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i; print "color = $color, animal = $animal\n";` [download] instead of storing the match you might concatenate it to a "result" string or print it to a channel > and I want to change all identifiers to upper case in the text depending on match pattern execute a `uc` before returning the result. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply] [d/l] [select]
Re^2: Positions of certain tokens in syntax tree by rubystallion (Novice) on Dec 14, 2019 at 08:43 UTC
Hi Rolf, thanks for your reply. I've hopefully made the question clear now, let me know if you have any questions. I think your solution might work. I'm thinking how best to write the parsing regexp in a generally useful way, so that I can use it in the future for other tasks I might think about. Maybe I should both build a result string and a parse tree within the regexp's embedded code then.	[reply]
Re: Positions of certain tokens in syntax tree by LanX (Saint) on Dec 14, 2019 at 00:02 UTC
I have a faint idea what you mean, but it would be much easier if you provided some examples and links. Such illustrations help to avoid misunderstandings and speculative answers. Please see also Short, Self-Contained, Correct Example Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply]
Re: Positions of certain tokens in syntax tree (updated) by LanX (Saint) on Dec 14, 2019 at 13:10 UTC
Some thoughts on the update: the example doesn't require recursive parsing it's not clear why (DEFINE) should be preferred offer interpolating $vars holding regex objects generated with `qr` it's not clear why using one big regex should be preferred over using splitted regex with /gc modifiers inside a Perl control flow This board is not only about answering questions but also about creating good educating reads for the archive. :) A good example highlighting the desired outcome including tests and with links to the implied technologies (Perl docs, WP pages, Cpan...) could spawn an interesting thread which could lead to a semi tutorial. Sorry if this sounds fuzzy, but TIMTOWTDI and too many parser generator modules addressing this area°. If you want a "generic" solution you need to explain more. :) Update °) for instance like listed here: https://perl-begin.org/uses/text-parsing/ Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply]
Re^2: Positions of certain tokens in syntax tree (updated) by rubystallion (Novice) on Dec 15, 2019 at 13:39 UTC
That's a good suggestion, I'll experiment with the alternatives a bit and see if I can come up with a good example including tests and links. I'll still have to flesh out the desired outcome. I assume that many refactorings could be implemented by just knowing the AST annotated with source positions and I would like to try to write something like that in Perl. Now I notice that using substring references actually wouldn't be a good idea, because assigning a different-length string to one of the references would invalidate the other references in that they wouldn't point to the token any more. I'll see if I can find out how refactoring tools in major IDEs do this, I'm not sure if they keep code formatting intact or if they just pretty-print the new AST in a canonical way. About your thoughts: While the example doesn't require recursive parsing, the Go grammar does require some recursive parsing, for example a statement parser can invoke a block parser, which then again invokes the statement parser. When I update the post with a clear desired outcomes and tests, I will update the example to better match the complexity of the task, while still keeping it small. Because of the recursive calls it's necessary to use (DEFINE) instead of interpolating $vars The trade-offs I see is that with one big regex it's easier to see at a glance how the grammar rules are connected, especially with syntax highlighting, whereas with splitted regexes it's possible to set breakpoints in specific grammar rules for debugging and it's not such an esoteric solution, so it's better suited for writing widely readable code. For reference, I will explain my understanding of the embedded code in the big regex for parsing JSON. I will describe the data structure returned by the embedded code as a tree to make the explanation clearer: After returning from each rule, the payload (result of the last embedded code expression) is set to the previous payload in the left subtree ($^R->[0]) and the rule's payload in the right subtree ($^R->1) In order to make this invariant hold: the payload for a rule with 0 or more matches of a subrule must be set at the rule's beginning for the 0-matches case when combining payloads from multiple subrule matches (from a Kleene star or from a rule having multiple subrules), the right branches of the tree must be combined in the right subtree and the left subtree is set to the only left branch in the tree	[reply]
Re^3: Positions of certain tokens in syntax tree (updated) by LanX (Saint) on Dec 16, 2019 at 23:57 UTC
Hi sorry I'm too busy right now to dive deeper into your thoughts, but > When I update the post with a clear desired outcomes and tests, I will update the example to better match the complexity of the task, while still keeping it small. Please don't. If you want to be seen, either reply to the OP or start a new thread while referencing this. BTW: Merlyn's JSON parser has embedded Perl code for debugging, it was just out-commented. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.


Do you know where your variables are?
	PerlMonks

Positions of certain tokens in syntax tree

Update