Object Query Languages and Parsers

by TedYoung (Deacon)
on Apr 21, 2006 at 20:10 UTC ( [id://544977] )

TedYoung has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Fellow Monks,

I am considering writing a parser and wanted some advice. Before I propose my main questions, I want to avoid a potential XY Problem and tell you what I need to accomplish.

For the last five years, I have developed and maintained a Perl object persistence layer. It is one of those APIs that lets you define fields in a Perl object and takes care of all the database stuff. I also developed a query language for it. The language is quite powerful with a minimal amount of syntax.

So, all this is working great for us. However, I want to extend the query language (QL), and its current implementation doesn't lend itself well to that. :-)

First, I started looking for existing Object Query Language standards and/or implementations that, perhaps, I could use instead of rolling my own. Well, this area of the industry is still pretty underdeveloped. I only found a few standards, and they had rather limited functionality in contrast to our current implementation (e.g. JDO and J2EE's CMP). So, my first question: does anyone know of any good language standards or (better) actual implementations that I should look at?

Back to enhancing our current QL: the current QL compiler basically runs regexes against a QL string, looking for field-navigation instances (e.g. "department.managers.firstName") and replacing them with the corresponding database fields, joining in related tables where necessary.
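For the curious, that technique looks roughly like this. This is a much-simplified, hypothetical sketch, not the actual compiler; the %schema metadata and the join bookkeeping are invented for illustration:

    use strict;
    use warnings;

    # Invented metadata mapping a navigation path to its SQL column
    # and the joins needed to reach it.
    my %schema = (
        'department.managers.firstName' => {
            column => 'm.first_name',
            joins  => [
                'JOIN departments d ON d.id = e.dept_id',
                'JOIN managers    m ON m.dept_id = d.id',
            ],
        },
    );

    sub compile_ql {
        my ($ql) = @_;
        my @joins;
        # Find dotted navigation paths and swap in the database column,
        # collecting the joins each path requires.
        $ql =~ s{ ( \w+ (?: \. \w+ )+ ) }{
            my $info = $schema{$1} or die "unknown path: $1";
            push @joins, @{ $info->{joins} };
            $info->{column};
        }gex;
        return ( $ql, \@joins );
    }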

The regexes work fine, but the current code is not very maintainable. In any case, I am interested in rewriting the compiler before I start adding the additional functionality.

So, I am looking hard at this and thinking: "Do I want to continue to use plain regexes, or completely parse the QL?" If I am going to parse, what API/technique should I use? My prioritized requirements are:

  1. Lightweight: For ease of installation and portability, I want to avoid native (XS) modules, at least for the moment.
  2. Speed: The persistence API is used mostly in a CGI environment. During a request, we run 1 to 5 queries on average. The QL statements are generally very short, rarely more than 300 characters long.
  3. Ease of Development: Well, I am lazy. I don't want to work hard. If I have to build a compiler, I would like to have fun doing it. At least, this is my lowest priority of the three. :-)

This is when I bought Higher Order Perl for some inspiration. I like the lexing techniques proposed in the book. However, a lot of the stream-oriented stuff is overkill here, since I am dealing with small strings.
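For context, the kind of one-pass lexer I have in mind looks roughly like this. This is a simplified sketch, not HOP's actual interface; the token names and patterns are invented:

    use strict;
    use warnings;

    # Invented token table; each entry is [ type, pattern anchored at \G ].
    my @token_spec = (
        [ SPACE  => qr/\G(\s+)/          ],
        [ IDENT  => qr/\G([A-Za-z_]\w*)/ ],
        [ NUMBER => qr/\G(\d+)/          ],
        [ DOT    => qr/\G(\.)/           ],
    );

    sub lex {
        my ($input) = @_;
        my @tokens;
        pos($input) = 0;
        TOKEN: while ( pos($input) < length $input ) {
            for my $spec (@token_spec) {
                my ( $type, $re ) = @$spec;
                # /gc advances pos() on success and preserves it on failure
                if ( $input =~ /$re/gc ) {
                    push @tokens, [ $type, $1 ] unless $type eq 'SPACE';
                    next TOKEN;
                }
            }
            die "lex error at offset " . pos($input) . "\n";
        }
        return @tokens;
    }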

When it comes to parsing, the book implements a recursive-descent parser. This book is a lot of fun, BTW! When I saw that, I thought of Parse::RecDescent, which would work wonders. But I have heard many times that Parse::RecDescent is slow.

Has anyone used Parse::RecDescent in a CGI context to compile an average of 3 short strings per request? If so, how was performance?

I have started on an initial re-implementation. I lex the strings nicely; there is no problem there. As I parse the tokens and convert them to SQL, I find myself writing a lot of recursion. This is why I ask whether Parse::RecDescent is slow because of its features, or because of the recursive implementation.

I also checked out Parse::Yapp. This is basically a Perl version of yacc. I thought there might be a performance benefit in generating a parser, as opposed to having the API re-interpret a grammar definition each time. However, the docs suggest that the closest Parse::Yapp gets to a stand-alone parser is copying the interpreter code into the generated module along with the grammar definition. So, on that ground, it is equal to Parse::RecDescent. Does anyone have experience with the performance of either of these?
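For reference, the Parse::Yapp flow looks roughly like this; the module name, grammar file, and lexer callback are placeholders:

    # Build time (shell): generate the parser module once from the grammar.
    #   yapp -s -m MyQLParser -o MyQLParser.pm myql.yp
    #   (-s makes it standalone by copying the driver in, as noted above)

    # Request time: no grammar interpretation, just the generated tables.
    use MyQLParser;

    my $parser = MyQLParser->new;
    my $ast = $parser->YYParse(
        yylex   => \&next_token,                  # placeholder lexer callback
        yyerror => sub { die "parse error\n" },
    );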

Normally, benchmarking would be a simple answer to all of this. But because writing a parser is a non-trivial task, it would be very difficult to implement the parser three ways, in Parse::RecDescent, Parse::Yapp, and plain regexes, just to test them. I am hoping that our community can share their ideas so that I can make a more informed decision.

Thanks for reading this far!

Update: dragonchild asked to see an example of the QL. Here are a few examples of the supported syntax. They should be pretty self-explanatory.

    # All invoices above a MinValue in a City and State
    select object() from Project::Invoice
    where lineitems.total.sum() > ?MinValue
        and billto.address.where(city = ?City and state = ?State)

    # All employees who were paid less than the value of the
    # projects for which they worked.
    select object() from Foo::Employee
    where salary > parents(Foo::Project, employees).billings.lineitems.sum()

    # All contacts whose first and second alternates are in
    # State and whose fax starts with AreaCode
    # or have referred N jobs about Value in dollars
    select object() from Bar::Contact
    where alternates[0..1].address.state = ?State
        and numbers{fax} like concat(?AreaCode, '%')
        or refferals.jobs.where(count() >= ?N
            and billings.lineitems.total.sum() >= ?Value) is not empty

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Replies are listed 'Best First'.
Re: Object Query Languages and Parsers
by kvale (Monsignor) on Apr 21, 2006 at 21:54 UTC
    Regarding object query languages, you might look at the CORBA system. As an object request broker, it needs to be fast, so the syntax is pretty simple. Of course, it may be too low-level, but it will give you ideas.

    With regard to parsing, I would first create a grammar. I find Parse::RecDescent too slow when speed is of the essence. Writing your own top-down parser should speed things up, and for a small grammar it should be easy to maintain as well.

    Top-down parsers are constructed from a grammar as follows (see the sketch below):

    • Each nonterminal in your grammar becomes a subroutine that you call
    • Alternation '|' is handled with if-then statements
    • Quantifiers like '*' and '+' are handled with loops
    • Terminal strings are matched using regexes.
    For learning about parsing and translation, I recommend the dragon book: Compilers: Principles, Techniques, and Tools, by Aho, Sethi, and Ullman.
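    A minimal sketch of that recipe, assuming a toy grammar `path : IDENT ( '.' IDENT )*` and an invented token stream:

        use strict;
        use warnings;

        # Invented token stream; a real lexer would produce this.
        my @tokens = qw( department . managers . firstName );
        my $pos    = 0;

        sub peek { $pos < @tokens ? $tokens[$pos] : undef }
        sub take { $tokens[ $pos++ ] }

        # Nonterminal => subroutine:  path : IDENT ( '.' IDENT )*
        sub parse_path {
            my @parts = ( expect_ident() );
            # The '*' quantifier becomes a loop.
            while ( defined peek() and peek() eq '.' ) {
                take();                          # consume the '.'
                push @parts, expect_ident();
            }
            return \@parts;
        }

        # Terminal => regex match.
        sub expect_ident {
            my $t = take();
            die "expected identifier\n" unless defined $t and $t =~ /^\w+$/;
            return $t;
        }

        my $path = parse_path();    # [ 'department', 'managers', 'firstName' ]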

    -Mark

Re: Object Query Languages and Parsers
by rhesa (Vicar) on Apr 21, 2006 at 23:04 UTC
    According to the Parse::RecDescent::FAQ, it's possible to precompile parsers. That might well take care of performance problems, and sounds like a very good idea for a CGI script.

    That FAQ has more info on optimizations. Apparently most of the complaints relate to big files (both grammar and input). The latter doesn't seem to be a problem in your case.
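    For reference, the precompilation the FAQ describes looks roughly like this; the grammar and class name here are invented:

        # Build time: compile the grammar once and write PreQL.pm to disk.
        use Parse::RecDescent;

        my $grammar = q{
            query : 'select' /\w+\(\)/ 'from' class
            class : /\w+(::\w+)*/
        };
        Parse::RecDescent->Precompile( $grammar, 'PreQL' );

        # Request time (in the CGI script): load the precompiled parser,
        # skipping grammar compilation entirely.
        #   use PreQL;
        #   my $parser = PreQL->new;
        #   my $tree   = $parser->query($ql_string);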

Re: Object Query Languages and Parsers
by TedYoung (Deacon) on Apr 22, 2006 at 00:11 UTC
    Another idea I have been toying with is lexing the source into an array of tokens, then compiling the token list into a string, using a single character to represent each token type. I can then run regexes against that string (effectively regexing for token patterns) and use the indexes found in @- and @+ to recover the tokens represented by the captured groups. I wrote a nice object-oriented version, complete with operator overloading, but here is the main idea of the code:

    my @tokens = ...;    # the lexed token objects
    my $tokens = join '', map { $tokenToChar{ $_->type } } @tokens;
    if ( $tokens =~ / ($IDENT) ($DOT $IDENT)* /x ) {
        # Each token is one character, so the match offsets in @- and @+
        # index directly into @tokens.
        my @captured = map { [ @tokens[ $-[$_] .. $+[$_] - 1 ] ] } 1 .. $#-;
        my ($head, $tail) = @captured;
    }

    This worked beautifully. Since a three- to four-hundred-character QL string can be represented by maybe 80 tokens, the token string is very small. All the regexes are on single chars, so backtracking is minimized. And I can use character classes [$TOK1 $TOK2] to simulate token classes. All in all, it seems like a very efficient approach. I may even be able to apply this technique to P::RD.

    Thank you rhesa. I will read through the P::RD FAQ on optimization!

    And thanks for all of the feedback so far.

    Ted Young

    ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

      Before optimizing, caching, and tokenizing, I'd advise benchmarking to see if it's worth the trouble (and obfuscation)...

      You mentioned you're running your requests through CGI, which implies there's a whole bunch of overhead going on elsewhere, which may well slow you down more than P::RD itself...

      I'm not saying you shouldn't optimize, cache, or tokenize, but you may want to make sure it would make a notable difference in performance in your production environment.
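      A quick way to check, assuming you have two candidate implementations to compare (the sub names here are invented):

          use Benchmark qw(cmpthese);

          my $ql = 'select object() from Foo::Bar';

          # Run each candidate for ~2 CPU seconds and compare rates.
          cmpthese( -2, {
              regexes => sub { compile_with_regexes($ql) },   # current approach
              recdesc => sub { $prd_parser->query($ql) },     # Parse::RecDescent
          });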

      As for how to go about optimizing, caching or tokenizing if you do choose to do it: the other monks seem to have covered all that quite sufficiently... ;)

Re: Object Query Languages and Parsers
by roboticus (Chancellor) on Apr 21, 2006 at 23:59 UTC
    TedYoung:

    Since you have control over both the language and the parser, you may want to design your language extensions to simplify and speed up parsing. Some parsers execute very quickly (such as shift-reduce parsers), so you might want to consider simplifications that would let you use a simple, efficient parsing technique.

    --roboticus

Re: Object Query Languages and Parsers
by stvn (Monsignor) on Apr 22, 2006 at 14:21 UTC

    I cannot comment on parser choices; however, it is possible to approach this in such a way that the choice becomes irrelevant.

    If execution speed is a requirement, I would suggest not recompiling the QL strings every time. Instead, pre-compile the QL into some kind of AST (abstract syntax tree), and then stash the tree (keyed by the original QL text) in a DBM::Deep file. A simple example might be:

    # original QL code:
    my $ql_code = 'select object() from Foo::Bar';

    # parsed into something like this
    my $ast = [ select => [ 'object' ], [ from => 'Foo::Bar' ] ];

    # create the query db (the file name is illustrative)
    my $query_db = DBM::Deep->new( 'queries.db' );
    $query_db->{$ql_code} = $ast;
    Then your CGI scripts would just need to load the query database (DBM::Deep is very memory-efficient, so it works quite well under CGI) and fetch the AST when they need to run the QL. This basically eliminates the parse time from the equation entirely.

    If you didn't want a separate QL-compilation step, you could compile the QL lazily: the first time you encounter a new QL statement, parse it and store the AST; the next time it is used, it won't need to be reparsed.
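    A minimal sketch of the lazy variant, where parse_ql is a placeholder for whatever parser gets chosen:

        sub get_ast {
            my ($ql_code) = @_;
            # Parse only on first sight; later requests hit the cache.
            $query_db->{$ql_code} = parse_ql($ql_code)
                unless exists $query_db->{$ql_code};
            return $query_db->{$ql_code};
        }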

    This approach is similar to Python and its .py/.pyc files. Of course, you would still have the overhead of interpreting the AST, but I would guess that is not bad compared to the parsing overhead.

    If you wanted to take it one step further, you could compile the AST into a function tailored to the specific needs of that AST (probably using some heuristics here). The compiled function would be generated as a string, so it is easily persisted; you just need to eval it each time you load it (which might eat up all of the possible performance advantage). It might look something like this:

    # assuming the query above,
    # the compiled sub might look
    # something like this
    sub {
        my $query_engine = shift;
        return $query_engine->select(
            'object',
            $query_engine->from('Foo::Bar'),
        );
    }
    This idea might be a little on the excessive side, so take it with a grain of salt.
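    A sketch of that round trip, keeping the sub as source text; the variable names are invented, and as noted, the string eval on each load may offset the win:

        # Store the generated source next to the AST.
        $query_db->{$ql_code}{code} = $code_string;

        # Later: revive the sub and run it against the engine.
        my $compiled = eval $query_db->{$ql_code}{code};
        die $@ if $@ or not $compiled;
        my $results = $compiled->($query_engine);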

    -stvn
      This idea might be a little on the excessive side, so take it with a grain of salt.

      Not really - this is what Template Toolkit does.

      Also, note that there are no serious plans for DBM::Deep to store native functions. You can use Data::Dump::Streamer to serialize the functions and eval them later. (Look at Sub::Compose for some example code.)


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Object Query Languages and Parsers
by dragonchild (Archbishop) on Apr 21, 2006 at 20:25 UTC
    While I can't help with your question, I would like to see your QL, if possible.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
