Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Idea for XPath implementation

by Jaap (Curate)
on Apr 01, 2003 at 09:48 UTC ( [id://247194]=perlquestion: print w/replies, xml ) Need Help??

Jaap has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Perl needs a fast and memory efficient XPath module.

The key to making a fast & memory efficient XPath processor is to find a smart data structure to derive from the XML document and keep in memory. The XPath query should then be done largely on that data structure.

Idea:
Suppose we build an array with values like this for each node: 'elementName,parentId,byteIndex' Where
elementName is the name of the element,
parentId is the arrayindex of the parent and
byteIndex is the byteposition of the element in the xml document.

If we have this xml document:
<?xml version="1.0"> <colors> <paprika> <good>red</good> <bad>green</bad> </paprika> <banana>yellow</banana> </colors>
The above xml doc would generate a data structure like this:
my @dataStructure = ( 'colors,-1,22', 'paprika,0,33', 'good,1,47', 'bad,1,68', 'banana,0,100', );
Now if the XPath query looks like this: /colors/paprika/bad then we'd start with the last part: bad, find it in the array and see if his parent is paprika (this can be done VERY fast since we have the arrayindex of the parent). If it is paprike, look for his parent to see if it's colors. If all that's true, we have the byteIndex to look in the XML document for the content of the element.

To keep it short now, are there any obvious shortcomings to this approach? Is it worth implementing? I have some optimisations in mind for this, but that would only make the story longer.

Replies are listed 'Best First'.
Re: Idea for XPath implementation
by mirod (Canon) on Apr 01, 2003 at 11:45 UTC

    XML::XPath is not the only XPath implementation you can look at: XML::Filter::Dispatcher which has a complete XPath parser (at least you can reuse the YAPP grammar). You could also look at the way libxml2 (the base for XML::LibXML) does it. Even though it is written in C you might be able to re-use the data structure and the algorithms.

    Finally, XPath is quite a complex standard, that goes way beyond the simple directory-like syntax that you are showing, it includes predicates, function calls... so it might not be easy to get a fast implementation of all of this (i.e a structure that speeds-up some queries might not be quite appropriate for others). DB-XML actually maintains user-defined XPath queries as indexes, that might be a good strategy.

Re: Idea for XPath implementation
by BrowserUk (Patriarch) on Apr 01, 2003 at 10:24 UTC

    What happens if <banana> also had children of <good> and <bad>? Now you would have to search for and track back two chains... and when you add apple/orange etc., it gets slower and slower.

    Also, a HoHoH... will always be considerably faster than a linear search of an array.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      First all <bad>'s are searched for and traced back. In this way, you get all the query results.

      You make a good point about the slow linear search but as i mentioned in my post, i want to keep the idea clear so i didn't talk about optimisations. However, the order of the array is important for some XPath queries so we'd make a second - binary sorted - array for speed.
Re: Idea for XPath implementation
by zby (Vicar) on Apr 01, 2003 at 10:30 UTC
    The name of the node is not unique - you can have many <bad> tags and you would have to search them all. This would be inefficient if there is just one <bad> node under the /colors/paprika/ subtree.

    To state the obvious XPath is a bit complex - you did look only on one kind of query for one kind of XML structure. Perhaps it might be usefull for such a restricted case. It might be even a good start for something more general, but you would have to add much to it.

    But to start you should do a research on existing solutions. Do you know what data structures uses XML::XPath? Have you looked at XML DB (this is a library I've just recently learned about on PM)?

      I did do some research on other XPath implementations (XML::XPath) but i wanted to keep my post clear & focussed.
      XML::Xpath uses XML::Parser to build a nested hash tree of the entire XML document.
        Ok - so sorry for that question. I hope I did provide the focussed answer in the beginning of my post.
Re: Idea for XPath implementation
by dakkar (Hermit) on Apr 01, 2003 at 11:36 UTC
    /descendant-or-self::*[@id='this']/preceding-sibling::*[starts-with(local-name(),'a')][contains(namespace-uri(),'perlmonks')]//some[3]/text()

    I know, it's quite an unlikely expression. But XPath is not only about child node names...

    -- 
            dakkar - Mobilis in mobile
    
      You make a good point about the complexity of XPath. I am aware of it. The reason why i made the simple example was to err... make a simple example ;-)
Re: Idea for XPath implementation
by thor (Priest) on Apr 01, 2003 at 13:40 UTC
    I know nothing about XPath, so take it for what it's worth, but this method of searching seems a bit backwards. That'd be like giving your OS of choice a file called /path/to/file, having the OS look for all files called 'file', anf for each one, seeing if it's in a directory called 'to', etc. If you have a heirarchical structure, best to make use of that fact and narrow your search quickly. A nested hash structure comes to mind for implementation...this way, you can see if $struct{colors}{paprika}{bad} exists, and have stored in that the value that you have listed above.

    thor

      The disadvantage of building a hashtree is that it takes a lot of memory.
      Also, on an XPath query like //banana, you have to search every leaf to see if it is called banana.
        You're right, it can take a lot of memory. If you're concerned about it, use a tied DBM file. My experience with them is light, but from what I remember, at least one of them handles nested hashes. Also, I don't see why looking for //bananna would take long at all. If you've got all of your stuff in a hash, a simple print "found banana" if $hash{banana} would suffice, I think. Or, if you want it to be even cleaner, you could do something like this:
        $hash{/banana} = 1.35; $hash{/banana/shake} = 1.48; $value = value("banana", "shake"); sub value { my $key = join "/", @_; return $hash{/$key}; }
        Like I said, though, I don't know anything about XPath, so I might be talking out of my hind quarter...:)

        thor

Re: Idea for XPath implementation
by BrowserUk (Patriarch) on Apr 02, 2003 at 10:30 UTC

    From your later replies I can now see the problem that you are trying to solve, which I admit I didn't really appreciate at first.

    I tried to see how to apply your structure to an XML file that contained repetitous elements

    Eg.

    <users> <user> <id>fred</id> <pass>crude</pass> </user> <user> <id>barney</id> <pass>polite</pass> </user> </users>

    And I think I could see it working for a simple repetitious grouping as shown here, once you start getting groups within groups, you end up with a huge, tautological mess that isn't even easy to see how to construct or search using nested recursion. It certainly won't be quick or easy.

    The only reasonably simply solution that I can think of is to build an single-level array, with each unique path in the data represented using a string representation of its path, and a second array of arrays that is a fully inverted index of the first.

    my @paths = ( "/colours/paprika/1/good/red", "/colours/paprika/1/bad/green", "/colours/paprika/2/good/yellow", "/colours/paprika/2/bad/brown", "/colours/banana/yellow" ); my %invert = ( colours=>[0,1,2,3,4], paprika=>[0,1,2,3], banana=>[4], good=>[0,2], bad=>[1,3], red=>[0], green=>[1], yellow=>[2,4], brown=>[3], );

    It's fairly easy to see how for any given query path, you can split it into its sub elements, look up each part in the array of arrays, do a little set arithmetic to locate those paths in the simple array that contains all the requisite matches, and then use the results to find the associated data.

    Building the inverted index can be achieved on-the-fly as the paths array is being constructed, or en-mass afterwards. The only slightly tricky part is remembering the next available index for groups of data, especially when these can be nested, but the final representation should be considerably less memory hungry than the equivalent hash and array tree, and the lookup of partial match query paths quite a lot faster too.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://247194]
Approved by Tanalis
Front-paged by Tanalis
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2024-03-29 10:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found