PerlMonks  

Re^5: aXML vs TT2

by anneli (Pilgrim)
on Oct 22, 2011 at 02:43 UTC [id://933013]


in reply to Re^4: aXML vs TT2
in thread aXML vs TT2

So what does your system do right now when it sees a tag like <tag />? Does it get ignored?

Re^6: aXML vs TT2
by Logicus (Initiate) on Oct 22, 2011 at 03:05 UTC

    The parser works in two phases. First it scans with a very fast non-backtracking regex to see whether it can find any opening tags it recognises. If it succeeds, it marks each such tag with a control character, then runs a slower regex in a loop that scans for a complete open and close tag set. When it finds a set that it knows is valid and that does not contain a nested tag set (as determined by negating the control character), it executes the relevant code and substitutes the return value into the document. The loop continues until there are no more matching sets to process.

    The first phase negates the forward slash to prevent it from picking up and marking the close tags, so the tag you mentioned will probably be ignored by both phases and remain untouched in the final output. I'd have to check the actual regex to be certain, but I'm pretty sure that is correct.
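The two-phase process described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual aXML code (which is Perl and is not shown in this thread). The tag names `upper` and `rev`, the `HANDLERS` table, the `\x01` marker, and the final clean-up of leftover markers are all assumptions made for the example.

```python
import re

MARK = "\x01"  # assumed control character used to flag recognised opening tags

# Hypothetical tag handlers; the real system's tag set is not given in the thread.
HANDLERS = {
    "upper": lambda body: body.upper(),
    "rev": lambda body: body[::-1],
}
NAMES = "|".join(re.escape(name) for name in HANDLERS)

# Phase 1: recognised opening tags only. "<" must be followed directly by a
# known name, so close tags ("</name>") and "<tag />" never match.
OPEN_TAG = re.compile(r"<(%s)>" % NAMES)

# Phase 2: a marked opening tag, a body containing no marker (so no nested
# unprocessed tag set), and the matching close tag.
INNERMOST = re.compile(MARK + r"<(%s)>([^%s]*?)</\1>" % (NAMES, MARK))


def process(doc):
    # Phase 1: prefix every recognised opening tag with the marker.
    doc = OPEN_TAG.sub(lambda m: MARK + m.group(0), doc)
    # Phase 2: repeatedly replace innermost tag sets with handler output
    # until a pass makes no substitutions.
    while True:
        doc, count = INNERMOST.subn(
            lambda m: HANDLERS[m.group(1)](m.group(2)), doc
        )
        if count == 0:
            break
    # Assumption: markers left on unclosed tags are simply stripped.
    return doc.replace(MARK, "")
```

With these assumed handlers, `process("<upper>a<rev>bc</rev></upper>")` resolves the inner `rev` set first and then the outer `upper` set, while input such as `x <tag /> y` passes through unchanged because neither phase matches it.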

      Have you considered writing a proper state machine-based lexer/parser? It would positively fly, compared to using regular expressions, and probably end up more amenable to extension (if you wanted to actually support XML, for instance).
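For readers unfamiliar with the idea, a character-by-character state-machine lexer might look something like the sketch below. It is written in Python purely for illustration and is not taken from aXML or any other project in this thread; the state names and token kinds (`text`, `open`, `close`, `empty`) are invented for the example. It shows the structure of the approach only and makes no claim about speed.

```python
def lex(doc):
    """Split doc into ("text", s), ("open", name), ("close", name)
    and ("empty", name) tokens, reading one character at a time."""
    tokens = []
    state = "TEXT"
    kind = "open"
    buf = []  # characters collected for the current token

    for ch in doc:
        if state == "TEXT":
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = []
                state = "TAG_START"
            else:
                buf.append(ch)
            continue

        if state == "TAG_START":
            if ch == "/":          # "</" begins a close tag
                kind = "close"
                state = "NAME"
                continue
            kind = "open"          # anything else starts a tag name;
            state = "NAME"         # fall through and handle ch as NAME

        if state == "NAME":
            if ch == ">":
                tokens.append((kind, "".join(buf).strip()))
                buf = []
                state = "TEXT"
            elif ch == "/":
                if kind != "open":
                    raise ValueError("unexpected '/' in close tag")
                kind = "empty"     # "<tag />" style
                state = "EXPECT_GT"
            else:
                buf.append(ch)
            continue

        if state == "EXPECT_GT":
            if ch != ">":
                raise ValueError("expected '>' after '/'")
            tokens.append((kind, "".join(buf).strip()))
            buf = []
            state = "TEXT"

    if state != "TEXT":
        raise ValueError("unterminated tag")
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

A parser built on top of such a token stream would see `<tag />` as an explicit `("empty", "tag")` token instead of leaving it untouched, which is one way the approach could be extended toward real XML.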

        I have made many attempts at building a character-by-character state machine for it, all of which failed miserably and ended up breaking the functionality in some way or other.

        I do believe it is possible to construct such a thing, and that if I had the time and inclination I would eventually solve the problem. But since I have now adopted Plack as the basis for the system to sit upon, I'm already getting close to a thousand hits a second out of it on a quad-core Phenom II as it stands (16,000 hits in just over 18 seconds, last time I ran my stress-testing script).

        Given that 24- and 32-processor server systems are readily available, and chips with 100+ processor cores are already in production by companies such as Tilera, I don't feel at this stage that there is much point in further optimisation: all the "low-hanging fruit" has already been had by optimising the regexes and removing backtracking where possible.
