Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: Comparative satisfiability of regexps.

by BrowserUk (Patriarch)
on Jan 20, 2005 at 06:31 UTC ( [id://423613]=note: print w/replies, xml ) Need Help??


in reply to Comparative satisfiability of regexps.

I cannot add anything to the debate whether this is doable, but I have to ask why you would want to do this?

Other than an academic answer, "To see if it is possible."; or an enigmatic one "Because I can."; both of which are perfectly fine answers, I cannot see a reason for doing this.

I cannot imagine the situation in which characterising a routine's inputs and or outputs in terms of some restricted subset of regex is the best, or even the easiest, way of characterisation. I guess it could be the most concise, but you generally know much more about an input ot output field than can easily be captured into a regex. Especially at design time?


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^2: Comparative satisfiability of regexps.
by Meowse (Beadle) on Jan 20, 2005 at 10:02 UTC
    Oh, man, you guys are gonna have a field day with this one. *wry grin*

    I'm attempting to answer the question, "Does any XML document described by A.xsd satisfy the constraints of B.xsd?" Which is, as far as I can tell, calculable iff one can calculate the answer to the question, "Does regexp A satisfy the constraints of regexp B?"

    I was quite surprised to find that nobody is apparently even considering the question as it relates to .xsd's, which to my mind misses one of the really great opportunities for proving the correctness of loosely coupled, distributed application networks. Apparently, people are only concerned either with tightly-coupled, pre-negotiated application networks (where correctness is known because application A's output and application B's input are constrained by the same .xsd), or with ad-hoc, one-off verification that this particular instance of application A's output is consistent with application B's input requirements.

    I think this is regrettable, and should be fixed. Hence the preceding discussion.

    Thanks for your patience, and your knowledge, and your help!

    Mickey.

    UPDATE:An .xsd (XML Schema Definition) file defines the allowable names, structures, and contents of XML elements and attributes in an XML file. One might write a calendaring application that took XML input, and whose input was specified by an .xsd ("a .xsd"? "an xsd"? "an .xsd"? Bah!) that required the input XML to contain a date element, a time element, and a schedule_item_name element. And so on.

      I'm attempting to answer the question, "Does any XML document described by A.xsd satisfy the constraints of B.xsd?" ...

      Ouch! Well that told me :)

      I have to say that the thought that crosses my mind is an old joke about the guy that answered the question:

      How can I get to X*, from here?

      *Avoiding racial sterotypes here!

      And got the reply.

      If I were you, I wouldn't be starting from here!

      I happen to think that the benefits of XML have been way oversold, and I'm not even certain that I know what an .xsd is.

      In any case, I wish you the very best of luck in your endevour.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
      Like XML, xsd appears to be a language with arbitrary nesting of tagged regions. In addition, it seems to define symbols which are used in other expressions. Regular expressions are not a suitable tool for such a language. You need something like an LALR or recursive descent parser.

      For example, this is a fragment I found:

      <xsd:complexType name="Items"> <xsd:sequence> <xsd:element name="item" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element name="productName" type="xsd:string"/> <xsd:element name="quantity"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="100"/> </xsd:restriction> </xsd:simpleType> </xsd:element> <xsd:element name="USPrice" type="xsd:decimal"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="shipDate" type="xsd:date" minOccurs="0" +/> </xsd:sequence> <xsd:attribute name="partNum" type="SKU" use="required"/> </xsd:complexType> </xsd:element> </xsd:sequence> </xsd:complexType> <!-- Stock Keeping Unit, a code for identifying products --> <xsd:simpleType name="SKU"> <xsd:restriction base="xsd:string"> <xsd:pattern value="\d{3}-[A-Z]{2}"/> </xsd:restriction> </xsd:simpleType>

      It does embed regular expressions here and there to define new data types (the definition of "SKU" for example). If those are all you are trying to validate, our suggestions should work. Otherwise, this is a much more difficult problem.

      Update: I'm also having trouble seeing the utility of this proposed checker. Suppose I have two arbitrary web applications A and B, and both were kind enough to supply xsd schemas. Can I determine if they can interact meaniningfully? Not really. It's a matter of semantics, not just matching schemas. Say both define a "Price" field. How do I know one is not in dollars and the other in euros? I think that's why the usual practice is "pre-negotiated" schemas.

        Update: I'm also having trouble seeing the utility of this proposed checker. Suppose I have two arbitrary web applications A and B, and both were kind enough to supply xsd schemas. Can I determine if they can interact meaniningfully? Not really. It's a matter of semantics, not just matching schemas.

        I hope the OP answers this, because its been interesting so far. But for the calendaring application mentioned, why not create a simple schema and require conformance? It would need to be done per client with xslt (the glue language of xml!) or whatever. But without explicit agreement you are just guessing that elements from different schemas with similar names describe the same real-world objects. Sounds dangerous.

      Um, stupid question time: Since it sounds like you are already using XML (or could), and already have an XSD (or two), why aren't you either:

      • Changing the input and output programs to use the SAME .XSD. Then either end can validate the document against the XSD and assume that if it is valid it will work. Or,
      • Using an XSLT to get from one format to the other. If you have XSD for both, then you should only have to build the XSL transform once, and just run it on the data as it goes through.

      It just really sounds like you're making this too hard. Or I completely don't get it.

      Footnote: For those wondering what XSLT is, it is an XML application that defines transformations that can be applied to XML documents. It can convert XML to XML, HTML, or just about anything else.


      --
      Spring: Forces, Coiled Again!
Re^2: Comparative satisfiability of regexps.
by Molt (Chaplain) on Jan 20, 2005 at 09:30 UTC

    The reason I was guessing at is that they have a lot of data to validate based on a complex regex and were considering using another faster regex as a precheck to save on processing.

    If this was the case then it's to prove that the precheck won't reject anything that the main check would allow.

      That's a pretty good application, though I have to say that I would probably opt for a statistical approach to verification.

      Generate a bunch of random strings, pass them through the stricter test, and then pass it's output through the quicker. If the second results set equals the first, you're good to go.

      With a good knowledge of what the regex are attempting to match, it should be relatively easy to generate random datasets that exercise most of the edgecases, or at least a statistically computable subset thereof.

      And that was my point. Using the knowledge of what the regex are trying to match, and statistics, is probably much easier than try to cross-compare to regex, for which the comparison is only valid if you can first prove that at least one of the regex is itself "correct"--which is an extremely hard task to start with.

      Unless you can:

      • use a definite dataset to prove one of them.

        In which case you could use the same dataset to prove the other.

      • You accept a statistical proof of one.

        Which again, could be used to prove the other.

      Proving the compatibility of two unprovable entities doesn't really buy you a lot unless the comparability test is very quick and cheap--which doesn't appear to be the case here.

      There is no point in measuring something, unless you are pretty sure that your ruler is accurate!


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        All true, and well-said, except that the precise commodity desired in this case is proof of the compatibility of two unprovable entities. My code neither knows nor cares what the two regexps "mean", or whether they correctly represent the author's intent; it only cares that output which satisfies the first regexp can be affirmatively stated to satisfy the second regexp.

        In our particular instance, we're attempting to verify the correctness of a meta-application built of sequentially assembled functoids whose input and output contracts are specified as .xsd files. We don't want to know what any given functoid does, or what its parameters (or outputs) represent. We just want to verify that a given sequence of correct functoids is itself correct.

        As a separate note, I'm really disinclined to accept Monte Carlo proofs in the hard sciences. I'll accept them in the soft sciences, but only because there aren't better alternatives. In programming in particular, there's almost always a formal proof of correctness available.

        Thanks much,

        Mickey.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://423613]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2024-03-28 22:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found