Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Operator for "these expressions, in any order"

by jsalvata (Initiate)
on Feb 17, 2004 at 15:37 UTC ( [id://329626]=perlquestion: print w/replies, xml ) Need Help??

jsalvata has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

In analising some HTML using using regexps, I need to match HTML tags containing certain attributes independently of the order in which these attributes appear -- but ensuring that each of the attributes appears once.

To do this, I would need a reasonably concise and efficient way to build:

from two regexps X and Y, a regexp almost equivalent to

XY|YX

from three regexps X, Y and Z, a regexp almost equivalent to

XYZ|XZY|YXZ|YZX|ZXY|ZYX

... you see the pattern. The "almost" refers to the fact that I would like captured groups in X, Y, Z to be accessible as if only the first case in the alternate (|) construct was there.

I'm currently using

(?:X|Y){2} (?# for the first example) (?:X|Y|Z){3} (?# for the second)

which is very very close to what I need --and works well for my particular job-- but is not formally correct, since it would give a false positive if any of the subexpressions X, Y, Z were matched twice.

Is there any way to 'burn out' those X, Y, Z so that they can only match once? I could use something in the lines of:

(?:(?(1)NOTMATCH|(X))|(?(2)NOTMATCH|(Y))|(?(3)NOTMATCH|(Z))){3}

where NOTMATCH is a regexp that never matches. But it gets unmanageable if X, Y, Z contain capture groups.

Note that this is a problem I often run against when using regexps -- not only in the context of analising HTML. Would it make sense to extend regexp syntax to include a combining operator (like XY or X|Y) meaning 'match these in sequence but in any order?' e.g. X&Y?

All comments welcome.

Replies are listed 'Best First'.
•Re: Operator for "these expressions, in any order"
by merlyn (Sage) on Feb 17, 2004 at 15:47 UTC
      Just another Randal... I like this solution better. :-)

      That would match on "XYZZ", of which I got the impression the OP doesn't want a match.

      Abigail

Re: Operator for "these expressions, in any order" (both)
by tye (Sage) on Feb 17, 2004 at 16:45 UTC

    You can combine a couple of near-solutions:

    /^(?=.*X)(?=.*Y)(?=.*Z)(X|Y|Z){3}$/

    If X, Y, and Z are distinct enough, then this should be a total solution. If, for example, X is a subpattern of Y, then it would match YYZ.

    - tye        

      Excellent! In my case, X, Y Z... are definitely unmistakable.

      I'll just need to check this for performance (yes, just to make my problem more difficult, performance is also very relevant). Fortunately I have something more restrictive to use in lieu of the ".*"...

Re: Operator for "these expressions, in any order"
by Roger (Parson) on Feb 17, 2004 at 15:44 UTC
    use Algorithm::Permute; my @patterns = (); my $p = new Algorithm::Permute(['X', 'Y', 'Z']); while (@res = $p->next) { push @patterns, join('', @res); } my $pattern = join '|', @patterns; print "$pattern\n"; # give the pattern a test... while (<DATA>) { chomp; print "$_ is ", /^(?:$pattern)$/ ? 'good' : 'bad', "\n"; } __DATA__ XYX XYZ ZYY ZXY ZYX XXYZ

    And the output is ...
    ZYX|YZX|YXZ|ZXY|XZY|XYZ XYX is bad XYZ is good ZYY is bad ZXY is good ZYX is good XXYZ is bad

      Unfortunally, the size of the resulting regex grows exponentially. There are 6 permutations for 3 choices, 24 for 4, 120 for 5, 720 for 6, and 3628800 for 10.

      Abigail

        Not to be a stickler, but it's not exponential growth. It's worse than that. For instance, 2**4 = 16, but 4! = 24. Choose any base j, and there's a number n such that j**n < n!. Of course, this only strenghens your point...:).

        thor

        Yes I know. That's why I like Randal's solution better. :-)

      I thought I already provided this answer myself.

      However, this does not address the capture-groups issue I mentioned in my question: if X, Y and Z all had capture groups, obtaining their values (and knowing which regexp each comes from) would require some funky code that I need to avoid.

      Let me clarify my problem: the issue is not how to build a matching regexp, but how to keep the capture groups reasonable. The reason for this is that X Y Z would be build from regexps entered by the program users, who would also provide a result pattern, possibly containing $1 $2,... So I need the correspondence between the capture groups in the input regexps and the vars to be easy to describe.

Re: Operator for "these expressions, in any order"
by Abigail-II (Bishop) on Feb 17, 2004 at 16:12 UTC
    #!/usr/bin/perl use strict; use warnings; while (<DATA>) { chomp; print "$_: ", /^(?{ %choices = map {;"\Q$_" => 1} qw !X Y Z! }) (?: (?!)? # Will never match, but keeps perl from +whining. ((??{ keys %choices ? join "|" => keys %choices : "(?!)" })) (?{ delete $choices {$1} }))+ (?(?{ keys %choices })(?!)|)$/x ? "ok\n" : "not + ok\n"; } __DATA__ XYZ ZYX ZXY XY ZYYX ABC XYZ: ok ZYX: ok ZXY: ok XY: not ok ZYYX: not ok ABC: not ok

    Abigail

Re: Operator for "these expressions, in any order"
by dragonchild (Archbishop) on Feb 17, 2004 at 19:29 UTC
    I seem to always be the one doing this, but here goes:

    Why are you matching HTML with regexes? It's dangerous and fraught with peril, as well as being impossible to maintain or get right. Why not use something like, oh, HTML::Parser and have it deal with the problem of how to figure out what has way and you just ask it "Does tag ABC have attributes X, Y, and Z?"

    Or ... attack the problem another way. Either these pages are static or they're not. If they are, then read them by hand. No matter how many you have, so long as they don't change, you'll finish, eventually. (A very large amount and q.v. solution #1.)

    If they're generated in some fashion, then don't examine the output, examine the generator! A quick code review with a colleague and a whiteboard will quickly tell you if you're double-generating attributes. Now, if the code is dense and impenetrable, that's a good reason to rewrite it, and in the process guarantee that this issue is a non-starter.

    Now, you might have issues with the idea of HTML being embedded in the code. Get it out and use templates. HTML doesn't belong in code, and vice-versa.

    Of course, this entire discussion begs the question - why aren't you using CSS?

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      I thought I would get one of these, I should have answered all these in my initial posting... just it is long to explain and I'm too lazy.

      > Why are you matching HTML with regexes?

      Because I'm only interested in very small parts of the file, and by hard experience I've learned this is the fastest way.

      Background information:

      Well, I have to admit that my question was a Perl Regexp question, but not a Perl question. The problem at hand is coding for Apache Jakarta JMeter, a Java load/performance testing application. There are several situations in which we need to analyze HTML, but it's never the whole thing, but just small bits of it. For example obtaining values in a particular hidden field to pass in a later request, or obtaining the URLs of embedded elements (images, CSSs, etc.) to download them too.

      Hope this answers your question on why we need to examine the output and not the generator.

      As for not doing the wrong thing upon oocurence of double-attributes, I'm not too worried about that -- I currently live happily with the (?:X|Y|Z){3} -- it's more of a "how would I do it?" question than a "how do I do it"?

      Still, JMeter is a test tool, and it should help you detect problems in your code (most relevantly performance problems and problems that only happen under load). Code review, as you suggest, is another way -- a complementary one, not an alternative one.

      > Why not use something like, oh, HTML::Parser ... ?

      We have implemented three alternative solutions: one based on HtmlParser, one on JTidy, and a crappy one I wrote using regexps. I am aware that the later can never be formally correct, but it is currently the fastest of the three and, I'm proud to say, the most reliable in real-world situations so far.

      Hope I've addressed all your relevant concerns.

      Salut,

      Jordi.

Re: Operator for "these expressions, in any order"
by CountZero (Bishop) on Feb 17, 2004 at 19:31 UTC
    Did you consider using HTML::Parser? it will parse the HTML (no need to use regexen to match the tags) and the attributes are presented to you in a nice hash, which solves your permutation "problem".

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://329626]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 15:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found