Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Oh, it depends on the data. I mostly agree with you, but I also know that it can be quite appropriate to just grab stuff out of the middle with a simpler regex in many cases. Ironically, I find this particularly true of HTML and XML because you are unlikely to find <em>Price:</em> in a place where it doesn't mean the obvious thing. That is, for example, <a href="<em>Price:</em>"> is very unlikely, even if XML (stupidly) doesn't just outright forbid it (it should, IMIO, require the inner angle brackets to be escaped).

But, of course, there are HTML comments and XML CDATA that do allow for such confusing situations (and I find CDATA a pretty pointless feature). In practice, I find that it is often quite safe to discount this risk, however.

My personal theory is that this mantra against using regexes against HTML (etc.) probably originated from more narrow situations where there were repeated attempts by novice coders to do more systematic manipulation of HTML. For example, someone proposing to use


in a web spider is more deserving of being warned to use a proper parser than someone trying to pull one stock price from an internal web site for personal use.

So the warning has some value even in the latter case, but it is more likely to be overstated. I don't see a rash of hot ranting about such things. It is almost always mentioned, which probably annoys some people.

I rarely see people admit that a simpler regex against something like HTML can often be a reasonable solution, yet I've certainly had much practical success doing such over the years, so I consider myself qualified to state that it can work well. But designing such a solution well requires taking this risk into account, so it can be important to point out the risk. And the response to such comments often encourages restressing the point.

So I can see how someone would think that this point is often being made dogmatically. But I don't see any value in trying to lump them into some imaginary group or apply insulting language to them. (:

- tye        

In reply to Re^5: Breaking The Rules II (regex HTML) by tye
in thread Breaking The Rules II by Limbic~Region

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others perusing the Monastery: (9)
    As of 2020-11-27 09:27 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found