Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Oh, it depends on the data. I mostly agree with you, but I also know that it can be quite appropriate to just grab stuff out of the middle with a simpler regex in many cases. Ironically, I find this particularly true of HTML and XML because you are unlikely to find <em>Price:</em> in a place where it doesn't mean the obvious thing. That is, for example, <a href="<em>Price:</em>"> is very unlikely, even if XML (stupidly) doesn't just outright forbid it (it should, IMIO, require the inner angle brackets to be escaped).

But, of course, there are HTML comments and XML CDATA that do allow for such confusing situations (and I find CDATA a pretty pointless feature). In practice, I find that it is often quite safe to discount this risk, however.

My personal theory is that this mantra against using regexes against HTML (etc.) probably originated from more narrow situations where there were repeated attempts by novice coders to do more systematic manipulation of HTML. For example, someone proposing to use

/href="([^"]+)"/g

in a web spider is more deserving of being warned to use a proper parser than someone trying to pull one stock price from an internal web site for personal use.

So the warning has some value even in the latter case, but it is more likely to be overstated. I don't see a rash of hot ranting about such things. It is almost always mentioned, which probably annoys some people.

I rarely see people admit that a simpler regex against something like HTML can often be a reasonable solution, yet I've certainly had much practical success doing such over the years, so I consider myself qualified to state that it can work well. But designing such a solution well requires taking this risk into account, so it can be important to point out the risk. And the response to such comments often encourages restressing the point.

So I can see how someone would think that this point is often being made dogmatically. But I don't see any value in trying to lump them into some imaginary group or apply insulting language to them. (:

- tye        


In reply to Re^5: Breaking The Rules II (regex HTML) by tye
in thread Breaking The Rules II by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2024-04-20 02:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found