Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I wasn't targetting P::RD. It is a perfectly fine module for those situations where you need to extract full semantic information from the language you are analysing. But even for this, it's certainly not the only game in town, nor necessarily the best choice for any given application.

With respect to shift/reduce conflicts and Parse::YAPP: It's possible to construct ambiguous grammars regardless of which type of parser one targets, and equally possible to resolve them.

My main point was that parsers in general aren't an easy to learn and use, alternative to regex. Especially when a lot of the time when people say: I want to parse ...; they often don't want to parse at all. They simply want to extract some information, that may happen to be embedded within some other information.

For example, for the vast majority of screen scraping applications, the user has no interest whatsoever in extracting any semantic or syntactic information from the surrounding text. Even if that surrounding text happens to be in a form that may or may not comply with one of the myriad variations of some gml-like markup.

Their only interest is locating a specific piece of text that happens to be embedded within a lot of other text. There may be some clues in that other text that they will need to locate the text they are after, but they couldn't give two hoots whether that other text is self-consistant with some gml/html/xhtml standard.

For this type of application, not only does parsing the surrounding html require a considerable amount of effort and time--both programmer time and processor time--given the flexibility of browsers to DWIM with badly written HTML/XTML, it would often set the programmer on a hiding to nothing to even try. Luckily, HTML::Parser and freinds are pragmatically and specifically written to gloss over the finer points of those standards and operate in a manner that DWTALTPWAMs (Do What The Average, Less Than Perfect, Web Author Means).

Even so, after 5 years, I have still to see any convincing argument against the opinions I expressed when I wrote Being a heretic and going against the party line.. I still find it far quicker to use a 'bunch of regex' to extract the data I want from the average, subject-to-change, website than to work out which combination of modules and methods are required to 'do it properly'. And when things change, I find it easier to adjust the regex than figure out which other module or modules and methods I now require.

I think that there is an element of 'laziness gone to far' in the dogma that regex is "unreadable, unmaintainable and hard". It is a complex tool with complex rules, just as every parser out there. You have to learn to use it, just as with every other parsing tool out there. It has limitations just like every other parser out there.

And there are several significant advantages of learning to use regex, over every other parsing tool out there.

  1. It's always available.
  2. It is applicable to every situation.

    Left recursive; right recursive; top down; bottom up; nibbling; lookahead; maximal chunk; whatever.

  3. You have complete control.

    Need to perform some program logic part way through a parse? No problem, use /gc and while.

    Need to parse a datstream on the fly. No problem, same technique applies.

    Want to just skip over stuff that doesn't matter to your application. No problem. Parse what you need to, skip over what you don't. You don't have to cater for all eventualities, nor restrict yourself to dealing with data that complies to some formalised, published set of rules.

  4. It's fast.

Mostly, take the time to learn to use regex well and you'll not need to run off to cpan to grab, and spend time learning to use, one of ten new modules, each of which purport to do what you need, but each of which has its own set of limitations and caveats.

I have a regex based parser for math expressions, with precedence and identifiers, assignment and variadic functions. It's all of 60 lines including the comprehensive test suite! One day I'll get around to cleaning it up and posting it somewhere.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^3: Breaking The Rules II by BrowserUk
in thread Breaking The Rules II by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (4)
As of 2024-04-19 16:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found