PerlMonks  

Re: Why a regex *really* isn't good enough for HTML, even for "simple" tasks

by ikegami (Pope)
on May 09, 2020 at 08:50 UTC (#11116601)


in reply to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Your argument is utterly unconvincing. People use regex to extract from HTML documents because it works. They wouldn't use a regex to extract the urls from the document you provided because it wouldn't work.

The real reason not to create a half-assed parser (using regex or otherwise) is this phrase we've all heard: "But it worked yesterday." This is what you'll get with a hacked up solution because it's going to be far less resilient to change and a lot more expensive to maintain than one using a proper parser.

Also, there's a good chance you'll spend far more time developing the hacked up solution as you keep finding corner cases.
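
To make the "but it worked yesterday" failure concrete, here is a minimal sketch. It is written in Python only so that it runs with the standard library and nothing else; the same contrast holds in Perl with any real HTML parser. The sample documents, the pattern, and the function names are all hypothetical and are not taken from the original post.

```python
import re
from html.parser import HTMLParser

# Hypothetical input that the quick regex was written against.
HTML_YESTERDAY = '<p><a href="/one">One</a> <a href="/two">Two</a></p>'

# The same two links after harmless markup changes: uppercase tag and
# attribute names, single quotes, an extra attribute, a line break
# inside a tag, and an old link that has been commented out.
HTML_TODAY = (
    "<p><A class='nav' HREF='/one'>One</A>\n"
    "<!-- <a href=\"/old\">Old</a> -->\n"
    "<a\n  href=\"/two\">Two</a></p>"
)

# The kind of "simple" pattern people tend to reach for (an assumption
# for illustration, not a pattern quoted from the thread).
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')


def links_by_regex(html):
    """Return href values found by the naive pattern."""
    return NAIVE_HREF.findall(html)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and never
        # reports tags that sit inside a comment.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    """Return href values found by a real (if lenient) HTML parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_by_regex(HTML_YESTERDAY))   # ['/one', '/two']
    print(links_by_parser(HTML_YESTERDAY))  # ['/one', '/two']
    print(links_by_regex(HTML_TODAY))       # ['/old']
    print(links_by_parser(HTML_TODAY))      # ['/one', '/two']
```

Both approaches agree on the first document. After the markup changes, the regex misses both live links and returns the commented-out one, while the parser version keeps working without modification. The regex could be patched for each of these cases, but every patch is another corner case found the hard way.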

Update: Replaced the claim that the presented task isn't a simple task with an explanation of why it isn't one. Sorry, this was done within seconds of posting.

Re^2: Why a regex *really* isn't good enough for HTML, even for "simple" tasks
by haukex (Bishop) on May 09, 2020 at 08:59 UTC
    Your argument is utterly unconvincing. No one would claim that parsing that HTML is a simple task.

    Except that's not what I said, and people do try to use regexes to extract stuff from HTML all the time.

    The real reason not to create a half-assed parser (using regex or otherwise) is the following: "But it worked yesterday." A hacked up solution is going to be far less resilient to change and a lot more expensive to maintain than one using a proper parser.

    Which is exactly the argument I made in Parsing HTML/XML with Regular Expressions.

    Update: PerlMonks has a preview function; I won't be responding to your ninja edits. The above quotes represent the entirety of your post at my time of posting.

      people do try to use regexes to extract stuff from HTML all the time.

      I know. And like I said, your argument isn't going to convince a single one of them to stop. They will see their tasks as simple tasks and yours as complex, and you completely failed to show why regex shouldn't be used for simple tasks despite your claims. Perhaps you should add an explanation as to why they shouldn't be used for simple tasks?

        I know. And like I said, your argument isn't going to convince a single one of them to stop. They will see their tasks as simple tasks and yours as complex, and you completely failed to show why regex shouldn't be used for simple tasks despite your claims.

        I see your point now, and I guess that means your initial post could have been something along the lines of "I think your argument might be less effective because people will see their tasks as simple tasks and yours as complex, so how about adding an explanation why regexes still shouldn't be used?". Instead, you chose to be rude.

        Update: Once again, the above quote represents the entirety of your node at the time I saw it and started composing my reply.

        Downvoting and ignoring constructive criticism isn't going to convince the people you are supposedly trying to help. When I say it won't convince them, I mean it has always failed to convince them before. I've seen people make the same argument countless times to no avail. The best results I've seen have been from showing them it's actually easier to do it right. That even appears to be the message you are trying to send with the examples, so it's really just a question of how you frame the problem!

      Except that's not what I said

      You said: "Why a regex *really* isn't good enough for HTML, even for "simple" tasks". So yeah, you did.

      Which is exactly the argument I made in Parsing HTML/XML with Regular Expressions.

      ok, but it's what you said here I'm commenting on.

        No one would claim that parsing that HTML is a simple task.

        Since you're active on both PerlMonks and StackOverflow, you must be aware of the fact that scores of people try to pull stuff from HTML using regexes. My node title is what it is as a response to that.

        You said: "Why a regex *really* isn't good enough for HTML, even for "simple" tasks". So yeah, you did.

        Read what I wrote again, keeping in mind what I said above, and maybe you'll see that your interpretation of what I said is not what I meant. Unfortunately, it seems that once again your drive to maintain that you are correct is stronger than your drive to be reasonable, so I'm out.
