"HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts."
CodingHorror: Parsing HTML the Cthulhu Way

The fact is that the task is deceptively simple in appearance. What could be so bad about using regex for this? I really like tchrist's explanation on StackOverflow, "Oh Yes You Can Use Regexes to Parse HTML!" Here are a few of the many useful statements in that post:

  • "It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly."
  • "You must decide for yourself whether you're up to the task of writing what amounts to a dedicated, special-purpose HTML parser out of regexes. Most people are not." -- (By the way, neither am I.)
  • "For jobs where you have a well-defined input set, regexes are obviously the right choice, because it is trivial to put some together when you have a restricted subset of HTML to deal with. Even regex beginners should be [able to] handle those jobs with regexes. Anything else is overkill." -- (The implication I take from this is that if your specific need is hard for you to implement using regexes, then either you're not yet at the level of a regex beginner, or your input doesn't meet the criteria of being a well-defined, restricted subset. In the latter case it is not trivial to parse with regexes, and doing so is obviously not the right choice.)
  • "I use parsing classes all the time, especially if it's HTML I haven't generated myself." -- (Did you generate this HTML yourself? Do you have control over its structure? If not, we have a red flag.)
  • "Regexes optimal for small HTML parsing problems, pessimal for large ones" -- (Your problem appears small, but the need for .*? is a red flag to me that it's a larger problem than I would prefer to solve with regexes.)
  • "I'm not going to tell you what you must do or what you cannot do. I think that's Wrong. I just want to present you with possibilities, open your eyes a bit. You get to choose what you want to do and how you want to do it. There are no absolutes - and nobody else knows your own situation as well as you yourself do. If something seems like it's too much work, well, maybe it is. Programming should be fun, you know. If it isn't, you may be doing it wrong." -- (Are you having fun? Are you making a good decision here? Are you making a decision with knowledge of the options at your disposal?)

Let's open our eyes, then. Where's your sample input? I don't see any. So now I have to contrive some. I grabbed this from

<!DOCTYPE html>
<html>
<head>
    <title>GeeksforGeeks span tag</title>
    <!-- style for span tag -->
    <style type=text/css>
        span{
            color: green;
            text-decoration: underline;
            font-style: italic;
            font-weight: bold;
            font-size: 26px;
        }
    </style>
</head>
<body>
    <h2>Welcome To GFG</h2>
    <span>GeeksforGeeks</span></br>
    <span>GeeksforGeeks</span></br>
    <span>GeeksforGeeks</span></br>
</body>
</html>

And now let's open our minds to what a proper DOM class can achieve:

#!/usr/bin/env perl

use strict;
use warnings;
use Mojo::DOM;

my $content = <<'HERE';
<!DOCTYPE html>
<html>
<head>
    <title>GeeksforGeeks span tag</title>
    <!-- style for span tag -->
    <style type=text/css>
        span{
            color: green;
            text-decoration: underline;
            font-style: italic;
            font-weight: bold;
            font-size: 26px;
        }
    </style>
</head>
<body>
    <h2>Welcome To GFG</h2>
    <span>GeeksforGeeks</span></br>
    <span>GeeksforGeeks</span></br>
    <span>GeeksforGeeks</span></br>
</body>
</html>
HERE

# Parse the document, remove every <span> element, and print what is left.
my $dom = Mojo::DOM->new($content);
$dom->find('span')->map('remove');
print "$dom\n";

This produces:

<!DOCTYPE html>
<html>
<head>
    <title>GeeksforGeeks span tag</title>
    <!-- style for span tag -->
    <style type="text/css">
        span{
            color: green;
            text-decoration: underline;
            font-style: italic;
            font-weight: bold;
            font-size: 26px;
        }
    </style>
</head>
<body>
    <h2>Welcome To GFG</h2>
</body>
</html>

The beauty here is that you don't have to worry about what happens in the case of nested spans, which the regex you're producing doesn't look like it would deal with gracefully. And you don't have to worry about a whole bunch of other nuances, such as the fact that <span and <     span are equivalent (but also not handled by the regex you were crafting).
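To make the nested-span point concrete, here is a minimal sketch. The regex in it is hypothetical: your actual regex isn't reproduced here, so I'm assuming something non-greedy along the lines of s{<span>.*?</span>}{}gs. The one-line $html sample is also contrived for illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Mojo::DOM;

# Contrived input: one span nested inside another.
my $html = '<p>keep <span>outer <span>inner</span> tail</span> this</p>';

# Hypothetical regex approach (an assumption, not the regex from the question).
# The non-greedy match ends at the FIRST </span>, so the outer element's
# trailing text and its closing tag are left behind.
(my $by_regex = $html) =~ s{<span>.*?</span>}{}gs;

# DOM approach: find() returns both the outer and the inner span;
# removing the outer one takes its whole subtree with it.
my $dom = Mojo::DOM->new($html);
$dom->find('span')->map('remove');

print "regex: $by_regex\n";
print "dom:   $dom\n";
```

The regex version still contains a stray </span> and the word "tail", while the DOM version contains no trace of either span. You could patch the regex to cope with one level of nesting, but then someone hands you two levels.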

And what's the cost? In terms of non-core Perl modules, you've added:

All of those are part of the Mojolicious distribution, which ships as a 776 KB tarball and can be installed using only core Perl tools. The same distribution also gives you a nice user agent and a great test framework. It takes under a minute to install, and has no non-core dependencies.

My own take: It may be that I could take a stab at writing an HTML parser that would remove span tags, and also the other things you were asking about the day before, and the things you will ask about tomorrow, for the subset of HTML you deal with in your specific use case. And I might get it right. But I will have wasted a lot of time to implement a more fragile solution to a very specific problem, and it would be a tool that couldn't grow as my problem evolves.

I don't know what your job is, but my job is not to spend more time than necessary creating a less robust, buggier solution to an already solved problem when I'm aware that a shorter-time-to-production, more robust, less buggy, easier-to-understand approach exists. Now, I may sometimes manage to do that unintentionally anyway. But when shown the light, I realize that part of my job is to learn, and to evolve to adopt the better approach.


In reply to Re: regex in perl by davido
in thread regex in perl by bigup401