Pattern Search on HTML source.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Pattern Search on HTML source. by jettero (Monsignor) on Dec 31, 2007 at 16:56 UTC
It's notoriously poor form to try to parse HTML with regular expressions. You might want to look at HTML::Parser or XML::XPath or one of the million other choices. Your approach might otherwise work (some of the time), except that 1nd should be 1st? `m/\Q<!--\E\s*1st table\s*\Q-->\E(.*?)\Q<!--\E\s*\/1st table\s*\Q-->\E/g` -Paul
Re: Pattern Search on HTML source. by sundialsvc4 (Abbot) on Dec 31, 2007 at 17:04 UTC
I very strongly agree: do not use regular expressions to parse HTML code. Do not treat the input that you have been given as a “string.” Dictum Ne Agas: do not do a thing already done. Use one of the many HTML-parsing tools already mentioned to transform the HTML input into a data-structure consisting of Perl hashes and lists. Then, navigate through the structure. (You can even find assistance, in so-called “XPath expressions,” to do that part without having to write custom code.)
Re: Pattern Search on HTML source. by ww (Archbishop) on Dec 31, 2007 at 17:09 UTC
Although you have correctly made your regex non-greedy, it "can't get a match" because `1nd` (in the regex) does not match `1st` (in the data): `my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff...<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff"; if ($output =~ /<!-- 1st table -->(.*?)<!-- \/1st table -->/g) { print $1; } else { print "Nothing Here!"; }` Run as `perl 23.pl`, this cheerfully spits out `What I want 1`. However: Your `/g` isn't doing what you think. You've tried to specify a single set of tags; `/g` will find the content between them if they're repeated, but it won't find "`2st` (sic) `movie`". Your pseudo-HTML makes no sense: tables without rows or data cells? Using LWP or similar, if you're not already, could save you the trouble of saving the source data as a text file. It's a tad peculiar to name the input FH in your code "OUTPUT". And, if you're going to parse HTML, use a module. There are just too many ways to go wrong while rolling your own.
Re^2: Pattern Search on HTML source. by Anonymous Monk on Dec 31, 2007 at 19:05 UTC
The problem is that the tags contain something like: `my $output = "<!-- 1st table --> What I want 1<!-- /1st table -->more stuff...<!-- 2st movie --> What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";` With a carriage return or something like that in there, I can't get it to match.
Re^3: Pattern Search on HTML source. by ww (Archbishop) on Dec 31, 2007 at 19:20 UTC
AM: use the `download` link beneath the code to capture it rather than copy-pasting, or remove the newlines from what you copy-pasted until you have the $output as a single line in your editor. And, updating the previous: I realized, belatedly, that you appear to want to capture the contents of all the tag pairs, rather than just the first. Sorry, the code I posted captures only the first and, so far, I haven't worked out a simple (aka "elegant") and understandable way to do them all with a regex. Cf. the advice to use an HTML parser or (new suggestion) a module designed to deal with matching pairs. Perhaps wiser monks will offer more particular suggestions.
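For what it's worth, one way to grab every pair with a regex is to capture the label in the opening comment and require the same label, via a backreference, in the closing one. The following is only a sketch under assumptions: the sub name `extract_blocks` is made up for illustration, the labels are assumed to be two words (as in `1st table` or `2st movie`), and the sample string is modeled on the one posted above with a newline added. It is a quick hack for this pseudo-HTML, not a substitute for a real parser.

```perl
use strict;
use warnings;

# Sample data modeled on the string posted earlier in the thread, with a
# newline after the first opening comment to mimic the copy-paste problem.
my $output = "<!-- 1st table -->\nWhat I want 1<!-- /1st table -->more stuff..."
           . "<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff..."
           . "<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";

# Hypothetical helper: returns a list of [label, content] pairs,
# one for each opening comment that has a matching closing comment.
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    # (\w+\s+\w+) captures the two-word label; \1 demands the same label
    # after the slash. /s lets "." cross newlines; /g walks the string.
    while ($text =~ m{<!--\s*(\w+\s+\w+)\s*-->\s*(.*?)\s*<!--\s*/\1\s*-->}gs) {
        push @blocks, [$1, $2];
    }
    return @blocks;
}

for my $block (extract_blocks($output)) {
    printf "%s: %s\n", @$block;
}
```

Using `m{...}` as the delimiter avoids having to escape the slash in the closing comment.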
Re: Pattern Search on HTML source. by NetWallah (Canon) on Dec 31, 2007 at 17:12 UTC
While strongly agreeing with the previous 2 posts (use an HTML parser module), one of the issues you may run into with your code is described in Death to Dot Star! For a quick hack, I'd suggest using `([^<]+)` instead. "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom
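A minimal sketch of the `([^<]+)` hack, assuming the hard-coded `1st table` comment pair from the question; the sub name `grab_first_table` is invented for illustration. Unlike `.`, a negated character class matches newlines without any modifier, but it stops at the first `<`, so content that itself contains a tag will not match at all.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the "1st table" comments,
# or undef when nothing matches.
sub grab_first_table {
    my ($html) = @_;
    # [^<]+ means "one or more characters that are not <", newlines included.
    my ($content) = $html =~ m{<!-- 1st table -->([^<]+)<!-- /1st table -->};
    return unless defined $content;
    $content =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and newlines
    return $content;
}

my $with_newlines = "<!-- 1st table -->\nWhat I want 1\n<!-- /1st table -->more stuff";
my $with_markup   = "<!-- 1st table --><b>bold</b><!-- /1st table -->";

print grab_first_table($with_newlines), "\n";
print defined(grab_first_table($with_markup)) ? "matched\n" : "no match on nested markup\n";
```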
Re^2: Pattern Search on HTML source. by Anonymous Monk on Dec 31, 2007 at 20:47 UTC
Why doesn't something like this work? `if ( $line_test=~/<!-- 1st table -->\s*(.*?)\s*<!-- \/1st table -->/g) {print $1;}` That is, when the text has a carriage return or newline in it.
Re^3: Pattern Search on HTML source. by NetWallah (Canon) on Jan 01, 2008 at 03:41 UTC
Try adding an "s" modifier, along with your "g". `s`: Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match. "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom
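To see the difference, here is a small sketch that runs the same pattern with and without `/s`. The sample string in `$line_test` is an assumption (the original data was not posted in full); it puts a newline inside the wanted text, which is the case `.` cannot handle on its own.

```perl
use strict;
use warnings;

# Assumed sample data: the wanted text itself spans two lines.
my $line_test = "<!-- 1st table -->\nWhat I\nwant 1\n<!-- /1st table -->";

# Without /s, "." refuses to match a newline, so (.*?) cannot span lines.
my ($without_s) = $line_test =~ /<!-- 1st table -->\s*(.*?)\s*<!-- \/1st table -->/;

# With /s, "." matches any character, newline included.
my ($with_s) = $line_test =~ /<!-- 1st table -->\s*(.*?)\s*<!-- \/1st table -->/s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: [$with_s]\n"  : "with /s: no match\n";
```

Note that `/s` only helps if the whole block is in one string; a pattern applied to one line at a time never sees both comments together.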
Re: Pattern Search on HTML source. by dwm042 (Priest) on Dec 31, 2007 at 18:49 UTC
I've had success looking at HTML tags and content of tags, on an as-needed basis, by grabbing pages with LWP::UserAgent and parsing the result with HTML::TreeBuilder or HTML::TableExtract. With the TreeBuilder module you get a tree of your HTML returned and you can walk down the tree and choose what you want to extract. TableExtract is a more specific parser just for tables. For docs on using HTML::TreeBuilder, there is the very nice HTML::Tree::Scanning. Update: TreeBuilder docs.
Re: Pattern Search on HTML source. by Popcorn Dave (Abbot) on Jan 02, 2008 at 00:39 UTC
If you're only trying to get the text between the tags, have you considered using HTML::Strip? Or do you need to know what tags the text is between? If so, have you looked at using HTML::TokeParser? One of those should probably do the trick for you. Revolution. Today, 3 O'Clock. Meet behind the monkey bars. I would love to change the world, but they won't give me the source code

