Parsing HTML tags with regex

kye has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing HTML tags with regex by BrowserUk (Patriarch) on Oct 03, 2002 at 04:00 UTC
Re-instated as requested.

    #! perl -sw
    use strict;
    use LWP::Simple;

    my $html = get("http://pvpgnservers.ath.cx");

    my @stuff = $html =~ m!
        <tr>\s+
        <td><font\ssize=1><a\shref="bnetd://217.172.178.113/">([^<]+?)</a></font></td>\s+
        <td><a\starget="_blank"\shref="http://www.pure-dream.com"><font\ssize=1>([^<]+?)</font></a></td>\s+
        <td><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td><font\ssize=1><a\shref="mailto:webmaster\@pure-dream.com">([^<]+?)</a></font></td>\s+
        <td><font\ssize=1>([^<]+)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        </tr>\s+
        <tr>
    !sx;

    print "@stuff\n";

    __DATA__
    C:\test>202414
    217.172.178.113 Pure-Dream Europe 0d 00:40 DreamDiver PvPGN BnetD Mod 1.1.6 Linux 42 9
Re: Parsing HTML tags with regex by samurai (Monk) on Oct 03, 2002 at 01:45 UTC
Parsing HTML code (correctly) with hand-crafted regexes is not a feat to be undertaken lightly. It has been known to cause chronic headaches in hobbyists and professionals alike. And then, you have to worry about parsing erroneous HTML code... There's a very good reason why people recommend you use an HTML::* module. But I suppose you should go ahead. You'll learn more than just a bit about regexes; you'll learn why CPAN is so important to the community.

-- perl: code of the samurai
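To make the fragility concrete, here is a minimal sketch (not from the original post). The pattern `$cell_re` and the three sample snippets are made up for illustration, modelled on the table cells discussed in this thread. It only shows that markup a browser treats as equivalent can defeat a regex tied to one exact spelling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pattern, modelled on the kind of regex discussed in this
# thread: it expects exactly <td><font size=1>TEXT</font></td>.
my $cell_re = qr{<td><font\ssize=1>([^<]+)</font></td>};

# Three snippets that a browser would render identically.
my %variants = (
    exact    => '<td><font size=1>Europe</font></td>',
    quoted   => '<td><font size="1">Europe</font></td>',
    reformat => "<TD>\n  <FONT SIZE=1>Europe</FONT>\n</TD>",
);

# Record the captured text for each variant, or undef when the
# pattern fails to match.
my %matched;
for my $name (sort keys %variants) {
    $matched{$name} = $variants{$name} =~ $cell_re ? $1 : undef;
    printf "%-8s %s\n", $name,
        defined $matched{$name} ? "matched '$matched{$name}'" : 'no match';
}
```

Only the `exact` variant matches. Quoting the attribute value or changing case and whitespace is enough to break the pattern, and each such case would need its own patch to the regex.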
Re: Re: Parsing HTML tags with regex by tfrayner (Curate) on Oct 03, 2002 at 13:27 UTC
...and once you're tired of it, check out HTML::TableExtract, which was practically written with your exact problem in mind :-)

HTH,
Tim

Update: Okay, I'm lazy, I didn't post the code to actually do the job (mainly because I think it really is that trivial). But the wonderful blakem submitted this node to another thread describing what I was thinking. So go upvote him instead :-)
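For reference, a rough sketch of what an HTML::TableExtract approach might look like. This is not the code from the node mentioned above. The inline `$html` is an assumption: a cut-down, three-column stand-in for the real page, built from values quoted elsewhere in this thread. In practice you would fetch the page with LWP::Simple's `get` instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Stand-in for the fetched page: a cut-down, three-column version of the
# table quoted elsewhere in this thread (the real page has eight columns).
my $html = <<'HTML';
<table>
<tr>
  <th><a href="index_adress.html"><font size=2>Address</font></a></th>
  <th><a href="index_location.html"><font size=2>Location</font></a></th>
  <th><a href="index_users.html"><font size=2>Users</font></a></th>
</tr>
<tr>
  <td><font size=1><a href="bnetd://217.172.178.113/">217.172.178.113</a></font></td>
  <td><font size=1>Europe</font></td>
  <td align=right><font size=1>42</font></td>
</tr>
<tr>
  <td><font size=1><a href="bnetd://211.62.58.113/">211.62.58.113</a></font></td>
  <td><font size=1>unknown</font></td>
  <td align=right><font size=1>1158</font></td>
</tr>
</table>
HTML

# Select the table by its column headers instead of by exact markup.
my $te = HTML::TableExtract->new( headers => [qw(Address Location Users)] );
$te->parse($html);

# Keep the row whose first cell is the server address we care about.
my @stuff;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        next unless defined $row->[0]
            && $row->[0] =~ /^\s*217\.172\.178\.113\s*$/;
        @stuff = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @stuff;    # trim stray whitespace in each cell
    }
}
print "@stuff\n";
```

Because the table is located by header text, the nested `<font>` and `<a>` tags and the attribute spelling do not matter here, which is the main attraction over a hand-written regex.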
Re: Parsing HTML tags with regex by davorg (Chancellor) on Oct 03, 2002 at 13:23 UTC
I know you don't want to use HTML::foo (tho' you never explain why) but in the interests of having at least one "best practices" answer listed here, an HTML::TreeBuilder solution is given below:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $page = get('http://pvpgnservers.ath.cx/') or die;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($page);
    $tree->eof;    # tell the parser the document is complete

    my @trs = $tree->find_by_tag_name('tr');

    my @stuff;
    foreach my $row (@trs) {
        # Pick the row whose text starts with the server address.
        if ($row->as_text =~ /^217\.172\.178\.113/) {
            @stuff = map { ref $_ ? $_->as_text : $_ } $row->content_list;
            last;
        }
    }

    print "@stuff\n";

--
<http://www.dave.org.uk>

"The first rule of Perl club is you do not talk about Perl club."
-- Chip Salzenberg
Re: Re: Parsing HTML tags with regex by Chmrr (Vicar) on Oct 03, 2002 at 13:33 UTC
See also this other node in the other thread; I take a slightly different approach to the problem, but also using HTML::TreeBuilder. TIMTOWTDI, indeed.

perl -pe '"I lo`+$^X$\"$]!$/"=~m%(.)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
Re: Parsing HTML tags with regex by McD (Chaplain) on Oct 03, 2002 at 13:30 UTC
This article is a good place to start. It describes how to do what you want, and some of the risks of doing it that way.

Peace,
-McD
Re: Parsing HTML tags with regex by PodMaster (Abbot) on Oct 03, 2002 at 14:14 UTC
I like HTML::TokeParser a lot, but I LOOOOVE HTML::TokeParser::Simple, so here is an example (cause the others touted such memory hogs as HTML::TreeBuilder, and HTML::Parser doesn't fit for this task):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple; # friendlier tokens
    use LWP::Simple;

    my $html = get("http://pvpgnservers.ath.cx");

    =head1 MY Test HTML

    The "TH" is the 1st trimmeg, so we gotta "seek" to it.
    Next is a check, to make sure there is a link to index_adress.html
    And if that passes, it means the html ain't changed significantly,
    so LOOOOOOOOOOOOOOOP while we got TR's {
        eat a TD and get_trimmed_text 8 times in a row
    }

        my $html = q{<tr>
        <th bgcolor="#808080"><a href="index_adress.html"><font size=2>Address</font></a></th>
        <th bgcolor="#808080"><a href="index_description.html"><font size=2>Description/URL</font></a></th>
        <th bgcolor="#808080"><a href="index_location.html"><font size=2>Location</font></a></th>
        <th bgcolor="#808080"><a href="index_uptime.html"><font size=2>Uptime</font></a></th>
        <th bgcolor="#808080"><a href="index_contact.html"><font size=2>Contact</font></a></th>
        <th bgcolor="#808080"><a href="index_software.html"><font size=2>Software</font></a></th>
        <th bgcolor="#808080"><a href="index_users.html"><font size=2>Users</font></a></th>
        <th bgcolor="#808080"><a href="index_games.html"><font size=2>Games</font></a></th>
        </tr>
        <tr>
        <td><font size=1><a href="bnetd://211.62.58.113/">211.62.58.113</a></font></td>
        <td><a target="_blank" href="unknown"><font size=1>unknown</font></a></td>
        <td><font size=1>unknown</font></td>
        <td align=right><font size=1>0d 03:26</font></td>
        <td><font size=1><a href="mailto:unknown">a PvPGN user</a></font></td>
        <td><font size=1>PvPGN BnetD Mod 1.1.6 Linux</font></td>
        <td align=right><font size=1>1158</font></td>
        <td align=right><font size=1>320</font></td>
        </tr>
        };

    =cut

    my $p = HTML::TokeParser::Simple->new(\$html);

    # Skip ahead to the header row.
    $p->get_tag('th') or die "crap";

    # Sanity check: the first header link should still point at index_adress.html.
    die "change code, stuff changed"
        unless $p->get_tag('a')->return_attr->{href} =~ /index_adress\.html/i;

    # Every row has 8 cells; print the trimmed text of each.
    while ( my $t = $p->get_tag('tr') ) {
        for (1..8) {
            $p->get_tag('td'); # cause the next token ain't "text"
            print $p->get_trimmed_text('/td')."\n";
        }
    }

Here are some other examples of HTML::TokeParser and/or HTML::TokeParser::Simple usage. You can get even more by using super search to look for "use HTML::TokeParser" within text.

Re: Requesting webpages which use cookies and session ids. (rev) What holiday is today? <!-- googleholiday.pl --> (crazyinsomniac) Re: Getting the Linking Text from a page (crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document? download code from scratchpad HTML::TokeParser token dumper (crazyinsomniac) Re: HTML Link Modifier Re: Re: (crazyinsomniac) Re: Extract info from HTML (crazyinsomniac) Re: Extract info from HTML (crazyinsomniac) Re: parsing HTML Re: Parsing HTML tags with regex

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.

Edit by tye to remove PRE tags around very long lines

