Using a Regex to extract tagged content

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


Re: Using a Regex to extract tagged content
by davido (Cardinal) on Feb 27, 2004 at 16:13 UTC

I know I could use some sort of XML module for this, but I'd rather do it with a regex.

I'd rather repair my car's transmission with a toothbrush, but my auto mechanic just laughed at me when I asked how to go about it.

Update:

For the best toothbrush you can get ahold of, try Regexp::Common. It has a "balanced" pattern ($RE{balanced}) that might help. If you decide to go the mechanic's tools route, use something along the lines of HTML::LinkExtractor, or HTML::SimpleParse, or HTML::TreeBuilder, or HTML::Parser.

Dave


Re: Re: Using a Regex to extract tagged content

by Fletch (Bishop) on Feb 27, 2004 at 16:24 UTC

Hey! It's all ball bearings nowadays. It's so simple maybe you need a refresher course. Now get some 3-in-1 oil and some gauze pads, and I'm gonna need 'bout ten quarts of anti-freeze, preferably Prestone. No, no make that Quaker State.


Re: Re: Re: Using a Regex to extract tagged content

by davido (Cardinal) on Feb 27, 2004 at 16:27 UTC

Dave


Re: Re: Using a Regex to extract tagged content

by PodMaster (Abbot) on Feb 28, 2004 at 10:46 UTC

YAPE::HTML

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re: Using a Regex to extract tagged content
by gryphon (Abbot) on Feb 27, 2004 at 16:27 UTC

Greetings Anonymous,

Also, I know I could use some sort of XML module for this, but I'd rather do it with regex.

Why? Why go to the trouble of making an incomplete regex that will eventually fail instead of just using a CPAN module? I would strongly recommend you read up about parsers like HTML::TokeParser. You will save yourself a lot of heartache. As a general rule, CPAN is always better than trying to do it yourself. Always.

I'm not sure exactly what you want to pull from your content, but here's a basic example to get you going:

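use HTML::TokeParser;
my ($type, $mesg);
# $content is assumed to already hold the markup to parse
my $page = HTML::TokeParser->new(\$content);
while (my $token = $page->get_tag('msg')) {
  # get_tag returns [$tag, \%attr, \@attrseq, $raw_tag_text]
  $type = $token->[1]{dest};                # value of the dest attribute
  # $token->[3] is only the raw start tag, so read the text up to </msg> instead
  $mesg = $page->get_trimmed_text('/msg');
}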

gryphon
code('Perl') || die;


Re: Using a Regex to extract tagged content
by lestrrat (Deacon) on Feb 27, 2004 at 16:25 UTC

davido gave the correct answer already, but here's a simple hack regexp that matches your particular example.

  /<([a-zA-Z]+)(?:(?:\s+[^>]+)*|\s*)>([^<]*)<\/\1>/

This *will* break in a lot of cases, so if you can at all, use one of the modules that davido recommended, not a silly hack like this.
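To make the "will break" warning concrete, here is a minimal sketch that runs the pattern above against three made-up inputs. The sample strings and the `first_match` helper are assumptions for illustration only; they are not from the original question.

```perl
use strict;
use warnings;

# The pattern from the post, compiled once so every sample uses the same regex.
my $tag_re = qr/<([a-zA-Z]+)(?:(?:\s+[^>]+)*|\s*)>([^<]*)<\/\1>/;

# Hypothetical inputs: one simple element, one with a nested tag,
# and one with a '>' inside an attribute value.
my %samples = (
    simple  => '<msg dest="admin">Disk full</msg>',
    nested  => '<msg dest="admin">Disk <b>full</b></msg>',
    gt_attr => '<msg dest="a>b">Disk full</msg>',
);

# Return [tag, text] for the first match, or undef when nothing matches.
sub first_match {
    my ($input) = @_;
    return $input =~ $tag_re ? [ $1, $2 ] : undef;
}

for my $name (sort keys %samples) {
    my $m = first_match($samples{$name});
    printf "%-8s tag=%s text=%s\n", $name, $m ? @$m : ('(none)', '(none)');
}
```

The simple sample yields the tag msg with the text Disk full. The nested sample silently matches the inner b element instead of msg, and the gt_attr sample returns text that begins with a leftover piece of the attribute value. Neither failure raises an error, which is the main argument for using a real parser.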


Re: Using a Regex to extract tagged content
by injunjoel (Priest) on Feb 27, 2004 at 16:35 UTC

Just a quick (and minimally tested) thought:
my @matches;
push @matches, { 'tag' => $1, 'txt' => $2 } while $data =~ m|<(\w+)[^>]*>([^<]*)</\1>|g;
This will give you an array of hashes (@matches), each hash containing the keys 'tag' (the tag name) and 'txt' (the text between the tags).
HTH
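For illustration, a minimal sketch of collecting tag/text pairs into an array of hashes with a global match in a loop. The contents of `$data` and its tag names are made up here, and the pattern only handles non-nested elements.

```perl
use strict;
use warnings;

# Hypothetical sample data; the tag names are invented for this sketch.
my $data = '<to>alice</to><from dept="ops">bob</from><body>Disk full</body>';

# In scalar context, m//g resumes where the previous match ended,
# so the loop visits each non-nested <tag>text</tag> pair in turn.
my @matches;
while ($data =~ m|<(\w+)[^>]*>([^<]*)</\1>|g) {
    push @matches, { tag => $1, txt => $2 };
}

# Print one "tag: text" line per collected hash.
print "$_->{tag}: $_->{txt}\n" for @matches;
```

Each element of @matches is a hash reference, so the tag name and text are read as `$matches[0]{tag}` and `$matches[0]{txt}`.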

UPDATE:


