Using a Regex to extract tagged content

by Anonymous Monk
on Feb 27, 2004 at 16:09 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been looking around through some regex docs and I can't find out how to do this:

I have strings like "<msg dest="*">hey</msg>".

And I want to get the 'msg' from the starting tag, and everything between "<msg dest="*">" and "</msg>".

What I've come up with is as follows:

my ($type,$mesg) = ($data =~ /<([A-Za-z]+) [+>]+>(.+)<\/$1>/);


If that makes sense: the regex needs to reuse whatever was captured by ([A-Za-z]+) to find the matching end tag. I read about how to do this in the perldocs that come with Windows, but I'm on Linux now and can't find it...

Also, I know I could use some sort of XML module for this, but I'd rather do it with a regex.

Thanks

Regards,
Jasper

Replies are listed 'Best First'.
Re: Using a Regex to extract tagged content
by davido (Cardinal) on Feb 27, 2004 at 16:13 UTC
    I know I could use some sort of XML module for this, but I'd rather do it with a regex.

    I'd rather repair my car's transmission with a toothbrush but my automechanic just laughed at me when I asked how to go about it.

    Update:

    For the best toothbrush you can get ahold of, try Regexp::Common. It has a "balanced" pattern ($RE{balanced}) that might help. If you decide to go the mechanic's tools route, use something along the lines of HTML::LinkExtractor, or HTML::SimpleParse, or HTML::TreeBuilder, or HTML::Parser.


    Dave

      Hey! It's all ball bearings nowadays. It's so simple maybe you need a refresher course. Now get some 3-in-1 oil and some gauze pads, and I'm gonna need 'bout ten quarts of anti-freeze, preferably Prestone. No, no make that Quaker State.

        Fletch (the movie).


        Dave

      *cough* how to classify this toothdrill? YAPE::HTML - Yet Another Parser/Extractor for HTML

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Using a Regex to extract tagged content
by gryphon (Abbot) on Feb 27, 2004 at 16:27 UTC

    Greetings Anonymous,

    Also, I know I could use some sort of XML module for this, but I'd rather do it with a regex.

    Why? Why go to the trouble of making an incomplete regex that will eventually fail instead of just using a CPAN module? I would strongly recommend you read up about parsers like HTML::TokeParser. You will save yourself a lot of heartache. As a general rule, CPAN is always better than trying to do it yourself. Always.

    I'm not sure exactly what you want to pull from your content, but here's a basic example to get you going:

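    use HTML::TokeParser;

    my ($type, $mesg);
    my $page = HTML::TokeParser->new(\$content);
    while (my $token = $page->get_tag('msg')) {
        $type = $token->[1]{dest};        # value of the dest attribute
        # $token->[3] is the raw text of the start tag itself;
        # get_text('/msg') returns the text up to the closing </msg>
        $mesg = $page->get_text('/msg');
    }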

    gryphon
    code('Perl') || die;

Re: Using a Regex to extract tagged content
by lestrrat (Deacon) on Feb 27, 2004 at 16:25 UTC

    davido gave the correct answer already, but here's a simple hack regexp that matches your particular example.

    /<([a-zA-Z]+)(?:(?:\s+[^>]+)*|\s*)>([^<]*)<\/\1>/

    This *will* break in a lot of cases, so if you can at all, use one of the modules that davido recommended, not a silly hack like this.
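For reference, here is a minimal sketch of how the pattern above could be applied to the sample string from the question. The variable names ($data, $type, $mesg) follow the original post. Inside the pattern, \1 is the backreference to the first capture group, which is what the $1 in the question's attempt was reaching for. This only illustrates the mechanics and inherits every limitation mentioned above.

```perl
use strict;
use warnings;

# Sample string from the question.
my $data = '<msg dest="*">hey</msg>';

# \1 refers back to whatever the first group captured (the tag name),
# so the closing tag has to match the opening one.
my ($type, $mesg) =
    $data =~ /<([a-zA-Z]+)(?:(?:\s+[^>]+)*|\s*)>([^<]*)<\/\1>/;

if (defined $type) {
    print "type=$type mesg=$mesg\n";   # prints "type=msg mesg=hey"
} else {
    print "no match\n";
}
```

In list context the match returns the captured groups directly, so there is no need to read $1 and $2 afterwards.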

Re: Using a Regex to extract tagged content
by injunjoel (Priest) on Feb 27, 2004 at 16:35 UTC
    Greetings all,
    just a quick (and minimally tested) thought
    my @matches = map{ m|<(\w+)[^>]*>([^<]*)</\1>|g; {'tag' => $1, 'txt' => $2}} $data;
    this will give you an array of hashes (@matches), each hash containing the keys 'tag' (the tag name) and 'txt' (the text between the tags).
    HTH

    UPDATE: obviously map still eludes me! disregard the response above.
    -Joel
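What the map attempt seems to be aiming for can be sketched with a while loop over a /g match instead. This is only a guess at the intent: the helper name extract_tags and the second tag in the sample input are made up for illustration, and the character classes [^>] and [^<] are assumed. It shares the fragility of any regex-based approach (no nesting, no '>' inside attribute values).

```perl
use strict;
use warnings;

# Hypothetical helper: collect every simple <tag ...>text</tag> pair.
sub extract_tags {
    my ($data) = @_;
    my @matches;
    # In scalar context, /g resumes where the previous match ended,
    # so each pass through the loop picks up the next tag pair.
    while ($data =~ m|<(\w+)[^>]*>([^<]*)</\1>|g) {
        push @matches, { tag => $1, txt => $2 };
    }
    return @matches;
}

my @found = extract_tags('<msg dest="*">hey</msg><note>bye</note>');
print "$_->{tag}: $_->{txt}\n" for @found;
```

Each element of @found is a hash reference with the keys 'tag' and 'txt', matching the structure described in the post.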
