Parse::RecDescent for parsing URLs

by artist (Parson)
on Jul 26, 2007 at 18:47 UTC

artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I am trying to figure out the patterns of URLs on a given site with Parse::RecDescent. How would I go about it? Has any work been done on this? I am familiar with Parse::RecDescent; my question is about design. Should I use clues in the URL to identify different chunks, such as date, extension, and categories? My ultimate purpose is to take new URLs, match each against one of the identified patterns, and categorize it appropriately.
--Artist

Replies are listed 'Best First'.
Re: Parse::RecDescent for parsing URLs
by castaway (Parson) on Jul 27, 2007 at 06:08 UTC
    Patterns of URLs? What does that mean? Care to post some actual examples? I can do P::RD, but I have no idea what you mean.

    C.

        Parse::RecDescent is used to create parsers, yet a parser for URIs already exists. URI and its extension URI::QueryParam should do the trick.

        Update: Here's an example:

        use URI             qw( );
        use URI::QueryParam qw( );

        foreach (
            'http://www.perlmonks.org/index.pl?node_id=629153',
            'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
        ) {
            my $uri = URI->new($_);

            my @node_ids    = $uri->query_param('node_id');
            my @node_titles = $uri->query_param('node');

            if ( (@node_ids && @node_titles)
              || @node_ids > 2
              || @node_titles > 2
            ) {
                warn("$uri: Error: Bad uri\n");
            }

            if (!@node_ids && !@node_titles) {
                warn("$uri: Warning: Unrecognized uri\n");
                next;
            }

            if (@node_ids) {
                print("$uri: By Id ($node_ids[0])\n");
            }

            if (@node_titles) {
                print("$uri: By Title ($node_titles[0])\n");
            }
        }
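        The same approach can be pointed at the path rather than the query string, which is closer to the date/extension/category chunks mentioned in the question. What follows is only a rough sketch: the url_pattern helper, the <year>/<num>/<name> labels, and the example.com URLs are all made up for illustration, and a real site would need its own rules.

```perl
use strict;
use warnings;
use URI ();

# Hypothetical helper: reduce a URL to a coarse "pattern" by replacing
# path segments that look like dates or file names with placeholders.
sub url_pattern {
    my ($url) = @_;
    my $uri = URI->new($url);

    my @labels;
    for my $seg (grep { length } $uri->path_segments) {
        if    ($seg =~ /^\d{4}$/)       { push @labels, '<year>' }
        elsif ($seg =~ /^\d{1,2}$/)     { push @labels, '<num>' }
        elsif ($seg =~ /^(.+)\.(\w+)$/) { push @labels, "<name>.$2" }
        else                            { push @labels, $seg }
    }

    return join '/', $uri->host, @labels;
}

# Group some made-up URLs by the pattern they reduce to.
my %by_pattern;
for my $url (
    'http://example.com/news/2007/07/26/story.html',
    'http://example.com/news/2007/07/27/other.html',
    'http://example.com/about/contact.php',
) {
    push @{ $by_pattern{ url_pattern($url) } }, $url;
}

for my $pattern (sort keys %by_pattern) {
    print "$pattern\n";
    print "    $_\n" for @{ $by_pattern{$pattern} };
}
```

        A new URL is then categorized by computing its pattern and looking it up in the hash; anything that lands in a pattern of its own is a candidate for a new rule.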

        Or maybe you are trying to extract data from a downloaded HTML page? If so, use an existing HTML parser (such as HTML::TreeBuilder and HTML::Tree) instead of rolling your own.

        I've found XPath to be very useful. HTML::TreeBuilder::XPath allows you to query the HTML document for information. The Firebug extension for Firefox can help you find the paths.
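        As a small illustration of querying a document with XPath, here is a sketch that pulls link targets out of an inline HTML snippet. The markup, the "links" id, and the XPath expression are invented for the example; on a real page the path would come from inspecting the document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up HTML standing in for a downloaded page.
my $html = <<'HTML';
<html><body>
  <ul id="links">
    <li><a href="/index.pl?node_id=1">First</a></li>
    <li><a href="/index.pl?node_id=2">Second</a></li>
  </ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Collect the href attribute of every link inside the list.
my @hrefs = $tree->findvalues('//ul[@id="links"]//a/@href');
print "$_\n" for @hrefs;

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```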

        If PerlMonks is not just an example, I recommend downloading the XML version of pages by adding the displaytype=xml query parameter to the requested URIs. The same advice I gave for HTML applies to XML: use an existing parser. XPath is very useful for XML too.
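        Adding that parameter can be done with the same URI::QueryParam interface used above, rather than by string concatenation. A minimal sketch, using this thread's node id as the example URI and leaving the actual download and XML parsing out:

```perl
use strict;
use warnings;
use URI             ();
use URI::QueryParam ();

my $uri = URI->new('http://www.perlmonks.org/index.pl?node_id=628981');

# Set displaytype=xml, leaving the existing parameters alone.
$uri->query_param(displaytype => 'xml');

print "$uri\n";
```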

Re: Parse::RecDescent for parsing URLs
by Anonymous Monk on Jul 27, 2007 at 08:22 UTC
    I'm sure GOOGLE has done something on this...
