Parse::RecDescent for parsing URLs

artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I am trying to figure out the URL patterns used on a given site with Parse::RecDescent. How would I go about it? Has any work been done on this? I am familiar with Parse::RecDescent; my question is about design. Should I use clues in the URL to identify different chunks, such as date, extension, and categories? My ultimate goal is to take new URLs, match each against one of the identified patterns, and categorize it appropriately.

--Artist

Replies are listed 'Best First'.
Re: Parse::RecDescent for parsing URLs by castaway (Parson) on Jul 27, 2007 at 06:08 UTC
Patterns of URLs? What does that mean? Care to post some actual examples? I can do P::RD, but I have no idea what you mean. C.
Re^2: Parse::RecDescent for parsing URLs by artist (Parson) on Jul 27, 2007 at 17:46 UTC
I am looking to extract URL patterns from given sites. Example: http://www.perlmonks.org/index.pl?node_id=629153 is a valid question-answer node, whereas http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads is not. The pattern here is that node_id=\d+ identifies a valid question-answer node. Extracting these types of patterns from a given site can help me determine the nature of a link. I would like to do this site-wide, automatically. Hopefully, I am making sense here. --Artist
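One way to picture the "extract patterns site-wide" idea is to reduce every URL to a coarse signature and count how often each signature occurs. What follows is only a minimal sketch in core Perl, not an established method: the helper name url_signature and the choice of placeholders (\d+ for all-digit values, .+ for anything else) are assumptions made for illustration, and fragments and path segments are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each query value with a placeholder
# describing its character class, so similar URLs collapse together.
sub url_signature {
    my ($url) = @_;
    my ($base, $query) = split /\?/, $url, 2;
    return $base unless defined $query && length $query;

    my @parts;
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;              # skip empty pairs like "a=1&&b=2"
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        my $class = $value eq ''          ? ''
                  : $value =~ /\A\d+\z/   ? '\d+'   # all digits
                  :                         '.+';   # anything else
        push @parts, "$key=$class";
    }
    # Sort the keys so parameter order does not create new signatures.
    return $base . '?' . join('&', sort @parts);
}

my @urls = (
    'http://www.perlmonks.org/index.pl?node_id=629153',
    'http://www.perlmonks.org/index.pl?node_id=629001',
    'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
);

# Count how many URLs share each signature.
my %count;
$count{ url_signature($_) }++ for @urls;

printf "%3d  %s\n", $count{$_}, $_ for sort keys %count;
```

Here the two node_id URLs collapse into a single signature, while the node=... URL gets its own. Deciding which signatures mean "question-answer node" would still need a human look or some labelled examples.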
Re^3: Parse::RecDescent for parsing URLs by ikegami (Patriarch) on Jul 27, 2007 at 17:53 UTC
Parse::RecDescent is used to create parsers, yet a parser for URIs already exists. URI and its extension URI::QueryParam should do the trick.

Update: Here's an example:

    use strict;
    use warnings;

    use URI             qw( );
    use URI::QueryParam qw( );

    foreach (
        'http://www.perlmonks.org/index.pl?node_id=629153',
        'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
    ) {
        my $uri = URI->new($_);

        # In list context, query_param returns every value given for the key.
        my @node_ids    = $uri->query_param('node_id');
        my @node_titles = $uri->query_param('node');

        # Both keys at once, or either key repeated, is ambiguous.
        if ( (@node_ids && @node_titles) || @node_ids > 1 || @node_titles > 1 ) {
            warn("$uri: Error: Bad uri\n");
        }

        # Neither key present: not a link type we know about.
        if (!@node_ids && !@node_titles) {
            warn("$uri: Warning: Unrecognized uri\n");
            next;
        }

        if (@node_ids) {
            print("$uri: By Id ($node_ids[0])\n");
        }

        if (@node_titles) {
            print("$uri: By Title ($node_titles[0])\n");
        }
    }
Re^3: Parse::RecDescent for parsing URLs by ikegami (Patriarch) on Jul 27, 2007 at 18:35 UTC
Or maybe you are trying to extract data from a downloaded HTML page? If so, use an existing HTML parser (such as HTML::TreeBuilder and HTML::Tree) instead of rolling your own. I've found XPath to be very useful. HTML::TreeBuilder::XPath allows you to query the HTML document for information. The Firebug extension for Firefox can help you find the paths. If PerlMonks is not just an example, I recommend downloading the XML version of pages by adding the `displaytype=xml` query parameter to the requested URIs. The same advice I gave for HTML applies to XML: use an existing parser, and XPath is very useful for XML too.
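As a rough illustration of the `displaytype=xml` suggestion, here is a small sketch in core Perl that appends the parameter to a URL string. The helper name with_xml_display is made up for this example, and joining with & is an assumption about what the site accepts. With URI and URI::QueryParam installed, calling $uri->query_param(displaytype => 'xml') would be the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper: append displaytype=xml to a URL given as a string,
# keeping any #fragment at the end where it belongs.
sub with_xml_display {
    my ($url) = @_;
    my ($head, $fragment) = split /#/, $url, 2;

    # Pick the separator: '?' if there is no query yet, nothing if the
    # URL already ends in a separator, '&' otherwise (an assumption).
    my $sep = $head !~ /\?/      ? '?'
            : $head =~ /[?&;]\z/ ? ''
            :                      '&';

    $head .= $sep . 'displaytype=xml';
    return defined $fragment ? "$head#$fragment" : $head;
}

print with_xml_display('http://www.perlmonks.org/index.pl?node_id=629153'), "\n";
```

The resulting URL can then be fetched and handed to whichever XML parser you already use.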
Re: Parse::RecDescent for parsing URLs by Anonymous Monk on Jul 27, 2007 at 08:22 UTC
I'm sure GOOGLE has done something on this...