http://qs321.pair.com?node_id=628981

artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I am trying to figure out the patterns of URLs on a given site with Parse::RecDescent. How would I go about it? Has any work been done on this? I am familiar with Parse::RecDescent; my question is about design. Should I use clues in the URL to identify different chunks, such as date, extension, and categories? My ultimate goal is to take new URLs, match each one against the identified patterns, and categorize it appropriately.
--Artist

Replies are listed 'Best First'.
Re: Parse::RecDescent for parsing URLs
by castaway (Parson) on Jul 27, 2007 at 06:08 UTC
    Patterns of URLs? What does that mean? Care to post some actual examples.. I can do P::RD, but I have no idea what you mean..

    C.

        Parse::RecDescent is used to create parsers, yet a parser for URIs already exists. URI and its extension URI::QueryParam should do the trick.

        Update: Here's an example:

        use strict;
        use warnings;

        use URI             qw( );
        use URI::QueryParam qw( );

        foreach (
            'http://www.perlmonks.org/index.pl?node_id=629153',
            'http://www.perlmonks.org/index.pl?node=Recently%20Active%20Threads',
        ) {
            my $uri = URI->new($_);

            # query_param returns every value given for the key.
            my @node_ids    = $uri->query_param('node_id');
            my @node_titles = $uri->query_param('node');

            # Reject URIs that carry both keys, or more than one value for a key.
            if (   ( @node_ids && @node_titles )
                || @node_ids > 1
                || @node_titles > 1 )
            {
                warn("$uri: Error: Bad uri\n");
                next;
            }

            if ( !@node_ids && !@node_titles ) {
                warn("$uri: Warning: Unrecognized uri\n");
                next;
            }

            if (@node_ids) {
                print("$uri: By Id ($node_ids[0])\n");
            }

            if (@node_titles) {
                print("$uri: By Title ($node_titles[0])\n");
            }
        }
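        The same module can also pull out the path chunks the original question mentions (dates, extensions, categories). What follows is only a rough sketch under assumptions of my own: the url_pattern helper, the placeholder names, and the example.com URLs are all made up for illustration, and the guessing rules would need tuning for a real site.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: reduce a URL's path to a rough "pattern" by
# replacing variable-looking segments with placeholders.
sub url_pattern {
    my ($url) = @_;
    my $uri = URI->new($url);

    # path_segments splits the path on '/'; absolute paths start with an
    # empty segment, so empty segments are dropped here.
    my @segments = grep { length } $uri->path_segments;

    my @parts;
    for my $seg (@segments) {
        if    ( $seg =~ /^\d{4}$/ )            { push @parts, '{year}' }
        elsif ( $seg =~ /^\d+$/ )              { push @parts, '{num}' }
        elsif ( $seg =~ /\.([A-Za-z0-9]+)$/ )  { push @parts, "{name}.$1" }
        else                                   { push @parts, $seg }
    }
    return '/' . join( '/', @parts );
}

# Made-up URLs, grouped by the pattern they reduce to.
my @urls = (
    'http://example.com/2007/07/27/story.html',
    'http://example.com/2006/12/01/other.html',
    'http://example.com/news/sports/index.php',
);

my %by_pattern;
push @{ $by_pattern{ url_pattern($_) } }, $_ for @urls;

for my $pattern ( sort keys %by_pattern ) {
    print "$pattern\n";
    print "    $_\n" for @{ $by_pattern{$pattern} };
}
```

        A new URL can then be categorized by computing its pattern and looking it up in the hash; literal segments such as "news" or "sports" are left alone so they can act as category names.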

        Or maybe you are trying to extract data from a downloaded HTML page? If so, use an existing HTML parser (such as HTML::TreeBuilder and HTML::Tree) instead of rolling your own.

        I've found XPath to be very useful. HTML::TreeBuilder::XPath allows you to query the HTML document for information. The Firebug extension for Firefox can help you find the paths.
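        To give an idea of what such a query looks like, here is a small sketch that collects link targets with HTML::TreeBuilder::XPath. The HTML snippet, the "threads" id, and the second link title are invented for the example; a real page would need its own XPath expression.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up HTML standing in for a downloaded page.
my $html = <<'HTML';
<html><body>
<ul id="threads">
  <li><a href="/index.pl?node_id=628981">Parse::RecDescent for parsing URLs</a></li>
  <li><a href="/index.pl?node_id=629153">Another thread</a></li>
</ul>
<a href="/about.html">About</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select only the links inside the list with id "threads".
my @links = map { $_->attr('href') }
    $tree->findnodes('//ul[@id="threads"]//a[@href]');

print "$_\n" for @links;

# Free the tree explicitly, as HTML::TreeBuilder recommends.
$tree->delete;
```

        The hrefs collected this way can then be handed to URI for further inspection.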

        If PerlMonks is not just an example, I recommend downloading the XML version of pages by adding the displaytype=xml query parameter to the requested URIs. The same advice I gave for HTML applies to XML: use an existing parser, and XPath is very useful for XML too.
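        Adding the parameter is easy with the URI modules already mentioned. A minimal sketch, where xml_version_of is a hypothetical helper name of mine:

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;

# Hypothetical helper: return the URL with displaytype=xml set,
# leaving any existing query parameters in place.
sub xml_version_of {
    my ($url) = @_;
    my $uri = URI->new($url);
    $uri->query_param( displaytype => 'xml' );
    return $uri->as_string;
}

print xml_version_of('http://www.perlmonks.org/index.pl?node_id=628981'), "\n";
```

        Going through query_param rather than string concatenation avoids having to worry about whether the URL already has a query string.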

Re: Parse::RecDescent for parsing URLs
by Anonymous Monk on Jul 27, 2007 at 08:22 UTC
    I'm sure GOOGLE has done something on this...