Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Longest Matching URL

by marceus (Initiate)
on Mar 20, 2002 at 17:05 UTC ( #153027=perlquestion: print w/replies, xml ) Need Help??

marceus has asked for the wisdom of the Perl Monks concerning the following question:

Given an unsorted list of URLs (ie from a DBI query) and a target URL to match ... I need to find the longest Matching URL in the unsorted list ... and the portion of the Target URL that isn't matched.

For example if the list contains:
http://www.someplace.com/foo
http://www.someplace.com/foo/bar
http://www.someplace.else.com/foo

and the target is:
http://www.someplace.com/foo/fighter/index.html

it should return:
http://www.someplace.com/foo, fighter/index.html

I have the following code (generalized for this post):

sub getMapping = { my $uri_in = shift; $uri_in =~ m!.*://([^/]*)!; my $domain = $1; my $dbh = getDBH(); my $path = ''; my $query = qq{ SELECT ftpID, uri, path # first get all records that FROM tURIMapping # refer to the domain WHERE uri REGEXP '.*$domain.*'; }; my $sth = $dbh->prepare($query); $sth->execute; my $aMapping = undef; # now let's search for the best match my $i = '0'; while ($aMapping = $sth->fetchrow_arrayref()){ # while there are still mappings my $target = $uri_in; # set the target while ($i < 5) { last if $aMapping->[1] eq $target; # exit if this is the mapping $target =~ s!(^\w+://.*)/(.*)$!$1!i; # remove the last part of th +e uri $path = '/'.$2.$path if $2; # save the last part as the path last if $target =~ m!^\w+://[^/]/?$!i; # exit if this just a doma +in $i++; } # end while ($i) last if $aMapping->[1] eq $target; # double check this is the targe +t } # end while ($aMapping = $sth->fetchrow_arrayref()) die "No Mapping Found for $uri_in" unless $aMapping; my ($ftpID, $uri, $map_path) = @$aMapping; # if the ftpID is null my $type = $ftpID? 'ftpAccount':'filesystem'; # this is a filesystem $sth->finish; $dbh->disconnect; return ($type, $map_path.$path, $ftpID); };
The code did come at a point where I was just beginning to write perl after a long haitus so I apologize for the "unique" style. This code has been one of the buggiest least robust parts of the entire project. So I am here asking if this is the best way? Is there a better Algorithm/Solution? Is the code just riddled with bugs?

Replies are listed 'Best First'.
Re: Longest Matching URL
by Corion (Patriarch) on Mar 20, 2002 at 17:34 UTC

    If you have control over the database, I'd first of all restructure the database to get rid of the REGEXP $uri part in your SQL statement - simply introduce either a new column called host, or, to completely stay within the relational mindset, introduce a new table in which you store all your hosts, and introduce a new column in tURIMapping, in which you store references to all hosts. The second idea is cleaner in the sense of pure relational databases and normalisation, but the first alternative is much easier to implement.

    Searching for the "best" (==longest) match is done easily if you sort the list by the length of the entries in descending order. I'm not sure if you can convince SQL to do this with a clever ORDER BY len() clause, but you could introduce a second column, which stores the length of each URL, and modify your select statement to SELECT ftpID,uri,path FROM tURIMapping WHERE host=? ORDER BY length DESC.

    That way, the database will do most of the work for you, and you now only need to walk the results until you find the first string that (partly) matches your searchstring - as the database has ordered your results, you can guarantee that this will be the longest possible match.

    Note that, if you decide to split off the hostname from the rest of the uri, you will have to slightly modify the way you construct your return values and the values you put into the query.

    For the uri parsing part, I recommend taking a look at URI::URL module, which nicely splits up a lot of obscure uris.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Longest Matching URL
by AidanLee (Chaplain) on Mar 20, 2002 at 17:58 UTC

    untested:

    my @candidates = qw( http://www.someplace.com/foo http://www.someplace.com/foo/bar http://www.someplace.else.com/foo ); my $target = 'http://www.someplace.com/foo/fighter/index.html'; my %leader = ( uri => '', length => 0 ); foreach my $candidate (@candidates) { if( $target =~ /^$candidate/ ) { if( length($candidate) > $leader{length} ) { @leader{'uri','length'} = ($candidate,length($candidate)); } } } if( $leader{uri} ) { print $leader{uri},',',substr($target,$leader{length}); } else { print "no matches" }
      That works beautifully ... thanks.

      I remember I initally kept clearn of length() because I was afraid it would cut off at odd places in the URL. I suppose like most new (or new again) coders I forgot my KISS principle.

      Thank you.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://153027]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (6)
As of 2022-06-28 19:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My most frequent journeys are powered by:









    Results (92 votes). Check out past polls.

    Notices?