Sorting URLs on domain/host: sortkeys generation

by parv (Parson)
on Mar 30, 2003 at 03:26 UTC ( [id://246680] )

parv has asked for the wisdom of the Perl Monks concerning the following question:

Howdy, I have the following to sort a hash of URLs based on domain/host. It works well for hosts w/ names but may not work great for IP addresses (verify yourself). Assuming %url contains URLs as the keys & anything else for values (i have page titles from ~/.netscape/history.dat, thanks to Netscape::History)...

printf "%s\n\n" , $_ foreach # extract the sorted complete URL list map { $_->[0] } # sort on massged host name or the complete url sort { $a->[1] cmp $b->[1] || $a->[0] cmp $b->[0] } # set up for extracting the complete URL & sortkeys map { # extract host/domain components in reverse order my @host = reverse split '\.' , ( split '/' , substr($_ , index($_, '://') + 3) )[0 +]; [ # current URL $_ , # put domain at front; everything else afterwords # ---- # poo.koo -> poo.koo # web.poo.koo:80 -> poo.koo:80.www # rand.web.poo.koo -> poo.koo.web.rand # ---- (scalar @host > 1 ? $host[1] . '.' : '') . $host[0] . ( scalar @host < 3 ? '' : '.' . join('.' , @host[2..$# +host]) ) ] } keys %url;

...my question is: could the domain/host sortkeys be generated in a better/less verbose way?

Replies are listed 'Best First'.
Re: Sorting URLs on domain/host: sortkeys generation
by BrowserUk (Patriarch) on Mar 30, 2003 at 10:20 UTC

    This should get close to what you want. It uses the GRT method of prepending the sort key to the string prior to sorting and stripping it afterwards.

    To generate the sort key, it extracts the domain name from the url, splits it on '.' and reverses the order of the sub-components.

    Ie. http://news.bbc.co.uk/something.htm

    Becomes uk.co.bbc.newshttp://news.bbc.co.uk/something.htm

    This will group .gov.uk together with .co.uk etc. It will also group numeric urls by their quads in the correct order, though they won't be in strictly numeric order as given. Adding code in the sort block to handle this wouldn't be too hard if it's a requirement (one possible approach is sketched after the code below).

    #! perl -slw
    use strict;

    open IN, '<links.dat' or die $!;
    chomp( my @links = <IN> );
    close IN;

    @links = grep m[^\w+://], @links;   # remove relative urls.

    my @sorted = map{
            substr($_, 1+index($_, $;));
        } sort
          map{
            m[\w+://([^/]+)/];
            join( '.', reverse split /\./, $1 ) . $; . $_;
        } @links;

    print for @sorted;

    You might also consider adding the protocol to the end of the sort key if any of your urls are non-http links.
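    A minimal sketch of what those two tweaks might look like, reusing the GRT shape above. The zero-padding of the octets and the lowercased protocol suffix are assumptions on my part, not code from this node:

    #! perl -slw
    use strict;

    # Sketch only: pad numeric quads so they compare numerically, and append
    # the protocol so non-http links sort predictably.
    my @links = (
        'http://news.bbc.co.uk/something.htm',
        'ftp://ftp.cpan.org/pub/',
        'http://10.1.1.1/index.html',
        'http://2.2.2.2/index.html',
    );

    my @sorted = map { substr( $_, 1 + index( $_, $; ) ) }
                 sort
                 map {
                     my ( $proto, $host ) = m[^(\w+)://([^/]+)];
                     my $key = $host =~ /^\d+(?:\.\d+){3}(?::\d+)?$/
                         # zero-pad each octet so 2.x sorts before 10.x
                         ? join( '.', map { sprintf '%03d', $_ } ( split /[.:]/, $host )[ 0 .. 3 ] )
                         # otherwise reverse the name components as before
                         : join( '.', reverse split /\./, $host );
                     $key . '.' . lc($proto) . $; . $_;
                 } @links;

    print for @sorted;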


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

      Ah, thanks much! It's in the regex!

      Funny thing... first i tried the regex in ST, which became hairy, so i used the "substr()" version. Then, when i converted ST to GRT, i didn't think of the regex again. ( Welcome to TunnelVision(R). ) Your version of GRT is definitely short and sweet.

      I will try to post another set of benchmark results unless somebody else beats me to it.

      First reading about the two transforms and then actually implementing them, i realized that programming is about compromise among readability, speed, typing, and ease of implementation.

      Some direct observations which led to the code presented later...

      • One thing not to forget is the underlying idea of GRT: a plain sort is faster than a sort w/ a sort sub (paraphrased from the sort paper "A Fresh Look at Efficient Perl Sorting").
      • Use of a regex puts that GRT below the substr()/index()-based GRT, but still above ST. If one could ease the sort criteria, unlike in this case, GRT (even w/ a regex) could be used in place of ST.
      • substr() seemed to be a bit faster than split(). So did use of $; in place of "\x00", but this is minute. (Relatively speaking, the second case is more of a micro optimization than the first.)
      • The difference due to quick-, merge-, and stable sorts is so minute (in this case) that benchmarking them is not worth the effort. (micro optimization)
      • Benchmarking is a tedious and easily manipulated process. The number of items to print, what to do w/ the program output, and so on affect the numbers. (an obvious point i suppose.)

      Below are the benchmark results as promised. In the results below, the difference between "GRT" and "GRT UK" is that the former doesn't use regexen. Differences between "GRT" and "ST" are the inherent ones.

      URL Length Statistics:
        Total URLs: 1496  Lengths' Sum: 216233
        Min: 13  Max: 1270
        Mean: 144.541  Median: 62.000
        Std. Deviation: 235.041  Variance: 55244.207
      
      Benchmark: running GRT, GRT UK, ST for at least 5 CPU seconds...
             GRT:  5 wallclock seconds ( 4.61 usr +  0.39 sys =  5.00 CPU) @  9.00/s (n=45)
          GRT UK:  5 wallclock seconds ( 4.84 usr +  0.36 sys =  5.20 CPU) @  7.69/s (n=40)
              ST:  5 wallclock seconds ( 4.78 usr +  0.24 sys =  5.02 CPU) @  7.17/s (n=36)
               Rate     ST GRT UK    GRT
      ST     7.17/s     --    -7%   -20%
      GRT UK 7.69/s     7%     --   -15%
      GRT    9.00/s    26%    17%     --
      

      You will notice in the code that the key generation part is pretty much the same (in order to get the same output) among all three transforms. Code follows...
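      (The benchmark code itself is not reproduced here. Purely as a rough sketch of how such a comparison might be wired up with the standard Benchmark module - the %url sample, the key_of helper and the sub bodies below are illustrative stand-ins, not the original code:)

      use strict;
      use warnings;
      use Benchmark qw( cmpthese );

      # Illustrative stand-ins only; not the benchmark code referred to above.
      my %url = map { $_ => 1 } (
          'http://news.bbc.co.uk/a.htm',
          'http://www.google.com/',
          'http://apache.org/docs',
      );

      sub key_of {    # shared sortkey generation, so both subs produce the same order
          my ($u) = @_;
          my ($host) = $u =~ m[\w+://([^/]+)];
          return lc join '.', reverse split /\./, $host;
      }

      sub st {        # Schwartzian transform, printing like the subs described above
          print STDERR $_, "\n\n"
              foreach map  { $_->[0] }
                      sort { $a->[1] cmp $b->[1] || $a->[0] cmp $b->[0] }
                      map  { [ $_, key_of($_) ] } keys %url;
      }

      sub grt {       # Guttman-Rosler transform
          print STDERR $_, "\n\n"
              foreach map  { substr( $_, 1 + index( $_, $; ) ) }
                      sort
                      map  { key_of($_) . $; . $_ } keys %url;
      }

      # redirect STDERR (e.g. 2>/dev/null) when timing, since the printing
      # itself affects the numbers, as noted above
      cmpthese( -5, { ST => \&st, GRT => \&grt } );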

        A couple of things. I didn't use the GRT in my original post for speed (for a change:), but rather simplicity. In your version you use a regex to extract the domain name and protocol, but then revert to substr/index to get the path info. This can be done using the same regex. You can also greatly simplify the logic for doing your (slightly peculiar:) re-ordering of the parts of the domain name by using the power of slices to achieve your desired ordering.

        The simplification yields some performance benefit and matches that of your substr/index version of the GRT, but you can probably implement the same changes to that version and re-establish its lead.

        sub guttman_rosler_uk
        {
          my @host;

          print STDERR +( $_ , "\n\n")
            foreach
            map { substr( $_ , 1+index($_ , $;) ) }
            sort
            map
            {
              m[(?:(^[^:]+)://)?([^/]+)(.*$)];
              @host = (reverse( split '\.' , $2), $3, $1);
              lc( join '_', @host[1,0, 2..$#host] ) . $; . $_ ;
            } keys %hist;

          return;
        }

        However, a more critical observation is that your method of ordering the sub-components of the domain name could be considered broken.

        Using it, all .co.uk will get sorted together, but these will be a long way away from .ac.uk and .gov.uk etc. Many non-US countries have categorisations within the country-specific TLD. E.g. com.au in Australia etc. In fact I think almost every country except the US does this to some degree or another.

        This was the reason I suggested that you completely reverse the order of the domain name components. That way all .uk are grouped, and within that, all the .ac, .co, .gov etc. This is why this technique is used for Java .jar naming conventions. Tim Berners-Lee is also on record as saying that if there was one change that he wishes he could make to his definition for the WWW, it is that he wishes he had ordered the domain names in the reverse order.

        The only advantage that I can see with your schema is that ibm.com will get sorted close to ibm.jp and ibm.de, but this will still screw up when you get to ibm.co.uk or ibm.com.au.

        This, I think, is why IBM (and others) have fairly recently moved away from using the country-specific domain names (except to redirect to their main TLD) in favour of ibm.com/uk and ibm.com/jp etc.

        Of course, you know best as to your particular needs, but I thought that I would mention it anyway.
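        (A quick way to see the difference between the two orderings, using a handful of made-up hosts; the snippet below is purely illustrative:)

        use strict;
        use warnings;

        # Illustrative only: full reversal (this node) vs. domain-first ordering (parent node).
        my @hosts = qw( www.ibm.com www.ibm.de www.ibm.co.uk www.ibm.com.au www.bbc.co.uk );

        for my $h (@hosts) {
            my @p       = reverse split /\./, $h;
            my $full    = join '.', @p;                    # e.g. uk.co.ibm.www
            my $partial = join '.', @p[ 1, 0, 2 .. $#p ];  # e.g. co.uk.ibm.www
            printf "%-16s  %-18s  %s\n", $h, $full, $partial;
        }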


        Examine what is said, not who speaks.
        1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
        2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
        3) Any sufficiently advanced technology is indistinguishable from magic.
        Arthur C. Clarke.

        Below is the result of a benchmark in which none of the three transform subs prints (to STDERR); instead they just return an array reference to the whole thing (sub transform { return [ map {} sort {} map {} (list) ]; }). A concrete sketch of that shape follows the results below.

        URL Length Statistics:
          Total URLs: 1497  Lengths' Sum: 216277
          Min: 13  Max: 1270
          Mean: 144.474  Median: 62.000
          Std. Deviation: 234.977  Variance: 55214.052
        
        Benchmark: running GRT, GRT UK, ST for at least 5 CPU seconds...
               GRT:  5 wallclock secs ( 5.05 usr +  0.01 sys =  5.06 CPU) @ 10.07/s (n=51)
            GRT UK:  5 wallclock secs ( 5.08 usr +  0.00 sys =  5.08 CPU) @  8.47/s (n=43)
                ST:  5 wallclock secs ( 5.27 usr +  0.00 sys =  5.27 CPU) @  7.79/s (n=41)
                 Rate     ST GRT UK    GRT
        ST     7.79/s     --    -8%   -23%
        GRT UK 8.47/s     9%     --   -16%
        GRT    10.1/s    29%    19%     --
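        (For reference, a minimal sketch of the return-an-arrayref shape described above; the sub name and the sample URLs are placeholders of mine:)

        use strict;
        use warnings;

        # Sketch only: the GRT sub recast to return an array reference
        # instead of printing.
        sub grt_arrayref {
            my @urls = @_;
            return [
                map  { substr( $_, 1 + index( $_, $; ) ) }
                sort
                map  { m[\w+://([^/]+)]; join( '.', reverse split /\./, $1 ) . $; . $_ }
                @urls
            ];
        }

        my $sorted = grt_arrayref( 'http://www.google.com/', 'http://news.bbc.co.uk/x.htm' );
        print "$_\n" for @$sorted;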
        
Re: Sorting URLs on domain/host: sortkeys generation
by tachyon (Chancellor) on Mar 30, 2003 at 06:58 UTC

    All you need is a basic cmp sort (for domain name based URLs):

    print for sort <DATA>;

    __DATA__
    http://google.com
    http://google.com/groups
    http://google.com/groups/deeper
    http://msn.com
    http://msn.com/groups
    http://msn.com/groups/deeper
    http://apache.org
    http://apache.org/docs
    http://apache.org/docs/mod_perl

    Which gives:

    http://apache.org
    http://apache.org/docs
    http://apache.org/docs/mod_perl
    http://google.com
    http://google.com/groups
    http://google.com/groups/deeper
    http://msn.com
    http://msn.com/groups
    http://msn.com/groups/deeper

    For numerical addresses you need to sort on 1) the integer representation of the 4-byte value that corresponds to the IP address, then 2) the rest of the URL (if any). This is a little more complex and uses a Schwartzian transform for efficiency. I have assumed dot quads - if you have to deal with other stuff like "127.1" and all the other types of valid IPs, use Socket; my ($ip) = unpack "N", inet_aton($1). This will probably be a little slower than the raw unpack/pack/split presented.

    my @data = qw(
        http://3.3.3.3/docs/mod_perl
        http://3.3.3.3/docs
        http://3.3.3.3
        http://2.2.2.2
        http://10.1.1.1
        http://11.1.1.1
        http://2.2.2.2/groups
        http://2.2.2.2/groups/deeper
        http://1.1.1.1/groups/deeper
        http://1.1.1.1/groups
        http://1.1.1.1
        http://1.1.1.2
        http://1.1.2.1
    );

    #use Socket;

    @sorted = map  { $_->[0] }
              sort { $a->[1] <=> $b->[1] || $a->[2] cmp $b->[2] }
              map  { munge_url($_) } @data;

    print "$_\n" for @sorted;

    sub munge_url {
        my $addr = $_[0];
        $addr =~ m!^(?:\w+://)?([^/]+)/?(.*)$!;
        # convert dot quad to a sortable integer
        my ($ip) = unpack 'N', pack 'C4', split '\.', $1;   # or unpack 'N', inet_aton($1);
        my $rest = $2 || '';
        print "$ip $rest\n";
        return [ $_, $ip, $rest ]
    }

    __DATA__
    http://1.1.1.1
    http://1.1.1.1/groups
    http://1.1.1.1/groups/deeper
    http://1.1.1.2
    http://1.1.2.1
    http://2.2.2.2
    http://2.2.2.2/groups
    http://2.2.2.2/groups/deeper
    http://3.3.3.3
    http://3.3.3.3/docs
    http://3.3.3.3/docs/mod_perl
    http://10.1.1.1
    http://11.1.1.1

    There is no logical relation between fqdns and dot quad IPs (sort wise) until you resolve the IPs to fqdns.
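    (A sketch of what resolving first might look like; the host_key helper is an assumption of mine, not code from this node. It falls back to the raw quad when there is no PTR record:)

    use strict;
    use warnings;
    use Socket qw( inet_aton AF_INET );

    # Resolve a dotted quad back to a name (where reverse DNS exists) before
    # building a sortkey, so IP-based URLs can join the name-based ordering.
    sub host_key {
        my ($host) = @_;
        if ( $host =~ /^\d+(?:\.\d+){3}$/ ) {
            my $name = gethostbyaddr( inet_aton($host), AF_INET );
            $host = $name if defined $name;
        }
        return join '.', reverse split /\./, $host;
    }

    print host_key('www.google.com'), "\n";    # com.google.www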

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Yes, i would have been satisfied by a simple sort if the domain/host names were of the type scheme://word1.word2(/blah)?. In reality, there are often more than two words, say www.google.com or www.pacareerlink.state.pa.us. Thus the reason for the non-trivial sorting.

      Thanks for the tip for sorting by IP addresses though.

        Actually it does not matter that there are \w\.\w... sequences; as '.' sorts before \w you get the desired result. The http:// is also immaterial provided all entries either have it (or all don't). cmp sorting does not stop at the first non-word character - it simply sorts in ASCII order.

        print "$_\n" for sort qw (
            http://.
            http://www.google.com
            http://www.google.co.uk
            http://au.google.com
            http://au.goo.com
            http://au.goop.com
        );

        __DATA__
        http://.
        http://au.goo.com
        http://au.google.com
        http://au.goop.com
        http://www.google.co.uk
        http://www.google.com

        This looks appropriately sorted to me. The IP code will get you the domain (or IP) in $1 regardless, so you can easily modify it, but as this shows you don't really need to unless you want to trim off the ftp:// http:// https:// part and thus lump these in one group. The only other modification you can do to the domain name is to chop the www. off (trying to guess other subdomains is a hopeless task); a sketch of both trims follows below. Otherwise the default cmp should work fine. Perhaps you could post an example of where it is not?
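        (A sketch of the two trims just mentioned; the trimmed helper and the sample list are illustrative only, not code from this node:)

        use strict;
        use warnings;

        # Drop the scheme and a leading "www." before comparing, so
        # http/https/ftp variants of the same host sort together.
        my @urls = qw(
            https://www.google.com/a
            http://google.com/b
            ftp://ftp.cpan.org/pub
            http://www.google.co.uk/
        );

        sub trimmed {
            my ($u) = @_;
            $u =~ s{^\w+://}{};     # trim the scheme
            $u =~ s{^www\.}{};      # trim a leading www.
            return $u;
        }

        print "$_\n" for sort { trimmed($a) cmp trimmed($b) } @urls;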

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
