http://qs321.pair.com?node_id=289360

All: Does anyone see a need for a module that will allow you to retrieve hash keys in a pre-defined sorted order? This functionality can be used with any user defined sort routine, can be changed on the fly, works on multi-dimensional hashes, and DWYM if you decide to add key/values later.

There are two modules already that do similar things:

  • Tie::IxHash
  • Tie::SortHash

    The problem with Tie::IxHash is that it only preserves insertion order as well as provide rudimentary ability and getting keys/values in sorted order.

    The problems with Tie::SortHash are:

  • It uses pseudo hashes
  • It uses eval string to accomplish the arbitrary sort
  • It loops (n2 + n) / 2 times through the hash for keys, n = # of keys
  • It uses its own form of garbage collection
  • The test suite is not exhaustive
  • The calling syntax is not flexible, making expansion extremely difficult
  • Originally, I thought I would just contact the author with suggestions on how to correct these deficiencies or offer to take over maintenance. Almost a month has gone by with no response. I originally posted this as a proposed inplace upgrade for the module. Upon revisiting it yesterday, I have decided that if I do anything, it will be to upload a brand new module. It is too hard to maintain backwards compatability and have the module robust.

    The question I have is - why, what's the point? I am not aware that there is a big need for this module. I would certainly not want to put more code up on CPAN without a good reason.

    The new module that I have worked on is below, though it still unpolished, requires the POD to be finished, and needs an exhaustive test suite.

    #!/usr/bin/perl -w package Tie::SortedHash; use strict; use Carp; use constant HASH => 0; use constant LOOKUP => 1; use constant ARRAY => 2; use constant SORT => 3; use constant CHANGED => 4; use constant OPT => 5; our $VERSION = '1.00'; sub TIEHASH { my $class = shift; croak "Incorrect number of parameters" if @_ % 2; my %options = @_; my $self = bless [], $class; $self->_Build(\%options); return $self; } sub FETCH { my($self, $key) = @_; $self->[HASH]{$key}; } sub STORE { my($self, $key, $value) = @_; $self->[HASH]{$key} = $value; $self->[CHANGED] = 1; } sub EXISTS { my($self, $key) = @_; exists $self->[HASH]{$key}; } sub DELETE { my($self, $key) = @_; delete $self->[HASH]{$key}; if ($self->[OPT] == 2 && exists $self->[LOOKUP]{$key}) { splice(@{$self->[ARRAY]}, $self->[LOOKUP]{$key}, 1); delete $self->[LOOKUP]{$key}; } } sub FIRSTKEY { my $self = shift; $self->_ReOrder if $self->[OPT] == 1 || ($self->[OPT] == 2 && $sel +f->[CHANGED]); $self->_Iterate; } sub NEXTKEY { my ($self, $lastkey) = @_; $self->_Iterate($lastkey); } sub CLEAR { my $self = shift; $self->[HASH] = {}; $self->[CHANGED] = 1; } sub DESTROY { } sub _Build { my ($self, $opt) = @_; my $sort = exists $opt->{SORT} ? $opt->{SORT} : sub { my $hash = shift; sort {$a cmp $b || $a <=> $b} keys %$hash; }; $self->sortroutine($sort); my $hash = exists $opt->{HASH} ? $opt->{HASH} : {}; croak "$hash is not a hash ref" if ref $hash ne 'HASH'; @{$self->[HASH]}{keys %$hash} = values %$hash; if (exists $opt->{OPTLEVEL}) { croak "$opt->{OPTLEVEL} is not valid optimization level" if $o +pt->{OPTLEVEL} !~ /^[123]$/; $self->[OPT] = $opt->{OPTLEVEL}; } else { $self->[OPT] = 1; } } sub _ReOrder { my $self = shift; $self->[LOOKUP] = (); $self->[ARRAY] = (); my $index = 0; for my $key ($self->[SORT]($self->[HASH])) { $self->[LOOKUP]{$key} = $index; $self->[ARRAY][$index] = $key; $index++; } $self->[CHANGED] = 0; } sub _Iterate { my ($self, $lastkey) = @_; if ($self->[OPT] != 3) { my $index = defined $lastkey ? $self->[LOOKUP]{$lastkey} : -1; $index++; return $self->[ARRAY][$index]; } else { my $match; for my $key ($self->[SORT]($self->[HASH])) { + return $key if $match || ! defined $lastkey; $match = 1 if $key eq $lastkey; } } return undef; } sub sortroutine { my($self, $sort) = @_; croak "$sort is not a code ref" if ref $sort ne 'CODE'; $self->[SORT] = $sort; $self->[CHANGED] = 1; } 1; __END__ =head1 NAME Tie::HashSort - Perl module to get hash keys in a sorted order =head1 SYNOPSIS use Tie::HashSort; my %hash = ( 'John' => 33, 'Jacob' => 29, 'Jingle' => 15, 'Heimer' => 48, 'Smitz' => 12, ); my $sort = sub { my $hash = shift; sort {$hash->{$b} <=> $hash->{$a}} keys %$hash; }; tie my %sorted_hash, 'Tie::SortedHash', 'HASH' => \%hash, 'SORT' = +> $sort, 'OPTLEVEL' => 2; for my $name ( keys %sorted_hash ) { print "$name is $hash{$name} ears old.\n"; } ### OUTPUT ### Heimer is 48 ears old. John is 33 ears old. Jacob is 29 ears old. Jingle is 15 ears old. Smitz is 12 ears old. =head1 DESCRIPTION This module is a designed to retrieve hash keys in a pre-defined sorte +d order. It is often frustrating to have a hash return elements in a seemingly +random order when using C<keys()>, C<values()> or C<each()>. =head2 Tie In order to C<tie()> your hash to C<Tie::SortedHash>, use the followin +g syntax: tie HASH, 'Tie::SortedHash', OPTIONS; =cut
    I believe I have added functionality and avoided all the problems in Tie::SortHash. Here is my philosophy on the 3 optimization levels:
  • Level 1: Trade lots of memory for speed. Create an array with the hash keys in sorted order each time FIRSTKEY is called. NEXTKEY is then a simple matter of getting the index of the lastkey and adding 1. This is accomplished by having a hash lookup table.
  • Level 2: Identical to level 1 only faster. The array with sorted keys and hash lookup table are only recreated when a new key is added, a value is changed, or the sort routine is changed. Deleting a key doesn't require a rebuild because the element is deleted from the array and the lookup hash. This optimization only works on 1 dimensional hashes since it is impossible to detect values being changed below the root level. Depending on the sort routine, this may affect the order.
  • Level 3: No optimization at all. It is slower, but consumes no additional memory - (n2 + n) /2.
  • So what do you think - should I work on polishing the code, finishing the POD, building the test suite, and uploading it to CPAN or not?

    Cheers - L~R

    Replies are listed 'Best First'.
    Re: Any Point in Uploading Tie::SortedHash
    by Abigail-II (Bishop) on Sep 05, 2003 at 22:20 UTC
      Level 2: Identical to level 1 only faster. The array with sorted keys and hash lookup table are only recreated when a new key is added, a value is changed, or the sort routine is changed.

      That's not necessarely faster. It depends on how often you are inserting and how often you are asking for the keys. What you could do is some form of hybrid:

      • Create an array @sorted, and a variable $clean, initially set to 0.
      • If FIRSTKEY is called, check the value of $clean. If it's true, @sorted contains the sorted keys. If false, fill @sorted with the sorted keys of the hash, and set $clean to 1.
      • If you insert a key in the hash, or change the sort order, or delete something from the hash, set $clean to 0.
      • This might cause some bizarre results if you modify a hash while iterating over it, but that may happen with normal hashes too.

      Deleting a key doesn't require a rebuild because the element is deleted from the array and the lookup hash.

      Are you aware that deleting an element from an array can take time linear in the size of the array?

      Abigail

        Abigail,
        My verbiage was incorrect. The code is a hybrid as you describe. The rebuild does not happen when a key is deleted, value is changed, etc. Only a flag is set that is checked when FIRSTKEY is called. A rebuild only happens under those circumstances.

        In reference to deleting an array element taking time in linear proportion to the size of the array, no I was not aware of this. The two alternatives I can think of are:

      • Set the CHANGED flag and rebuild if FIRSTKEY is ever called again
      • Use more memory and delay the delete(s) until FIRSTKEY is called again. If in the interim the hash is modified, the rebuild would occur and the delete(s) would be forgotten. If FIRSTKEY was never called again, they would also be forgotten (albeit using some memory).

        Do you see a clear way to go?

        Cheers - L~R

    Re: Any Point in Uploading Tie::SortedHash
    by samtregar (Abbot) on Sep 05, 2003 at 21:58 UTC
      I say go for it. It sounds like your module has some significant advantages over the existing modules. The worst that can happen is that no one will use it.

      I think one of CPAN's most underappreciated strengths is the diversity of solutions to common problems. Having more than one module to solve problem X is actually a good thing if problem X has more than one reasonable solution. It allows users the ability to choose the best solution given their requirements.

      -sam

        your module has some significant advantages over the existing modules

        Make sure you (L~R) say why, in the documentation (the README and POD). Also be frank and point out when and why using the other modules could be a better Way To Do It. Such information is without price when evaluating different modules that appear to do the same thing.

        And if the author of another module chooses to correct you (because you misunderstood some of its finer points), you can always update the documentation in a later version.

          grinder,
          I understand that importance. There is a reference to the POD to look at the README for comparison to similar modules. Here is an advanced copy of the README.

          Cheers - L~R

    Re: Any Point in Uploading Tie::SortedHash
    by Limbic~Region (Chancellor) on Sep 06, 2003 at 02:35 UTC
      All:
      After Abigail's response, I realized that of my 3 optimization methods, only 1 of them worked correctly according to the docs for each. perldoc -f each says that it is safe to delete the last key visited while iterating over the keys in an each loop.

      It was easy to go from 1/3 to 2/3 working by simply changing the "optimization" for delete. This only left the exponential loop to deal with. The reason for this optimization was to save memory (an extra array and hash).

      Since the foreach loop constructs a list the same size as the array anyway, the only real saving of this method was not using the lookup hash. I figured out a way to:

    • Drop the (n2 + n)/2 loop by avoiding the lookup hash
    • Make the code cleaner
    • Function properly (I hope)
    • Be more memory/speed efficient

      Cheers - L~R

    Re: Any Point in Uploading Tie::SortedHash
    by liz (Monsignor) on Sep 05, 2003 at 22:06 UTC
      Until pretty recently I would have been rather strongly against putting yet another module on CPAN that does something that is almost the same as something else on CPAN. Solely because it will make it more and more difficult to choose the "right" module for your application from CPAN.

      But I have been guilty of putting modules similar to other modules on CPAN multiple times already.

      I think with the rating system on search.cpan.org, people will be able to get a better handle on the quality of a specific module, which will thus make it easier to make a choice. And thus reduce my only objection.

      So I would also say, go for it!

      Liz

    Re: Any Point in Uploading Tie::SortedHash
    by Anonymous Monk on Sep 06, 2003 at 16:40 UTC

      Did you know that you can tie to an in-memory BTREE in DB_File to have sorted hashes?

        Anonymous Monk,
        No, I didn't know that. Being a novice when it comes to databases, it sounds complicated. Has anyone written a module that utilizes this for having sorted hashes? I would be interested in knowing if this technique was capable of supporting all the features in my module. I am always interested in learning alternate/better ways of doing things.

        Cheers - L~R

          It isn't complicated at all, and you do not need any modules beyond DB_File itself (which is part of the standard distro).

          Tieing to a DB_File BTREE is no harder than tie'ing to anything else --- to keep it in memory (instead of in a file) just use undef in place of the filename argument. You can easily supply custom sort routines (default is lexical sorting).

          use DB_File; $DB_BTREE->{compare} = sub { $_[1] cmp $_[0] }; tie my %h, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE or die "DB_File tie failed: $!"; %h = 'a' .. 'z'; print "$_ : $h{$_}\n" for keys %h;
    Re: Any Point in Uploading Tie::SortedHash
    by demerphq (Chancellor) on Sep 07, 2003 at 10:52 UTC

      L~R I think you should post this to CPAN. You seem to think it has benefits, so why not? Just make sure that it includes a comprehensive test suite (I recommend Test::More) and package it up properly and send it in. I think youll enjoy the experience so you should do it for sure.

      However I do have one nit. I dont like the name. I think it should be something like Tie::Hash::Sorted or something. It really irks me how flat the Tie:: namespace is. I would much prefer to say "i /Tie::Hash/" and get back all the modules that tie hashes in an interesting way.


      ---
      demerphq

      <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...