RFC on Inline::C hack: Hash

Dear Perl's-internals-hacking monks,

I've implemented a Hash_Iterator class, as replacement for CORE::each. The reasons for this are 1) to enable multiple independent lexically-scoped iterators on the same hash; 2) to be able to apply to such iterators various functions that take iterators as arguments (one aspect of this being to extend the hash iterator's interface to cover the four iterator functions (start, exhausted, next, and value) described by Dominus in HOP); 3) as a beginner's exercise with Inline::C; and, last but not least, 4) to increase my Perl Crackpot Index.

One could have implemented this class by simply using keys to generate an array of all the hashkeys, and have the iterator object keep track of a position in this array. Such implementation, however, duplicates the storage that already exists for the hash. One of the features that makes iterators in general attractive is precisely that they can take up significantly less memory than the corresponding array. Iterators can be composed into more complex iterative structures, but such schemes can quickly become prohibitive unless the iterators are sufficiently lightweight. Therefore an array-based implementation of hash iterators would not be very powerful.

Be that as it may, it's done and I would very much appreciate your comments (I give the code below).

Given that I have not found this sort of iterator already, the main concern I have is that, despite appearances (i.e. the seemingly functioning module presented below) the problem may be much harder than I realize. Therefore, my top question is: is what I'm trying to do feasible in Perl (without going to heroic lengths)? Or are there fundamental reasons for why anything like this would fail in Perl5?

The basic idea of my implementation is that the iterator object is also a hash, except that its internal bucket array pointer points to the bucket array of the hash that's being iterated over. The underlying C struct has its own independent iteration-oriented fields (e.g. riter and eiter) that are unaffected by those of other iterator instances on the same hash, or even by using CORE::each on this hash.

For my implementation I cannibalized sizeable chunks of hv.c and the pp_each section of pp.c. Unfortunately my understanding of the code that I hacked is less than complete. Therefore, even though my implementation passes the toughest test suite I could come up with, I still feel rather uncertain about the soundness of the C code, and I would particularly welcome your comments and criticisms of it.

I realize that this is a fair bit of code to read. I hope that some of you may be willing to take a look, or perhaps advise me on how to get knowledgeable feedback on a longish piece of code like this.

Many thanks in advance.

SYNOPSIS

  use Hash_Iterator;

  my $it = Hash_Iterator->new( \%some_hash );

  while ( defined( my $k = $it->each ) ) {
    # do something with $k
  }

  # $it->each restarts after hitting the end

  while ( my ( $k, $v ) = $it->each ) {
    # one more iteration
  }

  # alternatively, $it->start resets the iterator

  for ( $it->start; !$it->exhausted; $it->next ) {
    my $k = $it->value;
    # do something with $k
  }

  # here's another way to traverse the hash
  $it->start;
  until ( $it->exhausted ) {
    my ( $k, $v ) = $it->next;
    # do something with ( $k, $v )
  }


DESCRIPTION

The API is as follows:

new( $hashref )    The constructor, obviously.  The argument is
                   mandatory, and must be a hashref.

$it->each          Should behave exactly like CORE::each.

$it->start         Resets the iterator.  (It also returns the first
                   value.  This is as described in HOP.)

$it->value         Returns the current value, according to the
                   same rules as CORE::each (i.e. both key and
                   value are returned in list context); does not
                   advance the iterator.

$it->next          Advances the iterator.  (It also returns
                   the results of $it->value *before* advancing
                   the iterator, as described in HOP.)

$it->nextval       An alias for $it->next().

$it->exhausted     Returns true iff the iterator is exhausted.

NB: The only difference between the next() (or nextval()) and
each() methods is that the latter (but not the former) 
automatically resets the iterator after hitting the end once.
[download]

use strict;
use warnings;

package Hash_Iterator;

use Carp ();
BEGIN {
  *_croak   = \&Carp::croak;
}

my %Hashes;

*nextval = \&next;

!! 'keep require happy';

sub new {
  my $class = shift;
  _croak "Bad arguments to constructor"
    unless @_ == 1 and ref $_[ 0 ] eq 'HASH';

  my $hash = shift;
  my $self = bless +{}, $class;
  $Hashes{ $self } = $hash;
  $self->start;
  return $self;
}

sub start {
  my $self = shift;
  _start_iter( $self, $Hashes{ $self } );
  return $self->value;
}

sub next {
  my $self = shift;
  if ( wantarray ) {
    my @ret = $self->value;
    $self->_advance;
    return @ret;
  }
  else {
    my $ret = $self->value;
    $self->_advance;
    return $ret;
  }
}

sub each {
  my $self = shift;
  if ( wantarray ) {
    my @ret = $self->value;
    $self->exhausted ? $self->start : $self->_advance;
    return @ret;
  }
  else {
    my $ret = $self->value;
    $self->exhausted ? $self->start : $self->_advance;
    return $ret;
  }
}

sub DESTROY {
  my $self = shift;
  delete $Hashes{ $self };
  _DESTROY_iter( $self );
  return;
}

sub _advance {
  my $self = shift;
  _synch_and_advance( $self, $Hashes{ $self } )
    unless $self->exhausted;
  return;
}

use Inline C => <<EOC;

void _start_iter( HV *hv, HV *ohv ) {
  register XPVHV* xhv;

  if ( SvMAGICAL( ( SV * ) ohv ) )
    Perl_croak( "Not implemented for tied/magical hashes." );

  xhv = ( XPVHV* ) SvANY ( hv );
  xhv->xhv_riter = -1;
  xhv->xhv_eiter = NULL;

  _synch_and_advance( hv, ohv );

  return;
}

I32 exhausted( HV *hv ) {
  register XPVHV* xhv;
  if ( !hv ) Perl_croak( "Bad hash" );
  xhv = ( XPVHV* ) SvANY ( hv );
  return xhv->xhv_riter < 0;
}

void _synch_and_advance( HV *hv, HV *ohv ) {
  register XPVHV* xhv;
  register XPVHV* oxhv;

  if ( !hv || !ohv ) Perl_croak( "Bad hash" );
  xhv  = ( XPVHV* ) SvANY ( hv );
  oxhv = ( XPVHV* ) SvANY ( ohv );

  __synch( xhv, oxhv );
  __advance( xhv );
  return;
}

void __advance( register XPVHV* xhv ) {
  register HE *entry;

  if ( !xhv->xhv_array ) {
    xhv->xhv_riter = -1;
    xhv->xhv_eiter = NULL;
    return;
  }

  entry = xhv->xhv_eiter;

  /* At start of hash, entry is NULL.  */
  if ( entry ) {
    entry = HeNEXT( entry );
    while ( entry && HeVAL( entry ) == &PL_sv_placeholder )
      entry = HeNEXT( entry );
  }

  while ( !entry ) {
    xhv->xhv_riter++;
    if ( xhv->xhv_riter > ( I32 ) xhv->xhv_max ) {
      xhv->xhv_riter = -1;
      break;
    }
    entry = ( ( HE** ) xhv->xhv_array )[ xhv->xhv_riter ];

    while ( entry && HeVAL( entry ) == &PL_sv_placeholder )
      entry = HeNEXT( entry );
  }

  xhv->xhv_eiter = entry;
  return;
}

void value( HV *hv ) {
  Inline_Stack_Vars;
  register XPVHV* xhv;
  HE *entry;
  I32 gimme = GIMME_V;
  I32 n_items = 1;

  xhv = ( XPVHV* ) SvANY( hv );
  entry = xhv->xhv_eiter;

  if ( gimme == G_VOID || !entry && gimme == G_ARRAY )
    Inline_Stack_Void;          /* returns (nothing) */

  Inline_Stack_Reset;
  if ( entry ) {
    Inline_Stack_Push( hv_iterkeysv( entry ) );

    if(gimme == G_ARRAY) {
      Inline_Stack_Push( hv_iterval( hv, entry ) );
      ++n_items;
    }
  }
  else {
    /* scalar context, no entry */
    Inline_Stack_Push( sv_2mortal( &PL_sv_undef ) );
  }
  Inline_Stack_Done;
  Inline_Stack_Return(n_items);
}

void _DESTROY_iter( HV *hv ) {
  register XPVHV* xhv;
  if ( !hv ) return;
  xhv = ( XPVHV* ) SvANY ( hv );

  if ( xhv->xhv_name ) {
    if ( PL_stashcache )
      hv_delete( PL_stashcache, xhv->xhv_name, strlen( xhv->xhv_name )
+,
                 G_DISCARD );
    Safefree( xhv->xhv_name );
    xhv->xhv_name = 0;
  }
  xhv->xhv_max          = 7;    /* (it's a normal hash) */
  xhv->xhv_array        = 0;
  xhv->xhv_placeholders = 0;
}

void __synch( register XPVHV* xhv, register XPVHV* oxhv ) {
  xhv->xhv_max   = oxhv->xhv_max;
  xhv->xhv_fill  = oxhv->xhv_fill;
  xhv->xhv_keys  = oxhv->xhv_keys;
  xhv->xhv_array = ( HE ** ) oxhv->xhv_array;
}

EOC

__END__
[download]

Update: Minor, mostly cosmetic, edits to the code.

the lowliest monk

Comment on RFC on Inline::C hack: Hash_Iterator Select or Download Code

Replies are listed 'Best First'.
Re: RFC on Inline::C hack: Hash_Iterator by esskar (Deacon) on Aug 01, 2005 at 12:27 UTC
nice job; is it possible to add a "prev" method? and where can I find the documentation to HeNEXT ?	[reply]
Re^2: RFC on Inline::C hack: Hash_Iterator by tlm (Prior) on Aug 01, 2005 at 12:55 UTC
...and where can I find the documentation to HeNEXT ? `HeNEXT` is a macro (surprise!), defined in hv.h as `#define HeNEXT(he) (he)->hent_next` [download] where `hent_next` is a field in the following struct (also from hv.h): `typedef struct he HE; /* ... / / entry in hash value chain / struct he { HE hent_next; /* next entry in chain / HEK hent_hek; /* hash key / SV hent_val; /* scalar value that was hashed */ };` [download] So basically `HeNEXT` scoots along a linked list of hash entries (those belonging to a given bucket, to be precise). As you can see, the data structure only provides for travel in one direction. Therefore, I don't see an easy way to implement a `prev` method. If I had to have it, I think I'd just accept the memory hit and use the `keys` array-based implementation I described briefly in my OP. Then `prev` reduces to decrementing a cursor variable. the lowliest monk	[reply] [d/l] [select]
Re^3: RFC on Inline::C hack: Hash_Iterator by Tanalis (Curate) on Aug 02, 2005 at 07:19 UTC
So basically HeNEXT scoots along a linked list of hash entries (those belonging to a given bucket, to be precise). Well.. you could always make use of a doubly-linked list to get around this. For the sake of the storage of a second pointer in your structure, this would seem a reasonable trade-off to give a more flexible iteration method. Just on a scalability note, and to raise awareness of a possible stumbling block if this code were to be used as anything other than an iterator, I'm always wary of a hash bucket containing a linked list. In this example, there's little to no issue - it's designed to be an iterator, and hence the list will always be traversed from start to end. For (most) implementations, however, where more random access of hashed elements is required, an iterative method is really quite inefficient, requiring up to time `const + N` to look up any entry. In these cases, it's often better to use something akin to a binary tree, offering at worst time `const + log N` to find any entry. For true flexibility, a linked list threaded through binary tree has been my structure of choice for awhile - offering a sane method of iterating through the structure while retaining a reasonable random-element-access time. Nice work - ++tlm. -- Foxcub `#include www.liquidfusion.org.uk`	[reply] [d/l] [select]
Re^4: RFC on Inline::C hack: Hash_Iterator by demerphq (Chancellor) on Aug 02, 2005 at 11:03 UTC
Re^5: RFC on Inline::C hack: Hash_Iterator by Tanalis (Curate) on Aug 02, 2005 at 11:51 UTC
Some notes below your chosen depth have not been shown here
Re: RFC on Inline::C hack: Hash_Iterator by spectraphonic (Scribe) on Aug 01, 2005 at 11:57 UTC
nifty stuff.... bows (in a monklike fashion) i like the iterator concept for processing data structures. it really simplifies the code and improves its readability. independent iterators to the same structure seems handy too. kinda reminds me of C++ STL i aspire to produce ninja code like this	[reply]

RFC on Inline::C hack: Hash_Iterator