http://qs321.pair.com?node_id=1035959

davido has asked for the wisdom of the Perl Monks concerning the following question:

The POD for Encode::Encoding offers this snippet as an example of creating a custom encoding:

package Encode::ROT13; use strict; use base qw(Encode::Encoding); __PACKAGE__->Define('rot13'); sub encode($$;$){ my ($obj, $str, $chk) = @_; $str =~ tr/A-Za-z/N-ZA-Mn-za-m/; $_[1] = '' if $chk; # this is what in-place edit means return $str; } # Jr pna or ynml yvxr guvf; *decode = \&encode; 1;

Encode provides the following explanation for defining encodings:

To define a new encoding, use:

use Encode qw(define_encoding); define_encoding($object, CANONICAL_NAME [, alias...]);

CANONICAL_NAME will be associated with $object. The object should provide the interface described in Encode::Encoding. If more than two arguments are provided, additional arguments are considered aliases for $object.

See Encode::Encoding for details.

I seem to be failing to connect the dots. How would one instantiate $object, which in the case of the example code above, I think, should be an instance of the Encode::Rot13 class? What needs to happen before I can "binmode $fh, ':rot13';"?


Dave

Replies are listed 'Best First'.
Re: Creating (and using) a custom encoding. (SOLUTION)
by davido (Cardinal) on May 31, 2013 at 17:01 UTC

    It turns out that most of the confusion was due to File::Slurp RT#84918, submitted by our friend corion. "File-Slurp: read_file() ignores binmode option for short files". If only I had suspected File::Slurp earlier, I could have saved myself (and others) some time.

    Here's a complete working example. Note, you must binmode the filehandle with ":encoding(rot13)", not the more terse ":rot13" (which simply won't work). Also, there's no need to explicitly call define_encoding from Encode within the calling package; the line __PACKAGE__->Define('rot13'); does that for us.

    package Encode::ROT13; use strict; use warnings; use parent qw( Encode::Encoding ); sub encode($$;$){ my( $obj, $str, $chk ) = @_; $str =~ tr/A-Za-z/N-ZA-Mn-za-m/; $_[1] = '' if $chk; # $_[1] is aliased through the call. Inplace edi +t. # (Remove whole string unless there's an error.) return $str; } no warnings 'once'; *decode = \&encode; # Because rot13( rot13() ) is a round-trip. __PACKAGE__->Define( 'rot13' ); 1; package main; use strict; use warnings; binmode \*DATA, ':encoding(rot13)'; chomp( my @words = <DATA> ); print "$_\n" for @words; __DATA__ Apple cat dog strawberry watermelon

    ...and the output...

    Nccyr png qbt fgenjoreel jngrezryba

    ...now on to learn how to use the enc2xs tool.


    Dave

      In case you haven't gotten all the way yet with enc2xs, the only "hard" part is to build the appropriate "Unicode Character Map" (ucm) file to describe the relationship between Unicode and your specialized character encoding.

      In case it helps, you might want to look at Encode::Buckwalter, which includes a ucm file to define a specialized ASCII "alphabet" for transliterating Arabic characters. It's fairly simple, except that some character relations only work in one direction (e.g. when going from Unicode to "Buckwalter Transliteration", U+0030 and U+0660 will both map to ASCII "0", but when going from transliteration to Unicode, ASCII "0" will only map to U+0030, and likewise for other digits).

        Thanks. I appreciate the links.

        What first motivated this investigation was a quest for alternatives to automatically apply fold case (fc) to an incoming file. I'm well aware that this is a road less traveled. Certainly it violates "the principle of least surprise", and as such I wouldn't consider it for production code. But it's been an interesting investigation so far. :)


        Dave

Re: Creating (and using) a custom encoding.
by davido (Cardinal) on May 30, 2013 at 15:26 UTC

    I like Anonymous Monk's suggestion, as it facilitates "compiled" encodings. I'd still like to figure out the Encode::Encoding method though. Afterward I can probably post a tutorial on both methods.

    Here's a self-contained snippet that should be a framework from which to work on getting the Encode::Encoding method going. Unfortunately, it doesn't work (ie, the rot13 encoding doesn't take place) because I'm still unclear on how Encode's define_encoding() fits into the mix. If anyone can fill in the blank, or possibly point to the documentation that I seem to have missed, I would certainly appreciate it.

    package Encode::ROT13; use strict; use warnings; use parent qw( Encode::Encoding ); __PACKAGE__->Define( 'rot13' ); sub encode($$;$){ my( $obj, $str, $chk ) = @_; $str =~ tr/A-Za-z/N-ZA-Mn-za-m/; $_[1] = '' if $chk; # $_[1] is aliased through the call. Inplace edi +t. return $str; } no warnings 'once'; *Encode::ROT13::decode = \&encode; # Because rot13( rot13() ) is a rou +nd-trip. 1; package main; use strict; use warnings; use Encode 'define_encoding'; use File::Slurp; # define_encoding( $object, 'rot13' ); # Um....? my @words = read_file( \*DATA, chomp => 1, binmode => ':rot13' ); print "$_\n" for @words; __DATA__ Apple cat dog strawberry watermelon

    Dave

      OK. Now that I'm thoroughly confused:-):
      package Encode::ROT13; use strict; use warnings; use Encode; use parent qw( Encode::Encoding ); __PACKAGE__->Define( 'rot13' ); no warnings 'redefine'; sub encode($$;$){ my( $obj, $str, $chk ) = @_; $str =~ tr/A-Za-z/N-ZA-Mn-za-m/; $_[1] = '' if $chk; return $str; } *decode = \&encode; 1; package main; use strict; use warnings; use Encode 'define_encoding'; use File::Slurp; my $object = {}; bless($object, 'Encode::ROT13'); define_encoding( $object, 'rot13' ); my(@words) = read_file( \*DATA, chomp => 1, binmode => ':rot13' ); print "$_\n" for @words; my(@list) = Encode->encodings(); print "Checking for rot13: \n"; foreach my $list(@list) { print "\t$list\n";; } __DATA__ Apple cat dog strawberry watermelon

        I'm guessing that doesn't work for you either.

        A few nits (even though neither of us have it working yet): Encode doesn't need to be 'used' in package Encode::ROT13 (and eliminating it eliminates the "redefined" warnings). Also, blessing the object into package Encode::ROT13 may as well be written as a constructor in Encode::ROT13, and then called with the traditional Encode::ROT13->new syntax from the calling package (main) .... not that it matters -- it still doesn't work.

        Also, even if you completely eliminate 'use Encode qw/ define_encoding /; from package main, and eliminate the object construction as well as the call to define_encoding, 'rot13' still shows up in the encodings list. However, placing a "warn" inside of Encode::ROT13::encode shows that the function never gets called.

        What are we missing? :) There really seems to be a disconnect between the POD and reality. But I'm sure it works for someone (who knows how to do it right). Otherwise, the whole Encode::Encoding module is pretty useless.


        Dave

      Well , I guess it won't work. Substitute UTF-7 or encoding(UTF-7) for binmode and Encode::Unicode::UTF7 won't get loaded

      There is one test of Encode::Encoding in http://cpansearch.perl.org/src/DANKOGAI/Encode-2.51/t/Encoder.t and it doesn't use binmode/layers, its encode/decode functions

      Fudge always saves :) File::Slurp is hiding the critical warnings
      you should use :encoding(rot13)

      your start

      the fudge

      use Data::Dump ; dd\%Encode::Encoding; for my $enc ( qw/ :rot13 :ROT13 :encoding(rot13) :encoding(ROT13) :UTF-7 :encoding(UTF-7) :via(ROT13) :via(Encode::ROT13) / ) { fudge( $enc ); } dd\%Encode::Encoding; our $tell; sub fudge { my( $layer ) = @_; $tell ||= tell DATA; seek DATA, $tell, 0; print "## binmode DATA, $layer \n"; binmode DATA, $layer; dd [<DATA>]; } __DATA__ Apple cat dog strawberry watermelon

      output

Re: Creating (and using) a custom encoding.
by Anonymous Monk on May 30, 2013 at 05:02 UTC

    See enc2xs (and its output) and PerlIO::via/File::BOM source

    I vaguely recall having tried to make my own, but I can't find my notes :)

Re: Creating (and using) a custom encoding.
by derby (Abbot) on May 30, 2013 at 12:46 UTC

    I'm uncertain wether to be in awe of your custom encoding endeavors or run away in abject terror :-)

    -derby