Properly testing self-compiled character-encodings

yulivee07 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks, I am looking for a proper way to test whether various character encodings work as expected on my platform.

I have some special IBM code pages that I need to include for my platform (AIX 7.2). I built them according to the manual delivered with enc2xs: http://search.cpan.org/dist/Encode/bin/enc2xs

However, I do not trust them. I used the strings utility to check whether the produced binary file contains any characters:

$ strings CP1141.so
$

strings just returns nothing. Using another C-Compiler solved this problem and produced binary files that contain characters.

So now I want to create a test to see if the encoding is working properly. I have an older machine where all those character-encodings are working and installed, so I could generate files containing correct information there and copy them over to the new machine for testing.

I am not entirely sure about a good testing strategy. I thought of testing characters beyond the 128th character (as everything below would be identical, since it is ASCII). Does this seem reasonable?
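
A minimal sketch of such a check for single-byte code pages follows. The helper name roundtrip_failures is made up for illustration, and cp1047 (an EBCDIC code page that ships with Encode) only stands in for the self-built ones. Since EBCDIC code pages also differ from ASCII below 128, the sketch simply walks all 256 byte values rather than only the upper half. It is not meaningful for multi-byte or stateful encodings.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: returns the byte values (0..255) that do not
# survive decode -> encode in the given single-byte encoding.
sub roundtrip_failures {
    my ($encoding) = @_;
    my @failed;
    for my $byte ( 0 .. 255 ) {
        my $octets = chr $byte;
        # Default CHECK: unmappable input is substituted, not fatal.
        my $text = decode( $encoding, $octets );
        my $back = encode( $encoding, $text );
        push @failed, $byte unless $back eq $octets;
    }
    return @failed;
}

# cp1047 is a stand-in; substitute the name of the code page under test.
my @failed = roundtrip_failures('cp1047');
printf "%d of 256 bytes failed to round-trip\n", scalar @failed;
```

A byte that is listed here is either unmapped in the code page or maps to a character that encodes back to a different byte; both cases deserve a look at the UCM source.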

I looked into the Encode distribution in search for unit-tests for encodings but didn't find much. Are there best practices for cases like this?

Regards, Yulivee

Update: I wanted to share the solution I found and used for the problem. I still had a working AIX 6.1 machine with all encodings working, so I created this script to produce files encoded in the encodings I wanted to test:

encode_it.pl

#!/usr/bin/env perl 
  
use strict;
use warnings;
use utf8;
  
use Encode qw (:DEFAULT is_utf8);
use Encode::CP924;
# more encodings here, removed to save space

my %encodings = ( 
    CP924 => {
          name => "ibm-924_P100-1998",
          string => "\N{U+0000}\N{U+0001}\N{U+0002}[...]"
             },
    #more encodings here, removed to save space
);

foreach my $encoding ( sort keys %encodings ) {
    
    print "Current Encoding: $encoding - $encodings{$encoding}{'name'} \n";
    
    my $utf8_decode = $encodings{$encoding}{'string'};
  
    my $encoded_output;
    # filecontent is encoded from utf-8 to current encoding
    eval { $encoded_output = encode( $encodings{$encoding}{'name'}, $utf8_decode ); };
    if ( $@ ){
        print $@,"skipping encoding\n";
        next;
    }
  
    open ( my $fh_out, '>', $encoding ) or die "Could not open '$encoding' for writing: $!";
    binmode $fh_out;    # the encoded output is raw bytes
    print $fh_out $encoded_output;
    close $fh_out;
  
}

To generate the string in the hash for each encoding, I crawled the UCM file of the corresponding character encoding to collect all Unicode code points to include. The script takes a UCM file as input and prints all Unicode code points in the format \N{U+0001} to STDOUT:

generate_charmap_for_testing.pl

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
 
our %opt = (); 
{
    my %options = ( 
            'file=s' => \$opt{file},
                  );
    GetOptions(%options);
}    
 
 
exit 0 unless $opt{file};
 
my $filename = $opt{file};
open(my $fh, '<:encoding(UTF-8)', $filename) or die "Could not open file '$filename' $!";
     
print "string => \"";
while (my $row = <$fh>) {
      chomp $row;
      if ( $row =~ /\<U[\w\d]{4}\>.*/) {
          $row =~ s/\<U([\w\d]{4})\>.*/\\N\{U+$1\}/g;
          print $row;
      }   
}
print "\"\n";
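
UCM mapping lines end in a precision indicator: |0 marks a true round-trip mapping, while |1 and |3 mark one-way fallbacks. The script above collects every <UXXXX> entry regardless of that indicator. As a sketch only, the hypothetical helper parse_ucm_line below splits a mapping line into its parts so that fallback entries can be left out of the test string. The sample lines in the loop are invented for illustration, and whether fallback entries play any role in the failures reported further down is not established here.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one UCM mapping line such as
#   <U00C4> \x4A |0
# Returns undef for headers, comments and anything else.
sub parse_ucm_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^ <U ([0-9A-Fa-f]{4,6}) >       # Unicode code point
        \s+ ((?:\\x[0-9A-Fa-f]{2})+)    # one or more encoded bytes
        \s+ \| (\d)                     # precision indicator
    }x;
    my ( $hex, $bytes, $flag ) = ( $1, $2, $3 );
    return {
        codepoint => hex $hex,
        bytes     => join( '', map { chr( hex $_ ) } $bytes =~ /\\x([0-9A-Fa-f]{2})/g ),
        roundtrip => $flag == 0,
    };
}

# Invented sample lines: one round-trip entry, one fallback, one header.
for my $line ( '<U00C4> \x4A |0', '<U00E4> \xC0 |1', '<code_set_name> "example"' ) {
    my $entry = parse_ucm_line($line) or next;
    next unless $entry->{roundtrip};
    printf "\\N{U+%04X}", $entry->{codepoint};
}
print "\n";
```

Keeping the byte sequence next to each code point also makes it possible to build expected-bytes test cases directly from the UCM file.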

Then I transferred the encoded files to my new host. On the new host I created a script called decode_it.pl. It reads each file, decodes its content, encodes the result back to the original encoding, and decodes it once more. If the two decoded strings match, I count this as a successful test.

decode_it.pl

#!/usr/bin/env perl
use strict;                   
use warnings;                 
use utf8;                     
                              
use Encode qw (:DEFAULT is_utf8);
                              
                              
my %encodings = (
                 CP924       => "ibm-924_P100-1998",
                 # more encodings here
                );
                              
exit 0 unless @ARGV;            
                              
foreach my $enc_file ( @ARGV ) { 
    next if $enc_file eq "decode_it.pl"; 
    next if $enc_file eq "encode_it.pl"; 
    next if $enc_file eq "generate_charmap_for_testing.pl";
                              
    unless ( $encodings{$enc_file} ) { 
        print "No valid encoding definition for $enc_file\n";
        next;                 
    }                         
                              
    my $module = "Encode::".$enc_file;
    eval{                     
        (my $file = $module) =~ s|::|/|g;
        require $file.'.pm';  
        $module->import();    
        1;                    
    } or do {                 
        print "$module not found\n";
        next;                 
    };                        
                              
    open( my $fh_in, '<', $enc_file) or next;
                              
    my $filecontent = do{     
        local  $/  = undef;                 # input record separator undefined
        <$fh_in>              
    };                        
                              
    my $content;              
    eval{ $content = decode ( $encodings{$enc_file}, $filecontent ); };
                              
    if ( $@ ){                
        print $@,"skipping encoding\n";
        next;                 
    }                         
                              
    my $encoded_content = encode ( $encodings{$enc_file}, $content );
    my $decoded_content = decode ( $encodings{$enc_file}, $encoded_content );
                              
    if ( $decoded_content eq $content ) { 
        print "Encoding $enc_file is working properly\n";
    } else {                  
        print "Encoding $enc_file produces errors\n";
    }                         
                       
}
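
One caveat about the eval blocks above: with the default CHECK argument, decode and encode replace unmappable data with a substitution character instead of dying, so the eval mostly catches unknown encoding names. A minimal sketch of a stricter variant follows. strict_decode is a hypothetical helper, and 'ascii' is only a stand-in encoding for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode strictly and report the first problem.
# Returns (text, undef) on success or (undef, error message) on failure.
sub strict_decode {
    my ( $encoding, $octets ) = @_;
    my $text = eval {
        # FB_CROAK makes unmappable or malformed input fatal;
        # LEAVE_SRC keeps decode() from modifying $octets.
        decode( $encoding, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return defined $text ? ( $text, undef ) : ( undef, $@ );
}

# Stand-in data: byte 0xFF has no mapping in ASCII.
my ( $text, $error ) = strict_decode( 'ascii', "abc\xFF" );
print defined $text ? "decoded fine\n" : "decode failed: $error";
```

The error message names the offending byte sequence, which can be more useful for diagnosis than a plain pass/fail result.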

Final Output looks like this:

./decode_it.pl *
Encoding CP924 is working properly
Encoding Cp1025 is working properly
Encoding Cp1122 is working properly
Encoding Cp1140 is working properly
Encoding Cp1141 is working properly
Encoding Cp1142 is working properly
Encoding Cp1143 is working properly
Encoding Cp1144 is working properly
Encoding Cp1145 is working properly
Encoding Cp1146 is working properly
Encoding Cp1147 is working properly
Encoding Cp1148 is working properly
Encoding Cp1149 is working properly
Encoding Cp1153 is working properly
Encoding Cp1388 produces errors
Encoding Cp1399 produces errors
Encoding Cp273 is working properly
Encoding Cp285 is working properly
Encoding Cp297 is working properly
Encoding Cp424 is working properly
Encoding Cp870 is working properly
Encoding Cp933 produces errors
Encoding Cp937 produces errors
Encoding CpMacintosh is working properly
Encoding CpTIS620 is working properly
Encoding Gb18030 is working properly
Encoding Gb2312 is working properly
Encoding NATSDANO is working properly

It works really well, except for the Chinese EBCDIC encodings. Somehow, the conversion produces different results there. The result is the same on my old and my new box.

So, what do you think of my solution?
And does anybody have an idea why the conversion fails for the Chinese Character-Sets?
Kind Regards, Yulivee


Replies are listed 'Best First'.
Re: Properly testing self-compiled character-encodings by Corion (Patriarch) on Jan 23, 2017 at 12:24 UTC
Yes, that would be my approach as well (and I should add those cases to Encode::DIN66003). Take a set of strings and their known, manually verified encoding, and test that your module still encodes them properly:

use strict;
use warnings;
use Test::More;
use Encode 'encode', 'decode';

# The expected bytes are placeholders; fill in manually verified values.
my @tests = (
    { known => "Hello World", bytes_1141 => "Hello World" },
    { known => "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}", bytes_1141 => "{" }, # or whatever
    { known => "\N{LATIN CAPITAL LETTER U WITH DIAERESIS}", bytes_1141 => "}" }, # or whatever
);

plan tests => 3 * @tests;

for my $test (@tests) {
    my( $name ) = $test->{name} || $test->{known};
    is encode( 'CP1141', $test->{known} ), $test->{bytes_1141}, "Encoding for '$name'";
    is decode( 'CP1141', encode( 'CP1141', $test->{known} ) ), $test->{known}, "Roundtrip for '$name'";
    is decode( 'CP1141', $test->{bytes_1141} ), $test->{known}, "Decoding for '$name'";
}

done_testing;

Some of the test cases won't roundtrip cleanly, but you should likely also test for unknown characters like the Euro sign or curly braces.

Update: Fixed module name, as spotted by choroba.
Re: Properly testing self-compiled character-encodings by LanX (Saint) on Jan 23, 2017 at 12:31 UTC
Hey! Maybe of interest: I wrote a routine peek() for visual testing of encoded strings. See "Re: Converting utf-8 to base64 and back".

For automatic testing, maybe a loop over eq-tests of octet strings (i.e. without the utf8 flag)? Probably speeding it up by testing a long string first, and only parts of it if the long test failed?

HTH :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
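
The "long string first, parts only on failure" idea can be sketched as a simple bisection. This is one possible reading of the suggestion, not LanX's own code: roundtrips and failing_chars are hypothetical names, the comparison is done on decoded text rather than on reference octets, and iso-8859-1 is only a stand-in encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# True if $text survives encode -> decode unchanged.
sub roundtrips {
    my ( $encoding, $text ) = @_;
    return decode( $encoding, encode( $encoding, $text ) ) eq $text;
}

# Test the whole string first and split it only on failure.
# Returns the single characters that fail on their own.
sub failing_chars {
    my ( $encoding, $text ) = @_;
    return () if roundtrips( $encoding, $text );
    return ($text) if length($text) <= 1;
    my $half = int( length($text) / 2 );
    return (
        failing_chars( $encoding, substr( $text, 0, $half ) ),
        failing_chars( $encoding, substr( $text, $half ) ),
    );
}

# Stand-in example: the Euro sign has no mapping in ISO-8859-1.
my @bad = failing_chars( 'iso-8859-1', "Gr\x{FC}\x{DF}e \x{20AC}" );
printf "U+%04X does not round-trip\n", ord($_) for @bad;
```

If a string fails as a whole while both halves pass, the function returns an empty list; for stateful multi-byte encodings that case would need separate handling.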

