Hi Perlmonks,
I am searching for a proper way to test various character-encodings if they works as expected on my platform.
I have some special IBM-character codepages I need to include for my Platfrom (AIX 7.2). I have built these according to the manual delivered with enc2xs
http://search.cpan.org/dist/Encode/bin/enc2xs
However, I do not trust them. I took a look with the strings utility whether there are characters in the produced binary file:
$ strings CP1141.so
$
strings just returns nothing. Using another C-Compiler solved this problem and produced binary files that contain characters.
So now I want to create a test to see if the encoding is working properly. I have an older machine where all those character-encodings are working and installed, so I could generate files containing correct information there and copy them over to the new machine for testing.
I am not entirely shure about a good testing strategy. I thought of testing characters beyond the 128th character (as below theyy would be all equal as it is ASCII). Does this seem reasonable?
I looked into the Encode distribution in search for unit-tests for encodings but didn't find much. Are there best practices for cases like this?
Regards,
Yulivee
Update:
I wanted to share my solution I found and used fro the problem. I still had a working AIX 6.1 Machine with all encodings working, so I created this script, to produce me encoded files in the encodings I wanted to test:
encode_it.pl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Encode qw (:DEFAULT is_utf8);
use Encode::CP924;
# more encodings here, removed to save space
my %encodings = (
CP924 => {
name => "ibm-924_P100-1998",
string => "\N{U+0000}\N{U+0001}\N{U+0002}[...]"
},
#more encodings here, removed to save space
);
foreach my $encoding ( sort keys %encodings ) {
print "Current Encoding: $encoding - $encodings{$encoding}{'name'}
+ \n";
my $utf8_decode = $encodings{$encoding}{'string'};
my $encoded_output;
eval { $encoded_output = encode( $encodings{$encoding}{'name'}, $u
+tf8_decode ); }; # filecontent is encoded from utf-8 to current encod
+ing
if ( $@ ){
print $@,"skipping encoding\n";
next;
}
open ( my $fh_out, '>', $encoding ) or die;
print $fh_out $encoded_output;
close $fh_out;
}
to generate the string in the hash of the encoding, I crawled the UCM-File of the corresponding character-encoding to get the name of all unicode-points to include. The script takes a ucm-file as input and prints alls unicode-points in format \N{U+0001} to STDOUT
generate_charmap_for_testing.pl
#!/usr/bin/perl
+
+
+
use strict;
use warnings;
use Getopt::Long;
our %opt = ();
{
my %options = (
'file=s' => \$opt{file},
);
GetOptions(%options);
}
exit 0 unless $opt{file};
my $filename = $opt{file};
open(my $fh, '<:encoding(UTF-8)', $filename) or die "Could not open fi
+le '$filename' $!";
print "string => \"";
while (my $row = <$fh>) {
chomp $row;
if ( $row =~ /\<U[\w\d]{4}\>.*/) {
$row =~ s/\<U([\w\d]{4})\>.*/\\N\{U+$1\}/g;
print $row;
}
}
print "\"\n";
Then I transferred the encoded files to my new host. On the new host I created a script called decode_it.pl. It reads in the file, decodes its decoding to utf-8, and back to its original encoding. If the original text and the one after encoding back match, I count this as a succesfull test.
decode_it.pl
#!/usr/bin/env/perl
+
+
+
use strict;
use warnings;
use utf8;
use Encode qw (:DEFAULT is_utf8);
my %encodings = (
CP924 => "ibm-924_P100-1998",
# more encodings here
);
exit 0 unless @ARGV;
foreach my $enc_file ( @ARGV ) {
next if $enc_file eq "decode_it.pl";
next if $enc_file eq "encode_it.pl";
next if $enc_file eq "generate_charmap_for_testing.pl";
unless ( $encodings{$enc_file} ) {
print "No valid encoding definition for $enc_file\n";
next;
}
my $module = "Encode::".$enc_file;
eval{
(my $file = $module) =~ s|::|/|g;
require $file.'.pm';
$module->import();
1;
} or do {
print "$module not found\n";
next;
};
open( my $fh_in, '<', $enc_file) or next;
my $filecontent = do{
local $/ = undef; # input record separator u
+ndefined
<$fh_in>
};
my $content;
eval{ $content = decode ( $encodings{$enc_file}, $filecontent ); }
+;
if ( $@ ){
print $@,"skipping encoding\n";
next;
}
my $encoded_content = encode ( $encodings{$enc_file}, $content );
my $decoded_content = decode ( $encodings{$enc_file}, $encoded_con
+tent );
if ( $decoded_content eq $content ) {
print "Encoding $enc_file is working properly\n";
} else {
print "Encoding $enc_file produces errors\n";
}
}
Final Output looks like this:
./decode_it.pl *
Encoding CP924 is working properly
Encoding Cp1025 is working properly
Encoding Cp1122 is working properly
Encoding Cp1140 is working properly
Encoding Cp1141 is working properly
Encoding Cp1142 is working properly
Encoding Cp1143 is working properly
Encoding Cp1144 is working properly
Encoding Cp1145 is working properly
Encoding Cp1146 is working properly
Encoding Cp1147 is working properly
Encoding Cp1148 is working properly
Encoding Cp1149 is working properly
Encoding Cp1153 is working properly
Encoding Cp1388 produces errors
Encoding Cp1399 produces errors
Encoding Cp273 is working properly
Encoding Cp285 is working properly
Encoding Cp297 is working properly
Encoding Cp424 is working properly
Encoding Cp870 is working properly
Encoding Cp933 produces errors
Encoding Cp937 produces errors
Encoding CpMacintosh is working properly
Encoding CpTIS620 is working properly
Encoding Gb18030 is working properly
Encoding Gb2312 is working properly
Encoding NATSDANO is working properly
It works really well - except for the Chinese EBCDIC encodings. Somehow, the transition does produce different results. The result is the same on my old and the new box.
So, what do you think of my solution?
And does anybody have an idea why the conversion fails for the Chinese Character-Sets?
Kind Regards,
Yulivee