PerlMonks  

Properly testing self-compiled character-encodings

by yulivee07 (Sexton)
on Jan 23, 2017 at 12:15 UTC

yulivee07 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks, I am looking for a proper way to test whether various character encodings work as expected on my platform.

I have some special IBM code pages I need to include for my platform (AIX 7.2). I built them according to the manual delivered with enc2xs: http://search.cpan.org/dist/Encode/bin/enc2xs

However, I do not trust them. I checked with the strings utility whether the produced binary file contains any characters:

    $ strings CP1141.so
    $

strings returned nothing. Using another C compiler solved this problem and produced binary files that contain characters.

So now I want to create a test to see if the encoding is working properly. I have an older machine where all those character-encodings are working and installed, so I could generate files containing correct information there and copy them over to the new machine for testing.

I am not entirely sure about a good testing strategy. I thought of testing only characters beyond the 128th character (below that they would all be equal, as it is ASCII). Does this seem reasonable?

I looked into the Encode distribution in search for unit-tests for encodings but didn't find much. Are there best practices for cases like this?
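One way to make the "which characters should I test" question concrete for single-byte code pages is to round-trip every possible byte value rather than only those above 127. The following is a minimal sketch, not a method from the Encode test suite. It uses `cp1047`, an EBCDIC code page bundled with Encode, as a stand-in for a self-compiled one such as CP1141, and the helper name `failing_bytes` is made up for illustration. It also prints the byte for 'A', which shows that an EBCDIC code page does not agree with ASCII below 128 either.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Round-trip every possible byte of a single-byte encoding and
# return the byte values that do not survive decode -> encode.
sub failing_bytes {
    my ($enc) = @_;
    my @failed;
    for my $byte (0x00 .. 0xFF) {
        my $octets = chr $byte;
        my $back = eval {
            my $chars = decode($enc, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
            encode($enc, $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        push @failed, $byte unless defined($back) && $back eq $octets;
    }
    return @failed;
}

# cp1047 ships with Encode (Encode::EBCDIC); it only stands in here
# for a self-compiled code page, which is an assumption of this sketch.
for my $enc (qw(iso-8859-1 cp1047)) {
    my @failed = failing_bytes($enc);
    printf "%-12s %d of 256 bytes failed the round trip\n", $enc, scalar @failed;
}

# EBCDIC differs from ASCII below 128 as well: 'A' is 0xC1, not 0x41.
printf "cp1047 'A' => 0x%02X\n", ord encode('cp1047', 'A');
```

A byte that has no mapping in the code page shows up as a failure here, so a non-empty result is a prompt to look at the UCM file, not automatically a compiler problem.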

Regards, Yulivee

Update: I wanted to share the solution I found and used for the problem. I still had a working AIX 6.1 machine with all encodings working, so I created this script to produce files encoded in the encodings I wanted to test:

encode_it.pl
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use Encode qw (:DEFAULT is_utf8);
    use Encode::CP924;
    # more encodings here, removed to save space

    my %encodings = (
        CP924 => {
            name   => "ibm-924_P100-1998",
            string => "\N{U+0000}\N{U+0001}\N{U+0002}[...]"
        },
        # more encodings here, removed to save space
    );

    foreach my $encoding ( sort keys %encodings ) {
        print "Current Encoding: $encoding - $encodings{$encoding}{'name'}\n";
        my $utf8_decode = $encodings{$encoding}{'string'};
        my $encoded_output;
        # filecontent is encoded from utf-8 to current encoding
        eval { $encoded_output = encode( $encodings{$encoding}{'name'}, $utf8_decode ); };
        if ( $@ ) {
            print $@, "skipping encoding\n";
            next;
        }
        open ( my $fh_out, '>', $encoding ) or die;
        print $fh_out $encoded_output;
        close $fh_out;
    }

To generate the string in the hash for each encoding, I crawled the UCM file of the corresponding character encoding to collect all Unicode code points to include. The script takes a UCM file as input and prints all code points in the format \N{U+0001} to STDOUT:

generate_charmap_for_testing.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Long;

    our %opt = ();
    {
        my %options = (
            'file=s' => \$opt{file},
        );
        GetOptions(%options);
    }
    exit 0 unless $opt{file};

    my $filename = $opt{file};
    open(my $fh, '<:encoding(UTF-8)', $filename)
        or die "Could not open file '$filename' $!";

    print "string => \"";
    while (my $row = <$fh>) {
        chomp $row;
        if ( $row =~ /\<U[\w\d]{4}\>.*/) {
            $row =~ s/\<U([\w\d]{4})\>.*/\\N\{U+$1\}/g;
            print $row;
        }
    }
    print "\"\n";

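The script above keeps only the code point from each UCM line. The same lines also carry the expected bytes, so a UCM file could in principle supply known-good pairs for testing as well. Here is a minimal sketch of parsing one mapping line, assuming the usual `<UXXXX> \xHH... |n` layout, where `|0` marks a round-trip mapping and other flags mark fallbacks. The function name `parse_ucm_line` and the sample line are illustrative, not taken from a real IBM table.

```perl
use strict;
use warnings;

# Parse one UCM mapping line into (code point, byte string, fallback flag).
# Returns an empty list for comments, header lines and anything else.
sub parse_ucm_line {
    my ($line) = @_;
    return unless $line =~ /^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)/;
    my ($hex_cp, $hex_bytes, $flag) = ($1, $2, $3);
    # Turn the \xHH escapes into the actual bytes.
    my $bytes = join '', map { chr(hex($_)) } $hex_bytes =~ /\\x([0-9A-Fa-f]{2})/g;
    return (hex($hex_cp), $bytes, $flag);
}

# Illustrative line only; it is not copied from one of the code pages above.
my ($cp, $bytes, $flag) = parse_ucm_line('<U0041> \xC1 |0');
printf "U+%04X => %s (flag %d)\n",
    $cp, join(' ', map { sprintf '0x%02X', ord($_) } split //, $bytes), $flag;
```

Restricting a test to lines with flag 0 avoids counting deliberate one-way fallback mappings as failures.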
Then I transferred the encoded files to my new host. On the new host I created a script called decode_it.pl. It reads in a file, decodes its content into Perl's internal representation, encodes that back to the original encoding and decodes the result again. If the text from the first decode and the text after the round trip match, I count this as a successful test.

decode_it.pl
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use Encode qw (:DEFAULT is_utf8);

    my %encodings = (
        CP924 => "ibm-924_P100-1998",
        # more encodings here
    );

    exit 0 unless @ARGV;

    foreach my $enc_file ( @ARGV ) {
        # skip the helper scripts living in the same directory
        next if $enc_file eq "decode_it.pl";
        next if $enc_file eq "encode_it.pl";
        next if $enc_file eq "generate_charmap_for_testing.pl";

        unless ( $encodings{$enc_file} ) {
            print "No valid encoding definition for $enc_file\n";
            next;
        }

        # load the matching Encode::<name> module at run time
        my $module = "Encode::".$enc_file;
        eval {
            (my $file = $module) =~ s|::|/|g;
            require $file.'.pm';
            $module->import();
            1;
        } or do {
            print "$module not found\n";
            next;
        };

        open( my $fh_in, '<', $enc_file) or next;
        my $filecontent = do {
            local $/ = undef;    # input record separator undefined
            <$fh_in>
        };

        my $content;
        eval { $content = decode ( $encodings{$enc_file}, $filecontent ); };
        if ( $@ ) {
            print $@,"skipping encoding\n";
            next;
        }

        my $encoded_content = encode ( $encodings{$enc_file}, $content );
        my $decoded_content = decode ( $encodings{$enc_file}, $encoded_content );

        if ( $decoded_content eq $content ) {
            print "Encoding $enc_file is working properly\n";
        } else {
            print "Encoding $enc_file produces errors\n";
        }
    }

The final output looks like this:

    ./decode_it.pl *
    Encoding CP924 is working properly
    Encoding Cp1025 is working properly
    Encoding Cp1122 is working properly
    Encoding Cp1140 is working properly
    Encoding Cp1141 is working properly
    Encoding Cp1142 is working properly
    Encoding Cp1143 is working properly
    Encoding Cp1144 is working properly
    Encoding Cp1145 is working properly
    Encoding Cp1146 is working properly
    Encoding Cp1147 is working properly
    Encoding Cp1148 is working properly
    Encoding Cp1149 is working properly
    Encoding Cp1153 is working properly
    Encoding Cp1388 produces errors
    Encoding Cp1399 produces errors
    Encoding Cp273 is working properly
    Encoding Cp285 is working properly
    Encoding Cp297 is working properly
    Encoding Cp424 is working properly
    Encoding Cp870 is working properly
    Encoding Cp933 produces errors
    Encoding Cp937 produces errors
    Encoding CpMacintosh is working properly
    Encoding CpTIS620 is working properly
    Encoding Gb18030 is working properly
    Encoding Gb2312 is working properly
    Encoding NATSDANO is working properly

It works really well, except for the Chinese EBCDIC encodings. Somehow the round trip produces different results there. The result is the same on my old and my new box.

So, what do you think of my solution?
And does anybody have an idea why the conversion fails for the Chinese Character-Sets?
Kind Regards, Yulivee
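For the encodings that report errors, a pass/fail line says little about where things diverge. A small diagnostic that compares the original file bytes with the re-encoded bytes and prints the first differing offset may help narrow it down. This is a sketch only: the helper names `first_mismatch` and `hex_window` are invented here, and the two byte strings are hypothetical sample data, not output from any of the code pages above.

```perl
use strict;
use warnings;

# Return the offset of the first differing byte, or -1 if both byte
# strings are identical. A length difference counts as a mismatch at
# the end of the shorter string.
sub first_mismatch {
    my ($left, $right) = @_;
    my $min = length($left) < length($right) ? length($left) : length($right);
    for my $i (0 .. $min - 1) {
        return $i if substr($left, $i, 1) ne substr($right, $i, 1);
    }
    return length($left) == length($right) ? -1 : $min;
}

# Hex dump of a few bytes around an offset, for eyeballing.
sub hex_window {
    my ($bytes, $offset, $radius) = @_;
    my $start = $offset > $radius ? $offset - $radius : 0;
    my $slice = substr($bytes, $start, 2 * $radius + 1);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $slice;
}

# Hypothetical data standing in for the file content and the re-encoded content.
my $original  = "\x0E\x41\x42\x0F\xC1";
my $reencoded = "\x0E\x41\x42\xC1";

my $pos = first_mismatch($original, $reencoded);
if ($pos >= 0) {
    printf "first difference at byte %d\n", $pos;
    printf "original : %s\n", hex_window($original,  $pos, 2);
    printf "reencoded: %s\n", hex_window($reencoded, $pos, 2);
}
```

If the failing code pages mix single-byte and double-byte characters and switch between them with shift bytes, the first difference might sit at such a switch. That is a guess worth checking, not something established in this thread.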

Replies are listed 'Best First'.
Re: Properly testing self-compiled character-encodings
by Corion (Patriarch) on Jan 23, 2017 at 12:24 UTC

    Yes, that would be my approach as well (and I should add those cases to Encode::DIN66003). Take a set of strings and their known, manually verified encoding, and test that your module still encodes them properly:

    use Test::More;
    use charnames ':full';    # needed for \N{NAME} before Perl 5.16
    use Encode 'encode', 'decode';

    # The byte strings below are placeholders; replace them with
    # manually verified values for the code page under test.
    my @tests = (
        { known => "Hello World", bytes_1141 => "Hello World" },
        { known => "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}", bytes_1141 => "{" }, # or whatever
        { known => "\N{LATIN CAPITAL LETTER U WITH DIAERESIS}", bytes_1141 => "}" }, # or whatever
    );

    for my $test (@tests) {
        my( $name ) = $test->{name} || $test->{known};
        is( encode( 'CP1141', $test->{known} ), $test->{bytes_1141}, "Encoding for '$name'" );
        is( decode( 'CP1141', encode( 'CP1141', $test->{known} ) ), $test->{known}, "Roundtrip for '$name'" );
        is( decode( 'CP1141', $test->{bytes_1141} ), $test->{known}, "Decoding for '$name'" );
    }

    done_testing;

    Some of the test cases won't roundtrip cleanly, but you should likely also test for unknown characters like the Euro sign or curly braces.
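To illustrate the point about unknown characters: by default `encode()` silently substitutes a character it cannot map, so a test that only checks "did it return something" would miss it. A minimal sketch, using `iso-8859-1` and the Euro sign as a stand-in pair because that combination is unmappable in every Encode installation; the self-compiled code pages may well map the Euro sign.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20AC}";    # EURO SIGN, not part of ISO-8859-1

# Default behaviour: the unmappable character is replaced by the
# encoding's substitution character instead of raising an error.
my $lossy = encode('iso-8859-1', "5$euro");
printf "lossy : %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $lossy;

# Strict behaviour: FB_CROAK makes encode() die on the first character
# that has no mapping, so a test can detect it reliably.
my $strict = eval {
    encode('iso-8859-1', "5$euro", Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print defined($strict)
    ? "strict: encoded\n"
    : "strict: died, as expected for an unmappable character\n";
```

In a Test::More file the strict call could be wrapped in an `ok( !defined eval { ... } )` style check for characters that are expected to be absent from the code page.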

    Update: Fixed module name, as spotted by choroba.

Re: Properly testing self-compiled character-encodings
by LanX (Saint) on Jan 23, 2017 at 12:31 UTC
    Hey!

    Maybe of interest, I wrote a routine peek() for visual testing of encoded strings.

    See  Re: Converting utf-8 to base64 and back 

    for automatic testing maybe a loop over eq-tests of octet strings (ie without utf8 flag)?

    probably speeding it up by testing a long string first and only parts of it if the long test failed?
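The "long string first, parts only on failure" idea can be sketched as a recursive bisection. This is one possible reading of the suggestion, not LanX's own code; the names `roundtrips` and `failing_chars` are made up, and `iso-8859-1` with two deliberately unmappable characters is used only so the example runs on any Perl with Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# True if $chars survives encode -> decode unchanged in $enc.
sub roundtrips {
    my ($enc, $chars) = @_;
    my $back = eval {
        my $octets = encode($enc, $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        decode($enc, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined($back) && $back eq $chars;
}

# Test the whole string first; only when that fails, split it in half
# and recurse until the failing single characters are isolated.
sub failing_chars {
    my ($enc, $chars) = @_;
    return () if roundtrips($enc, $chars);
    return ($chars) if length($chars) == 1;
    my $mid = int(length($chars) / 2);
    return (
        failing_chars($enc, substr($chars, 0, $mid)),
        failing_chars($enc, substr($chars, $mid)),
    );
}

my $sample = "abc\x{20AC}def\x{4E2D}";
printf "U+%04X does not round-trip\n", ord($_)
    for failing_chars('iso-8859-1', $sample);
```

One limitation: a character that only fails in combination with its neighbours would make the long string fail while both halves pass, in which case this sketch reports nothing for that region.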

    HTH :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Node Type: perlquestion [id://1180148]
Approved by Discipulus
Front-paged by kcott