yulivee07 has asked for the wisdom of the Perl Monks concerning the following question:
    $ strings CP1141.so
    $

strings just returns nothing. Using another C compiler solved this problem and produced binary files that contain characters.
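Besides inspecting the shared object with strings, a compiled encoding can also be checked from Perl itself. The following is only a minimal sketch: `encoding_is_usable` is a hypothetical helper, and `cp1047` (an EBCDIC code page that ships with Encode) stands in for the self-compiled modules. A self-compiled module such as Encode::CP1141 would presumably have to be loaded with `use` first, and the encoding looked up under the name from its UCM file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK);

# Hypothetical helper: true if Encode can resolve $name and encode $sample
# without hitting an unmappable character.
sub encoding_is_usable {
    my ($name, $sample) = @_;
    my $enc = find_encoding($name) or return 0;
    my $copy  = $sample;    # encode() may modify its source when CHECK is set
    my $bytes = eval { $enc->encode($copy, FB_CROAK) };
    return ( defined $bytes && length $bytes ) ? 1 : 0;
}

# Assumption: 'cp1047' is a stand-in for the self-compiled encodings.
for my $name ('cp1047', 'no-such-encoding') {
    printf "%-18s %s\n", $name,
        encoding_is_usable($name, "A") ? "usable" : "not usable";
}
```

If the shared object contains no mapping data, I would expect either the lookup or the encode call to fail, but that is an assumption and not something verified against the broken build.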
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use Encode qw (:DEFAULT is_utf8);
    use Encode::CP924;
    # more encodings here, removed to save space

    my %encodings = (
        CP924 => { name => "ibm-924_P100-1998", string => "\N{U+0000}\N{U+0001}\N{U+0002}[...]" },
        # more encodings here, removed to save space
    );

    foreach my $encoding ( sort keys %encodings ) {
        print "Current Encoding: $encoding - $encodings{$encoding}{'name'}\n";
        my $utf8_decode = $encodings{$encoding}{'string'};
        my $encoded_output;
        # filecontent is encoded from utf-8 to current encoding
        eval { $encoded_output = encode( $encodings{$encoding}{'name'}, $utf8_decode ); };
        if ( $@ ){
            print $@,"skipping encoding\n";
            next;
        }
        open ( my $fh_out, '>', $encoding ) or die;
        print $fh_out $encoded_output;
        close $fh_out;
    }

To generate the string in the hash of the encoding, I crawled the UCM file of the corresponding character encoding to get the names of all Unicode code points to include. The script takes a UCM file as input and prints all Unicode code points in the format \N{U+0001} to STDOUT:
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Getopt::Long;

    our %opt = ();
    {
        my %options = ( 'file=s' => \$opt{file}, );
        GetOptions(%options);
    }
    exit 0 unless $opt{file};

    my $filename = $opt{file};
    open(my $fh, '<:encoding(UTF-8)', $filename) or die "Could not open file '$filename' $!";
    print "string => \"";
    while (my $row = <$fh>) {
        chomp $row;
        # UCM mapping lines start with a code point such as <U0041>
        if ( $row =~ /\<U[\w\d]{4}\>.*/) {
            $row =~ s/\<U([\w\d]{4})\>.*/\\N\{U+$1\}/g;
            print $row;
        }
    }
    print "\"\n";

Then I transferred the encoded files to my new host. On the new host I created a script called decode_it.pl. It reads in the file, decodes its content to UTF-8, and encodes it back to its original encoding. If the original text and the text after the round trip match, I count this as a successful test.
    #!/usr/bin/env perl

    use strict;
    use warnings;
    use utf8;
    use Encode qw (:DEFAULT is_utf8);

    my %encodings = (
        CP924 => "ibm-924_P100-1998",
        # more encodings here
    );

    exit 0 unless @ARGV;

    foreach my $enc_file ( @ARGV ) {
        next if $enc_file eq "decode_it.pl";
        next if $enc_file eq "encode_it.pl";
        next if $enc_file eq "generate_charmap_for_testing.pl";
        unless ( $encodings{$enc_file} ) {
            print "No valid encoding definition for $enc_file\n";
            next;
        }
        my $module = "Encode::".$enc_file;
        eval{
            (my $file = $module) =~ s|::|/|g;
            require $file.'.pm';
            $module->import();
            1;
        } or do {
            print "$module not found\n";
            next;
        };
        open( my $fh_in, '<', $enc_file) or next;
        my $filecontent = do{
            local $/ = undef; # input record separator undefined
            <$fh_in>
        };
        my $content;
        eval{ $content = decode ( $encodings{$enc_file}, $filecontent ); };
        if ( $@ ){
            print $@,"skipping encoding\n";
            next;
        }
        my $encoded_content = encode ( $encodings{$enc_file}, $content );
        my $decoded_content = decode ( $encodings{$enc_file}, $encoded_content );
        if ( $decoded_content eq $content ) {
            print "Encoding $enc_file is working properly\n";
        } else {
            print "Encoding $enc_file produces errors\n";
        }
    }

Final output looks like this:
    ./decode_it.pl *
    Encoding CP924 is working properly
    Encoding Cp1025 is working properly
    Encoding Cp1122 is working properly
    Encoding Cp1140 is working properly
    Encoding Cp1141 is working properly
    Encoding Cp1142 is working properly
    Encoding Cp1143 is working properly
    Encoding Cp1144 is working properly
    Encoding Cp1145 is working properly
    Encoding Cp1146 is working properly
    Encoding Cp1147 is working properly
    Encoding Cp1148 is working properly
    Encoding Cp1149 is working properly
    Encoding Cp1153 is working properly
    Encoding Cp1388 produces errors
    Encoding Cp1399 produces errors
    Encoding Cp273 is working properly
    Encoding Cp285 is working properly
    Encoding Cp297 is working properly
    Encoding Cp424 is working properly
    Encoding Cp870 is working properly
    Encoding Cp933 produces errors
    Encoding Cp937 produces errors
    Encoding CpMacintosh is working properly
    Encoding CpTIS620 is working properly
    Encoding Gb18030 is working properly
    Encoding Gb2312 is working properly
    Encoding NATSDANO is working properly

It works really well, except for the Chinese EBCDIC encodings. Somehow, the round trip produces different results for them. The result is the same on my old and the new box.
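One way to narrow down failures like these might be to test each code point on its own instead of one large string. The sketch below is not the script from the post: `failing_code_points` is a hypothetical helper, and `cp1047` (shipped with Encode) stands in for the self-compiled code pages. It passes `FB_CROAK` so that unmappable characters raise an error; without a CHECK argument, `encode` and `decode` substitute a replacement character silently, which a whole-file comparison can easily hide.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK);

# Hypothetical helper: returns those code points that do not survive an
# encode/decode round trip through the encoding named $enc_name.
sub failing_code_points {
    my ($enc_name, @code_points) = @_;
    my $enc = find_encoding($enc_name)
        or die "unknown encoding $enc_name\n";
    my @failed;
    for my $cp (@code_points) {
        my $char = chr $cp;
        my $ok = eval {
            my $copy  = $char;   # encode() may modify its source with CHECK set
            my $bytes = $enc->encode($copy, FB_CROAK);   # dies if unmappable
            my $back  = $enc->decode($bytes, FB_CROAK);  # dies if malformed
            $back eq $char;
        };
        push @failed, $cp unless $ok;
    }
    return @failed;
}

# Assumption: 'cp1047' is a stand-in; U+20AC is included as a character
# that this code page cannot represent.
my $name = 'cp1047';
printf "U+%04X does not round-trip through %s\n", $_, $name
    for failing_code_points( $name, 0x00 .. 0xFF, 0x20AC );
```

For the failing encodings, the code point list could be taken from the same UCM file that was used to build the test string. Whether a per-character test reproduces the failure for these multi-byte code pages is itself an assumption worth checking.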
Re: Properly testing self-compiled character-encodings
by Corion (Patriarch) on Jan 23, 2017 at 12:24 UTC

Re: Properly testing self-compiled character-encodings
by LanX (Saint) on Jan 23, 2017 at 12:31 UTC