in reply to dynamically detect code page
But, as noted in another reply, it looks like there is other evidence in each line about the language of origin, so you can make easy educated guesses about which character encoding is appropriate on a line-by-line basis. To the extent that this is true, your processing of the log would look like this:
```perl
use strict;
use Encode;

my %encoding = (
    JAP => 'shiftjis',
    RUS => 'cp1251',
    # and so on...
    # (figure out the actual encoding names for each "clue")
);

binmode STDOUT, ":utf8";

while (<>) {
    my $decoded = '';
    for my $lang ( keys %encoding ) {
        if ( /$lang/ ) {
            # might need to be careful about how to match for language
            # e.g. split into fields with Text::xSV, and test one field
            $decoded = decode( $encoding{$lang}, $_ );
            last;
        }
    }
    if ( $decoded eq '' ) {
        warn "no language discernable at line $.\n";
        $decoded = decode( 'cp1252', $_ );  # assume Latin1 as a default
    }
    print $decoded;
}
```

That should put most of the data into a single, consistent, portable encoding (utf8). For lines whose actual language is misidentified (or unidentifiable), you'll probably see strings like ",????," where the question marks indicate characters that the Encode module could not convert (because it was told to use the wrong "legacy" code page). Definitely study the man page for Encode.
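On the comment about matching the language more carefully: rather than pattern-matching the whole line, you could pull out just the field that carries the language tag and look it up directly. Here's a minimal sketch that assumes (hypothetically) the tag is the second comma-separated field; if the log is real CSV with quoting, reach for Text::xSV or Text::CSV instead of a plain split.

```perl
use strict;
use Encode;

my %encoding = ( JAP => 'shiftjis', RUS => 'cp1251' );  # same table as above

binmode STDOUT, ":utf8";

while (<>) {
    # hypothetical layout: timestamp,LANG,message...
    my ( undef, $lang ) = split /,/, $_, 3;
    my $cp = defined $lang ? $encoding{$lang} : undef;
    $cp = 'cp1252' unless defined $cp;   # fall back to the Latin1-ish default
    print decode( $cp, $_ );
}
```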
While you're at it, it might make sense to divvy the lines into separate output files, according to language. What would be the point of including German entries in a log file that the Japanese are going to read, or vice-versa?
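If you do go the separate-files route, it only takes a small change to the loop above: open one UTF-8 output handle per language and print each decoded line to the matching file. A sketch, assuming the same %encoding table and a hypothetical naming scheme of log.<LANG>.txt:

```perl
use strict;
use Encode;

my %encoding = ( JAP => 'shiftjis', RUS => 'cp1251' );
my %fh;

# one utf8 output file per language, plus one for lines we can't place
for my $lang ( keys %encoding, 'UNKNOWN' ) {
    open $fh{$lang}, '>:utf8', "log.$lang.txt"
        or die "can't write log.$lang.txt: $!";
}

while (<>) {
    my $out = $fh{UNKNOWN};
    my $cp  = 'cp1252';
    for my $lang ( keys %encoding ) {
        if ( /$lang/ ) {
            ( $out, $cp ) = ( $fh{$lang}, $encoding{$lang} );
            last;
        }
    }
    print { $out } decode( $cp, $_ );
}
```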
(For that matter, it would simplify things for you quite a bit if you could use those language cues in the ASCII content to start by splitting the log into separate files by language; then the character encodings will cease to be an issue.)
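For that kind of pre-pass you don't even need Encode: just route the raw bytes into one file per language and deal with each file in its own single encoding later (or hand it straight to the people who can read it). A sketch, again assuming the language cues appear literally in each line and using hypothetical output names:

```perl
use strict;

my @langs = qw( JAP RUS GER );   # whatever clues actually appear in the log
my %fh;

for my $lang ( @langs, 'UNKNOWN' ) {
    open $fh{$lang}, '>:raw', "split.$lang.log"
        or die "can't write split.$lang.log: $!";
}

while (<>) {
    my $line = $_;
    my $dest = 'UNKNOWN';
    for my $lang (@langs) {
        if ( index( $line, $lang ) >= 0 ) {
            $dest = $lang;
            last;
        }
    }
    # pass the bytes through untouched; no decoding needed at this stage
    print { $fh{$dest} } $line;
}
```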