in reply to dynamically detect code page
But, as noted in another reply, it looks like there is other evidence in each line about the language of origin, so you can make easy educated guesses about which character encoding is appropriate on a line-by-line basis. To the extent that this is true, your processing of the log would look like this:
```perl
use strict;
use Encode;

my %encoding = (
    JAP => 'shiftjis',
    RUS => 'cp1251',
    # and so on...
    # (figure out the actual encoding names for each "clue")
);

binmode STDOUT, ":utf8";

while (<>) {
    my $decoded = '';
    for my $lang ( keys %encoding ) {
        if ( /$lang/ ) {
            # might need to be careful about how to match for language
            # e.g. split into fields with Text::xSV, and test one field
            $decoded = decode( $encoding{$lang}, $_ );
            last;
        }
    }
    if ( $decoded eq '' ) {
        warn "no language discernable at line $.\n";
        $decoded = decode( 'cp1252', $_ );  # assume Latin1 as a default
    }
    print $decoded;
}
```

That should put most of the data into a single, consistent, portable encoding (utf8). For lines whose actual language is misidentified (or unidentifiable), you'll probably see strings like ",????," where the question marks indicate characters that the Encode module could not convert (because it was told to use the wrong "legacy" code page). Definitely study the man page for Encode.
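On the comment about matching the language more carefully: rather than pattern-matching the whole line, you could pull out just the field that carries the language tag and look it up directly. Here's a minimal sketch that assumes (hypothetically) the tag is the second comma-separated field; if the log is real CSV with quoting, reach for Text::xSV or Text::CSV instead of a plain split.

```perl
use strict;
use Encode;

my %encoding = ( JAP => 'shiftjis', RUS => 'cp1251' );  # same table as above

binmode STDOUT, ":utf8";

while (<>) {
    # hypothetical layout: timestamp,LANG,message...
    my ( undef, $lang ) = split /,/, $_, 3;
    my $cp = defined $lang ? $encoding{$lang} : undef;
    $cp = 'cp1252' unless defined $cp;   # fall back to the Latin1-ish default
    print decode( $cp, $_ );
}
```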
While you're at it, it might make sense to divvy the lines into separate output files, according to language. What would be the point of including German entries in a log file that the Japanese are going to read, or vice-versa?
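If you do go the separate-files route, it only takes a small change to the loop above: open one UTF-8 output handle per language and print each decoded line to the matching file. A sketch, assuming the same %encoding table and a hypothetical naming scheme of log.<LANG>.txt:

```perl
use strict;
use Encode;

my %encoding = ( JAP => 'shiftjis', RUS => 'cp1251' );
my %fh;

# one utf8 output file per language, plus one for lines we can't place
for my $lang ( keys %encoding, 'UNKNOWN' ) {
    open $fh{$lang}, '>:utf8', "log.$lang.txt"
        or die "can't write log.$lang.txt: $!";
}

while (<>) {
    my $out = $fh{UNKNOWN};
    my $cp  = 'cp1252';
    for my $lang ( keys %encoding ) {
        if ( /$lang/ ) {
            ( $out, $cp ) = ( $fh{$lang}, $encoding{$lang} );
            last;
        }
    }
    print { $out } decode( $cp, $_ );
}
```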
(For that matter, it would simplify things for you quite a bit if you could use those language cues in the ASCII content to start by splitting the log into separate files by language; then the character encodings will cease to be an issue.)
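For that kind of pre-pass you don't even need Encode: just route the raw bytes into one file per language and deal with each file in its own single encoding later (or hand it straight to the people who can read it). A sketch, again assuming the language cues appear literally in each line and using hypothetical output names:

```perl
use strict;

my @langs = qw( JAP RUS GER );   # whatever clues actually appear in the log
my %fh;

for my $lang ( @langs, 'UNKNOWN' ) {
    open $fh{$lang}, '>:raw', "split.$lang.log"
        or die "can't write split.$lang.log: $!";
}

while (<>) {
    my $line = $_;
    my $dest = 'UNKNOWN';
    for my $lang (@langs) {
        if ( index( $line, $lang ) >= 0 ) {
            $dest = $lang;
            last;
        }
    }
    # pass the bytes through untouched; no decoding needed at this stage
    print { $fh{$dest} } $line;
}
```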