
Having a single log file that mixes lines in different non-Unicode encodings is a very bad idea -- whoever came up with that idea should be asked to look elsewhere for employment (or simply introduced to others as "the one who made that really stupid mistake with the log files").

But, as noted in another reply, it looks like there is other evidence in each line about the language of origin, so you can make easy educated guesses about which character encoding is appropriate on a line-by-line basis. To the extent that this is true, your processing of the log would look like this:

use strict;
use warnings;
use Encode;

my %encoding = ( JAP => 'shiftjis',
                 RUS => 'cp1251',
                 # and so on...
                 # (figure out the actual encoding names for each "clue")
               );

# emit everything as UTF-8, whatever the input encoding was
binmode STDOUT, ":encoding(UTF-8)";

while (<>) {
    my $decoded;
    for my $lang ( keys %encoding ) {
        if ( /$lang/ ) {  # might need to be careful about how to match for language
                          # e.g. split into fields with Text::xSV, and test one field
            $decoded = decode( $encoding{$lang}, $_ );
            last;
        }
    }
    if ( !defined $decoded ) {
        warn "no language discernible at line $.\n";
        $decoded = decode( 'cp1252', $_ ); # assume Windows Latin-1 (cp1252) as a default
    }
    print $decoded;
}

That should put most of the data into a single, consistent, portable encoding (UTF-8). For lines whose actual language is misidentified (or unidentifiable), you'll probably see strings like ",????," where the question marks indicate characters that the Encode module could not convert (because it was told to use the wrong "legacy" code page). By default, decode() substitutes the Unicode replacement character U+FFFD for such input, and many displays render that as a question mark. Definitely study the man page for Encode.
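If you would rather catch those failures than find question marks in the output later, Encode's CHECK argument can make decode() die on malformed input. The following is only a minimal sketch: decode_strict is a hypothetical helper name, and the sample bytes are my own test data (the word "konnichiwa" in Shift_JIS), not anything taken from the log in question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): strict decode that returns
# undef on malformed input instead of substituting U+FFFD.
sub decode_strict {
    my ( $encoding, $bytes ) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on bad input;
        # LEAVE_SRC keeps it from modifying $bytes in place.
        decode( $encoding, $bytes,
                Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $text;    # undef if the bytes are not valid in $encoding
}

# Sample data (an assumption for illustration): "konnichiwa" in Shift_JIS
my $bytes = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd";

my $ok  = decode_strict( 'shiftjis', $bytes );   # right guess
my $bad = decode_strict( 'UTF-8',    $bytes );   # wrong guess

binmode STDOUT, ":encoding(UTF-8)";
print defined $ok  ? "shiftjis: decoded\n" : "shiftjis: failed\n";
print defined $bad ? "UTF-8: decoded\n"    : "UTF-8: failed\n";
```

Be aware that this only helps with encodings that have invalid byte sequences (multi-byte ones like Shift_JIS or UTF-8). Single-byte code pages such as cp1251 define nearly every byte value, so a wrong guess there will usually decode "successfully" into the wrong letters.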

While you're at it, it might make sense to divvy the lines into separate output files, according to language. What would be the point of including German entries in a log file that the Japanese are going to read, or vice-versa?

(For that matter, it would simplify things for you quite a bit if you could use those language cues in the ASCII content to start by splitting the log into separate files by language; then the character encodings will cease to be an issue.)
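Splitting by language first could look roughly like the sketch below. It copies each line, as raw bytes, into a per-language file, so each output file ends up in a single encoding. The tag list, the "log.LANG.txt" file names, the language_of and split_log helpers, and the comma-separated sample lines are all assumptions for illustration; substitute whatever cue your log really carries.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical language tags; use whatever ASCII cue the log contains.
my @tags = qw(JAP RUS);

# Return the first tag found in the line, or 'UNKNOWN' if none matches.
sub language_of {
    my ($line) = @_;
    for my $tag (@tags) {
        return $tag if $line =~ /\b\Q$tag\E\b/;
    }
    return 'UNKNOWN';
}

# Copy each line, undecoded, into "$dir/log.<LANG>.txt".
# Returns the sorted list of languages that were seen.
sub split_log {
    my ( $in, $dir ) = @_;
    my %fh;    # one output handle per language, opened on first use
    while ( defined( my $line = <$in> ) ) {
        my $lang = language_of($line);
        if ( !$fh{$lang} ) {
            open $fh{$lang}, '>:raw', "$dir/log.$lang.txt"
                or die "cannot open $dir/log.$lang.txt: $!";
        }
        print { $fh{$lang} } $line;
    }
    for my $lang ( keys %fh ) {
        close $fh{$lang} or die "cannot close log.$lang.txt: $!";
    }
    my @langs = sort keys %fh;
    return @langs;
}

# Demo on a small in-memory log, written to a temporary directory.
my $sample = "1,JAP,aaa\n2,RUS,bbb\n3,JAP,ccc\n4,,ddd\n";
open my $in, '<', \$sample or die "cannot open in-memory log: $!";
my $dir   = tempdir( CLEANUP => 1 );
my @langs = split_log( $in, $dir );
print "wrote files for: @langs\n";
```

For the real log you would pass an ordinary filehandle (opened with '<:raw') and a permanent directory instead of the in-memory sample and temp dir. Lines that land in the UNKNOWN file are the ones to inspect by hand.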

In reply to Re: dynamically detect code page by graff
in thread dynamically detect code page by edwardt_tril
