in reply to dynamically detect code page

are Japanese (SJIS) and German (cpXXXX) the only languages?

since SJIS is a multibyte encoding and was created by MicroSloth to be semi-compatible with one of their old codepages... they did a hack and slash of the standard JIS/EUC type multibyte encodings. because of this SJIS is relatively easy to detect: if you try to treat a randomish string of 8-bit characters as SJIS and convert it to UTF-8, you'll most likely end up with invalid characters. a string of 7-bit characters in SJIS is just the same as 7-bit ASCII.

so take each line one at a time. if there are no 8-bit characters then the line could be anything, but it doesn't matter because all the characters are in the 7-bit ASCII range.

if there are 8-bit characters, pretend the line is SJIS and try to convert it to UTF-8. if there are errors, then the line is most likely not SJIS and instead is German (or some other single-byte codepage). if there are no errors then it's most likely SJIS Japanese.
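the per-line check described above can be sketched roughly like this. it's in Python rather than Perl just to keep it short and testable. `classify_line`, `to_utf8` and the `cp1252` fallback are names and assumptions made up for the sketch, not anything from the original log setup.

```python
def classify_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding label for one log line given as raw bytes."""
    # No byte has the high bit set: plain 7-bit ASCII, nothing to decide.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding raises on byte sequences that are not valid SJIS.
        raw.decode("shift_jis")
    except UnicodeDecodeError:
        # Not valid SJIS: assume the single-byte fallback (an assumption,
        # the real code page cannot be derived from the bytes alone).
        return fallback
    return "shift_jis"


def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode one line as UTF-8 using the guessed encoding."""
    enc = classify_line(raw, fallback)
    # errors="replace" keeps going on bytes the fallback code page leaves undefined.
    return raw.decode(enc, errors="replace").encode("utf-8")
```

keep in mind this is only a heuristic. bytes 0xA1 through 0xDF are valid single-byte half-width katakana in SJIS, so a short German string such as "Straße" in cp1252 (where ß is 0xDF) decodes without error and gets misclassified as SJIS.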

as for telling apart other single-byte codepages... no idea.

Re^2: dynamically detect code page
by edwardt_tril (Sexton) on Feb 25, 2006 at 02:04 UTC
    Japanese and German are not the only ones; there are also Chinese (traditional/simplified), Spanish, French, Polish (basically
    all the high-ASCII and DBCS code pages that Windows supports). What happens is that the log files are created by the application on the
    native Windows OS using whatever the default locale is on that platform, then those logs are forwarded to and collected by some
    other machine and saved into one big file. Again, the format of the big file depends on the native locale of the machine that
    does the log collection. Thanks
      actually .. just wondering.. would it work if I take in
      each line, convert every line to UTF-8, and use only
      UTF-8 operations in the string regexes? new to i18n manipulation in Perl.

        you're pretty much out of luck. you really need to get more information than you have available in just the log file.

        from the log provided, it looks like the multiple-charset part is a filename (possibly containing a virus or some such). maybe the filename is really short, one or two characters. there's no way to tell which codepage is the correct one. for instance, the single byte 0xE5 can be any of the following in just the first few ISO-8859 encodings...

        å
        ĺ
        ċ
        х
        م
        ε
        

        there is no way to correctly convert this one (or two or three) byte filename/whatever into UTF-8 unless you know the correct codepage beforehand. it just isn't going to happen.
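the 0xE5 ambiguity listed above is easy to reproduce. here's a small sketch in Python (the helper name `candidates` is made up for illustration) that decodes the same single byte with ISO-8859 parts 1, 2, 3, 5, 6 and 7 and prints the code point each one produces.

```python
import unicodedata

RAW = b"\xe5"
# ISO-8859 parts 1, 2, 3, 5, 6 and 7.
CODECS = ["iso-8859-1", "iso-8859-2", "iso-8859-3",
          "iso-8859-5", "iso-8859-6", "iso-8859-7"]


def candidates(raw: bytes, codecs=CODECS) -> dict:
    """Decode the same bytes under each codec and collect the results."""
    return {name: raw.decode(name) for name in codecs}


for name, text in candidates(RAW).items():
    # Print the code point and Unicode name rather than the glyph itself.
    print(f"{name}: U+{ord(text):04X} {unicodedata.name(text)}")
```

every decode succeeds and every result is a different letter, so nothing in the byte itself says which one was meant.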

        you'll have to arrange to receive a list of the machine names and their respective codepages beforehand. but once you have that, it is pretty easy to convert everything to UTF-8 and do any sort of regex manipulation.
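once such a list exists, the conversion step could look something like the sketch below (Python, for brevity). the host names, the code pages assigned to them, and the idea that you already know which host produced each line are all assumptions for illustration. the real table has to come from whoever runs those machines.

```python
import re

# Hypothetical host names and code pages, not taken from the original logs.
HOST_CODEPAGES = {
    "tokyo-01": "cp932",
    "berlin-01": "cp1252",
    "warsaw-01": "cp1250",
}


def line_to_text(host: str, raw: bytes) -> str:
    """Decode one raw log line using the code page registered for its host."""
    return raw.decode(HOST_CODEPAGES[host])


def line_to_utf8(host: str, raw: bytes) -> bytes:
    """Convert one raw log line to UTF-8 bytes."""
    return line_to_text(host, raw).encode("utf-8")


# The same byte means different letters depending on the sending host.
german = line_to_text("berlin-01", b"\xea")
polish = line_to_text("warsaw-01", b"\xea")

# After decoding, ordinary Unicode-aware regexes work on the text.
text = line_to_text("berlin-01", "Datei gelöscht".encode("cp1252"))
match = re.search(r"gel\w+", text)
```

an unknown host raises `KeyError` here, which is deliberate: guessing a code page silently is how the mixed-up file happened in the first place.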