Japanese and German are not the only ones; there are also Chinese (traditional and simplified), Spanish, French, and
Polish (basically every high-ASCII and DBCS code page that Windows supports). What happens is that the application
creates the log files on the native Windows OS using whatever the default locale is on that platform. Those logs are
then forwarded to and collected by another machine, which saves them into one big file. The format of that big file
in turn depends on the native locale of the machine that does the log collection. Thanks

Re^3: dynamically detect code page
by edwardt_tril (Sexton) on Feb 25, 2006 at 04:05 UTC
    Actually, I was just wondering: would it work if I read in
    each line, converted every line to UTF-8, and then used
    UTF-8 operations for all the string regexes? I'm new to i18n manipulation in Perl.

      you're pretty much out of luck. you really need to get more information than you have available in just the log file.

      from the log provided, it looks like the multiple-charset part is a filename (possibly containing a virus or some such). maybe the filename is really short, one or two characters. there's no way to tell which codepage is the correct one. for instance, the single byte 0xE5 can be any of the following in just the first few ISO-8859 encodings...

      å
      ĺ
      ċ
      х
      م
      ε
      

      there is no way to correctly convert this one (or two or three) byte filename/whatever into UTF-8 unless you know the correct codepage beforehand. it just isn't going to happen.
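
      to see the ambiguity for yourself, here is a minimal sketch using the core Encode module. the helper name decode_candidates is made up for illustration; it just decodes the same bytes under each code page you hand it and returns the results side by side.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_candidates is a hypothetical helper, not part of Encode.
# it decodes the same raw bytes under every encoding name given
# and returns a hash of encoding name => decoded character string.
sub decode_candidates {
    my ($bytes, @encodings) = @_;
    my %result;
    for my $name (@encodings) {
        # decode() turns raw bytes into a Perl character string
        $result{$name} = decode($name, $bytes);
    }
    return %result;
}

binmode(STDOUT, ':encoding(UTF-8)');

my @pages = map { "iso-8859-$_" } (1, 2, 3, 5, 6, 7);
my %seen  = decode_candidates("\xE5", @pages);

# one input byte, a different character for every code page
for my $name (@pages) {
    printf "%-11s U+%04X %s\n", $name, ord($seen{$name}), $seen{$name};
}
```

      every one of those decodes succeeds without error, which is exactly the problem: nothing in the byte itself says which answer is the right one.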

      you'll have to arrange to receive a list of the machine names and their respective codepages beforehand. but once you have that, it is pretty easy to convert everything to UTF-8 and do any sort of regex manipulation.
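
      a rough sketch of that approach follows. everything specific in it is an assumption: the machine names and the %codepage_for table are invented, and it assumes each raw line starts with an ASCII machine name followed by a space, which may not match your real log layout.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# hypothetical table: machine name => code page that machine logs in
my %codepage_for = (
    'tokyo01'  => 'cp932',     # Japanese
    'berlin01' => 'cp1252',    # Western European
    'warsaw01' => 'cp1250',    # Central European
);

# decode_line is a made-up helper. it assumes the raw line begins
# with an ASCII machine name and a space, looks up that machine's
# code page, and returns the line as a Perl character string.
# it returns nothing when the machine is missing or unknown.
sub decode_line {
    my ($raw) = @_;
    my ($machine) = $raw =~ /^(\S+) /
        or return;                      # no machine name found
    my $codepage = $codepage_for{$machine}
        or return;                      # unknown machine: do not guess
    return decode($codepage, $raw);     # raw bytes -> characters
}

# regexes then work on characters; encode only when writing out
my $chars = decode_line("berlin01 gr\xFC\xDF");
if (defined $chars && $chars =~ /\x{FC}/) {
    print encode('UTF-8', $chars), "\n";
}
```

      the design choice here is to refuse to decode a line from an unknown machine rather than fall back to a default code page, since a wrong guess silently produces wrong characters.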