edwardt_tril has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file that contains log lines from multiple Windows OSes in different languages.
The log file is always saved as raw bytes in the native (non-Unicode) code page of the OS it was written on. That means when the log file is
saved and opened in Notepad under an English OS, the Japanese and
German characters display as junk in the English
character set (see below). When the same log file is opened on a Japanese OS, the Japanese characters display correctly as
Japanese, but the German ones still look like junk. I need to parse these log files to extract each field so I can do
substitutions, add or delete fields, then write the result back to a new log file (see Note).
I want to extract and replace those fields with new user-defined ones (of course the user needs
to supply the replacement field in the same code page as that particular log line).
Is there a way to dynamically detect which code page (which character set) each line is using? Do I need to do that for my purpose?
What other ideas do you guys have in mind?
Note:
1. each line is newline terminated
2. each field (or column) in the line is delimited by a comma (,)
3. note that a field may contain \\ inside it, which I think will interfere with the regex operations in Perl
Thanks
230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Program File +s\netjumper\linkgrabber99\TSImages\testraffles.jpg,1,4,1,0,1090519040 +,"",1129718602,,0,101 0 0 File Remediation Delete +C:\\Program Files\\netjumper\\linkgrabber99\\TSImages\\testraffles.jp +g 2001 1 3f2d7104-9ac7-4867-aa40-73ea69b9a6a2,120783 +7606,4294905926,0,0,0,0,0,0,,0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D-48 +4D-AB83-11D2F12B09E9},,(IP)-10.160.32.162,JAPSCS30,WGSCS3.0,00:0D:56: +7E:99:D3,10.0.0.359,,,,,,,,,,,,,,,,0,2A2AD14D03D22042B368E2D057BF9AE1 +,1184e2df-fd6d-4911-a004-b244a204cd43,78381056,JAP-LINKSYS3-0 230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Program File +s\netjumper\linkgrabber99\INSTALL.LOG,1,4,1,0,1090519040,"",112971860 +2,,0,101 0 0 File Remediation Delete C:\\Program F +iles\\netjumper\\linkgrabber99\\INSTALL.LOG 2001 1 3 +f2d7104-9ac7-4867-aa40-73ea69b9a6a2,1207837607,4294905926,0,0,0,0,0,0 +,,0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D-484D-AB83-11D2F12B09E9},,(IP) +-10.160.32.162,JAPSCS30,WGSCS3.0,00:0D:56:7E:99:D3,10.0.0.359,,,,,,,, +,,,,,,,,0,2A2AD14D03D22042B368E2D057BF9AE1,1184e2df-fd6d-4911-a004-b2 +44a204cd43,78381056,JAP-LINKSYS3-0 230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Program File +s\netjumper\linkgrabber99\ReadMe.txt,1,4,1,0,1090519040,"",1129718602 +,,0,101 0 0 File Remediation Delete C:\\Program Fi +les\\netjumper\\linkgrabber99\\ReadMe.txt 2001 1 3f2 +d7104-9ac7-4867-aa40-73ea69b9a6a2,1207837608,4294905926,0,0,0,0,0,0,, +0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D-484D-AB83-11D2F12B09E9},,(IP)-1 +0.160.32.162,JAPSCS30,WGSCS3.0,00:0D:56:7E:99:D3,10.0.0.359,,,,,,,,,, +,,,,,,0,2A2AD14D03D22042B368E2D057BF9AE1,1184e2df-fd6d-4911-a004-b244 +a204cd43,78381056,JAP-LINKSYS3-0 230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Program File +s\netjumper\linkgrabber99\UNWISE.EXE,1,4,1,0,1090519040,"",1129718602 +,,0,101 0 0 File Remediation Delete C:\\Program Fi +les\\netjumper\\linkgrabber99\\UNWISE.EXE 2001 1 3f2 +d7104-9ac7-4867-aa40-73ea69b9a6a2,1207837609,4294905926,0,0,0,0,0,0,, +0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D-484D-AB83-11D2F12B09E9},,(IP)-1 +0.160.32.162,JAPSCS30,WGSCS3.0,00:0D:56:7E:99:D3,10.0.0.359,,,,,,,,,, +,,,,,,0,2A2AD14D03D22042B368E2D057BF9AE1,1184e2df-fd6d-4911-a004-b244 +a204cd43,78381056,JAP-LINKSYS3-0 230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Documents an +d Settings\Administrator\X^[g j[\vO\LinkGrabber99\L +inkGrabber99.lnk,1,4,1,0,1090519040,"",1129718602,,0,101 0 +0 File Remediation Delete C:\\Documents and Settings\\Admini +strator\\X^[g j[\\vO\\LinkGrabber99\\LinkGrabber99. +lnk 2001 1 3f2d7104-9ac7-4867-aa40-73ea69b9a6a2,1207 +837610,4294905926,0,0,0,0,0,0,,0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D- +484D-AB83-11D2F12B09E9},,(IP)-10.160.32.162,JAPSCS30,WGSCS3.0,00:0D:5 +6:7E:99:D3,10.0.0.359,,,,,,,,,,,,,,,,0,2A2AD14D03D22042B368E2D057BF9A +E1,1184e2df-fd6d-4911-a004-b244a204cd43,78381056,JAP-LINKSYS3-0 230913132C20,50,1,131174,JAP-LINKSYS3-0,Administrator,,C:\Documents an +d Settings\Administrator\X^[g j[\vO\LinkGrabber99\U +nwise.lnk,1,4,1,0,1090519040,"",1129718602,,0,101 0 0 Fi +le Remediation Delete C:\\Documents and Settings\\Administrator +\\X^[g j[\\vO\\LinkGrabber99\\Unwise.lnk + 2001 1 3f2d7104-9ac7-4867-aa40-73ea69b9a6a2,1207837611,4294905 +926,0,0,0,0,0,0,,0,0,0,0,JAP-NETGEAR3-0,{F38B5FB1-C17D-484D-AB83-11D2 +F12B09E9},,(IP)-10.160.32.162,JAPSCS30,WGSCS3.0,00:0D:56:7E:99:D3,10. 
+0.0.359,,,,,,,,,,,,,,,,0,2A2AD14D03D22042B368E2D057BF9AE1,1184e2df-fd +6d-4911-a004-b244a204cd43,78381056,JAP-LINKSYS3-0 230A08123534,6,2,1,GER-NETGEAR3-0,Administrator,,,,,,,16777216,"Could +not scan 1 files inside D:\Project\DUMBV\all DUMBVirus\Crash in Turbo +\TMDGB292.cab due to extraction errors encountered by the Decomposer +Engines.",0,,0,,,,,0,,,,,,,,,,,{E36FDC15-54A2-484A-BA84-998C32062FC4} +,,(IP)-10.160.32.144,GER_SCS30,WG_GER-ENG,00:12:3F:61:75:21,10.0.0.35 +9,,,,,,,,,,,,,,,,0,A63A014939DAB04B9169884492DA3F9F,,,GER-NETGEAR3-0 230A08123534,6,2,1,GER-NETGEAR3-0,Administrator,,,,,,,16777216,"Could +not scan 1 files inside D:\Project\DUMBV\all DUMBVirus\Crash in Turbo +\TMTC8DD0.cab due to extraction errors encountered by the Decomposer +Engines.",0,,0,,,,,0,,,,,,,,,,,{E36FDC15-54A2-484A-BA84-998C32062FC4} +,,(IP)-10.160.32.144,GER_SCS30,WG_GER-ENG,00:12:3F:61:75:21,10.0.0.35 +9,,,,,,,,,,,,,,,,0,A63A014939DAB04B9169884492DA3F9F,,,GER-NETGEAR3-0 230A08123534,5,1,1,GER-NETGEAR3-0,Administrator,Dir II.A,D:\Project\DU +MBV\all DUMBVirus\DB1.LZH>>.COM,5,1,1,2147483904,16420,"",11314 +71642,,0,,0,433,0,0,0,1,1,1,20051107.019,49622,2,5,0,,{E36FDC15-54A2- +484A-BA84-998C32062FC4},,(IP)-10.160.32.144,GER_SCS30,WG_GER-ENG,00:1 +2:3F:61:75:21,10.0.0.359,,,,,,,,,,,,,,,,0,A63A014939DAB04B9169884492D +A3F9F,,0,GER-NETGEAR3-0 230A08123534,5,1,1,GER-NETGEAR3-0,Administrator,DSCE.2100,D:\Project\D +UMBV\all DUMBVirus\DB1.LZH>>|\.COM,5,1,1,2147483904,17444,"",113147 +1642,,0,,0,12253,0,0,0,1,1,2,20051107.019,49622,0,4,0,,{E36FDC15-54A2 +-484A-BA84-998C32062FC4},,(IP)-10.160.32.144,GER_SCS30,WG_GER-ENG,00: +12:3F:61:75:21,10.0.0.359,,,,,,,,,,,,,,,,0,A63A014939DAB04B9169884492 +DA3F9F,,0,GER-NETGEAR3-0 230A08123534,5,1,1,GER-NETGEAR3-0,Administrator,XM.Laroux.A,D:\Project +\DUMBV\all DUMBVirus\DB1.LZH>>.XLS,5,1,1,2147484928,17444,"", +1131471642,,0,,0,8105,0,0,0,1,1,3,20051107.019,49622,0,4,0,,{E36FDC15 +-54A2-484A-BA84-998C32062FC4},,(IP)-10.160.32.144,GER_SCS30,WG_GER-EN +G,00:12:3F:61:75:21,10.0.0.359,,,,,,,,,,,,,,,,0,A63A014939DAB04B91698 +84492DA3F9F,,0,GER-NETGEAR3-0 230A08123534,5,1,1,GER-NETGEAR3-0,Administrator,WM.NPAD Variant,D:\Pro +ject\DUMBV\all DUMBVirus\DB1.LZH>>{.DOT,5,1,1,2147484928,17444,"", +1131471642,,0,,0,7890,0,0,0,1,1,4,20051107.019,49622,0,4,0,,{E36FDC15 +-54A2-484A-BA84-998C32062FC4},,(IP)-10.160.32.144,GER_SCS30,WG_GER-EN +G,00:12:3F:61:75:21,10.0.0.359,,,,,,,,,,,,,,,,0,A63A014939DAB04B91698 +84492DA3F9F,,0,GER-NETGEAR3-0 230A08123534,5,1,1,GER-NETGEAR3-0,Administrator,Jeru.1808.Frere Jac,D: +\Project\DUMBV\all DUMBVirus\DB1.LZH>>X.EXE,5,1,1,2147483904,17444," +",1131471642,,0,,0,755,0,0,0,1,1,5,20051107.019,49622,0,4,0,,{E36FDC1 +5-54A2-484A-BA84-998C32062FC4},,(IP)-10.160.32.144,GER_SCS30,WG_GER-E +NG,00:12:3F:61:75:21,10.0.0.359,,,,,,,,,,,,,,,,0,A63A014939DAB04B9169 +884492DA3F9F,,0,GER-NETGEAR3-0

Re: dynamically detect code page
by duckyd (Hermit) on Feb 23, 2006 at 20:05 UTC
    Is the file really written using different encodings for different languages? I.e., it's not all UTF-8? If it really has multiple encodings, I think the best you'll be able to do is make a best guess. Since it looks like your lines are pretty standard, you can probably do pretty well. See perldoc Encode. I would try treating the data as each possible language, and checking for unlikely characters or combinations of characters in the result.
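    For example, a minimal best-guess sketch along those lines (the candidate encoding list here is just an assumption; Encode's FB_CROAK fallback makes decode() die when the bytes are not valid for the encoding being tried):

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    # candidate encodings, strictest (multibyte) first -- adjust to the languages you expect
    my @candidates = ( 'shiftjis', 'cp1252' );

    sub guess_decode {
        my ($bytes) = @_;
        for my $enc (@candidates) {
            my $copy = $bytes;    # decode() with a CHECK flag may modify its input
            my $text = eval { decode( $enc, $copy, FB_CROAK ) };
            return ( $enc, $text ) if defined $text;
        }
        return;                   # no candidate decoded cleanly
    }

    while ( my $line = <> ) {
        my ( $enc, $text ) = guess_decode($line);
        printf "line %d looks like %s\n", $., defined $enc ? $enc : 'unknown';
    }

    Trying the stricter multibyte encoding first matters, because a permissive single-byte code page like cp1252 will accept almost any byte sequence without complaint.
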
Re: dynamically detect code page
by blahblahblah (Priest) on Feb 24, 2006 at 02:12 UTC
    I notice that some lines in your example contain "GER-" and some contain "JAP-". Could you use that as an indicator for which lines contain which characters?

    As far as detecting the Japanese encoding of a string, Jcode might be useful. (See the getcode sub.)
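    For example (a rough sketch; Jcode's getcode() returns a short code name such as 'sjis', 'euc', 'jis', 'utf8' or 'ascii' when it can guess, or undef when it cannot tell):

    use strict;
    use warnings;
    use Jcode;

    while ( my $line = <> ) {
        # getcode() guesses the Japanese encoding of a raw byte string
        my $code = Jcode::getcode($line);
        printf "line %d: %s\n", $., defined $code ? $code : 'undetermined';
    }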

    As duckyd alluded to, if you have any control over the writing of the file, writing it in a single encoding -- UTF-8 -- could make reading it a lot easier.

Re: dynamically detect code page
by graff (Chancellor) on Feb 24, 2006 at 03:16 UTC
    Having a single log file that mixes lines with different non-unicode encodings is a very bad idea -- whoever came up with that idea should be asked to look elsewhere for employment (or simply introduced to others as "the one who made that really stupid mistake with the log files").

    But, as noted in another reply, it looks like there is other evidence in each line about the language of origin, so you can make easy educated guesses about which character encoding is appropriate on a line-by-line basis. To the extent that this is true, your processing of the log would look like this:

    use strict;
    use Encode;

    my %encoding = (
        JAP => 'shiftjis',
        RUS => 'cp1251',
        # and so on...
        # (figure out the actual encoding names for each "clue")
    );

    binmode STDOUT, ":utf8";

    while (<>) {
        my $decoded = '';
        for my $lang ( keys %encoding ) {
            if ( /$lang/ ) {
                # might need to be careful about how to match for language
                # e.g. split into fields with Text::xSV, and test one field
                $decoded = decode( $encoding{$lang}, $_ );
                last;
            }
        }
        if ( $decoded eq '' ) {
            warn "no language discernable at line $.\n";
            $decoded = decode( 'cp1252', $_ );    # assume Latin-1 as a default
        }
        print $decoded;
    }
    That should put most of the data into a single, consistent, portable encoding (utf8). For lines whose actual language is misidentified (or unidentifiable), you'll probably see strings like ",????," where the question marks indicate characters that the Encode module could not convert (because it was told to use the wrong "legacy" code page). Definitely study the man page for Encode.
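    One way to flag such lines programmatically (a sketch, with 'shiftjis' standing in for whichever encoding you guessed): the default, non-croaking decode() substitutes the Unicode replacement character U+FFFD for bytes it cannot map, so you can test the decoded string for it:

    use strict;
    use warnings;
    use Encode qw(decode);

    while ( my $raw = <> ) {
        # the default fallback substitutes U+FFFD for bytes the code page cannot map
        my $decoded = decode( 'shiftjis', $raw );
        warn "line $. was probably not Shift-JIS\n" if $decoded =~ /\x{FFFD}/;
    }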

    While you're at it, it might make sense to divvy the lines into separate output files, according to language. What would be the point of including German entries in a log file that the Japanese are going to read, or vice-versa?

    (For that matter, it would simplify things for you quite a bit if you could use those language cues in the ASCII content to start by splitting the log into separate files by language; then the character encodings will cease to be an issue.)
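    A rough sketch of that split (the field position for the machine name and the output file names are assumptions based on the sample lines; a real CSV parser such as Text::xSV would be safer than split, since some fields are quoted and contain commas):

    use strict;
    use warnings;

    my %fh;    # one output handle per language prefix
    while ( my $line = <> ) {
        # assume the 5th comma-separated field holds the machine name, e.g. "JAP-LINKSYS3-0"
        my $machine = ( split /,/, $line )[4];
        my ($lang)  = defined $machine ? $machine =~ /^([A-Z]+)-/ : ();
        $lang = 'UNKNOWN' unless defined $lang;
        unless ( $fh{$lang} ) {
            open my $out, '>', lc($lang) . '.log'
                or die "cannot open output file for $lang: $!";
            $fh{$lang} = $out;
        }
        print { $fh{$lang} } $line;
    }
    close $_ for values %fh;
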

Re: dynamically detect code page
by spiritway (Vicar) on Feb 24, 2006 at 04:56 UTC

    What you're proposing is to try to fix something that is needlessly broken to begin with. It would be better if you could avoid this problem in the first place, by simply logging different language messages to different logs, possibly bearing such catchy file names as "japanese.log", "german.log", and so on.

    If this isn't feasible (your PHB wrote the script and wants you to use it), then you will probably need to pull this log apart, line-by-line, like so:

    if ($line =~ /JAP/) {
        # Logic to print to the Japanese file...
    }
    elsif ($line =~ /GER/) {
        # Logic to print to the German file...
    }
    else {
        # Logic to print to the English file...
    }

    It seems to me that you probably don't need to recreate every line in its original language at the same time (for the same user). Chances are that most people won't speak (or read) all the languages that are being used.

      Hi, thanks for all the replies. The GER- and JAP- prefixes are just part of the
      computer names; they are user-defined and have no relationship to the actual encoding of the log line. I have
      no control over the log file because it comes as output from some other program. .... :<
Re: dynamically detect code page
by Anonymous Monk on Feb 24, 2006 at 13:06 UTC

    Are Japanese (SJIS) and German (cpXXXX) the only languages?

    Since SJIS is a multibyte encoding created by MicroSloth to be semi-compatible with one of their old code pages, it is a hack-and-slash of the standard JIS/EUC-style multibyte encodings. Because of this, SJIS is relatively easy to detect: if you take a random-ish string of 8-bit characters, treat it as SJIS and convert it to UTF-8, you'll most likely end up with invalid characters. A string of 7-bit characters in SJIS is exactly the same as 7-bit ASCII.

    So take each line one at a time. If there are no 8-bit characters, the line could be in any of these encodings, but it doesn't matter because all the characters are in the 7-bit ASCII range.

    If there are 8-bit characters, pretend the line is SJIS and try to convert it to UTF-8. If there are errors, the line is most likely not SJIS and is instead German (or some other single-byte code page). If there are no errors, it's most likely SJIS Japanese.
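    Something like this sketch (the cp1252 fallback for the single-byte case is only a placeholder):

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    sub guess_line_encoding {
        my ($bytes) = @_;
        # no high-bit bytes: plain 7-bit ASCII, so the code page doesn't matter
        return 'ascii' unless $bytes =~ /[\x80-\xFF]/;
        my $copy = $bytes;    # decode() with FB_CROAK may modify its input
        my $ok = eval { decode( 'shiftjis', $copy, FB_CROAK ); 1 };
        # decoded cleanly as Shift-JIS -> probably Japanese; otherwise assume
        # some single-byte Windows code page (cp1252 here is only a placeholder)
        return $ok ? 'shiftjis' : 'cp1252';
    }

    while ( my $line = <> ) {
        printf "line %d: %s\n", $., guess_line_encoding($line);
    }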

    As for telling apart other single-byte code pages... no idea.

      Japanese and German are not the only ones; there are also Chinese (traditional/simplified), Spanish, French, Polish (basically
      all the high-ASCII and DBCS code pages that Windows supports). What happens is that the log files are created by the application on the
      native Windows OS using whatever default locale is on that platform; those logs are then forwarded to and collected by some
      other machine and saved into one big file. Again, the format of the big file depends on the native locale of the machine that
      does the log collection. Thanks
        Actually... just wondering... would it work if I take in
        each line and transform every line to UTF-8, and then use
        UTF-8 operations in the string regexes? Would that work? I'm new to i18n manipulation in Perl.