in reply to how do I check encoding before opening FILEHANDLE
So, given that limitation, my suggestion would be to try to determine whether the UTF16 files always start with a byte-order mark (BOM). On a Windows (little-endian) box, the UTF16 will doubtless be little-endian, and the byte-order mark, if present, will be a 16-bit unsigned integer with the value 0xfeff, stored on disk as the two bytes 0xff 0xfe.
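To make the byte order concrete, here is a small sketch (my own illustration, not something from the original question) of how the BOM value 0xfeff is laid out on disk. It relies only on the standard pack/unpack templates: 'v' is always 16-bit little-endian, 'n' is always 16-bit big-endian, and 'S' follows whatever the host machine uses.

```perl
use strict;

# The BOM is the 16-bit value 0xfeff; its on-disk bytes depend on byte order.
my $bom_le = pack 'v', 0xfeff;    # little-endian: bytes FF FE
my $bom_be = pack 'n', 0xfeff;    # big-endian:    bytes FE FF

printf "UTF-16LE BOM bytes: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bom_le;
printf "UTF-16BE BOM bytes: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bom_be;

# Decoding a little-endian BOM with the big-endian template gives 0xfffe,
# i.e. the byte-swapped value, which tells you the assumed order was wrong.
my $swapped = unpack 'n', $bom_le;
printf "LE BOM read as big-endian: 0x%04x\n", $swapped;
```

Using 'v' rather than 'S' when testing for the BOM keeps the check independent of the machine the script happens to run on.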
Any UTF16_LE file that starts with a byte-order mark will pass the following test:

    my $bom;
    open IN, '<', $filename or die $!;
    binmode IN;                       # raw bytes; matters on Windows
    read IN, $bom, 2;
    my $bomval = unpack 'v', $bom;    # 'v' = 16-bit little-endian on any host
    if ( defined $bomval and $bomval == 0xfeff ) {
        # this is bound to be a utf16 file --
        # you've already read the bom, so just move on and read the data
    }
    close IN;

If your data files don't start with a BOM, then maybe you can determine whether they contain any UTF16 characters in the ASCII range (these will have a high byte of zero). In plain ASCII files and UTF8 files, you virtually never see null bytes; but to the extent that UTF16 files contain characters in the ASCII range, every other byte is null. So lacking a BOM, count null bytes:

    my $size = -s $filename;
    $size = 128 if $size > 128;       # the first 128 bytes are enough
    my $test;
    open IN, '<', $filename or die $!;
    binmode IN;
    read IN, $test, $size;
    seek IN, 0, 0;                    # rewind so the data can be read normally
    my @bytes = unpack 'C*', $test;
    my $nullhibytes = 0;
    for ( my $i = 0; $i < $#bytes; $i += 2 ) {
        # low byte is printable ASCII, high byte is null
        $nullhibytes++ if ( $bytes[$i+1] == 0
                            and $bytes[$i] >= 0x20 and $bytes[$i] <= 0x7e );
    }
    if ( $nullhibytes > 8 ) {
        # this is probably a utf16 file (if it's text at all)
    }
    close IN;

As for handling the XML tags, well, I'm not sure I understand what you're doing. But if your log files don't really contain character data outside the ASCII range (i.e. half the bytes in each file are null), then I'd say just strip out the null bytes and use XML::Parser or XML::Simple in the normal way.
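The "strip out the null bytes" step could look something like the sketch below. The helper name utf16le_to_ascii is hypothetical (my own), and it assumes the data really is little-endian UTF16 holding only ASCII-range characters. Rather than deleting nulls blindly, it decodes 16-bit units and dies on anything above 0x7f, so a file that does contain wider characters is not silently mangled.

```perl
use strict;

# Hypothetical helper: turn UTF-16LE data that holds only ASCII-range
# characters into plain one-byte-per-character ASCII.
sub utf16le_to_ascii {
    my ($raw) = @_;
    my @units = unpack 'v*', $raw;                     # 16-bit little-endian units
    shift @units if @units and $units[0] == 0xfeff;    # drop a leading BOM
    for my $u (@units) {
        die sprintf( "non-ASCII code unit 0x%04x\n", $u ) if $u > 0x7f;
    }
    return pack 'C*', @units;                          # one byte per character
}

# Example input: '<a>hi</a>' encoded as UTF-16LE with a leading BOM.
my $utf16 = pack 'v*', 0xfeff, map { ord($_) } split //, '<a>hi</a>';
my $ascii = utf16le_to_ascii($utf16);
print "$ascii\n";
```

In real use the input would be the whole file, read through a filehandle with binmode set, and the returned string would then be handed to the parser.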
(Are you not able to install current versions of the XML modules, for the same reason you can't use perl 5.8.x? The XML::Parser version I have allows for reading UTF16 data straight from disk, just by setting an initial parameter for the parser object.)
Re^2: how do I check encoding before opening FILEHANDLE
by dbrock (Sexton) on Feb 18, 2005 at 20:31 UTC