in reply to how do I check encoding before opening FILEHANDLE

I'm not familiar with 5.6.x versions, but if you're stuck with 5.6.4, then I'm guessing that you don't have access to the PerlIO layers or the Encode module, which make it very easy to handle all forms of Unicode (and most legacy character sets as well). Maybe you don't even have the "U" data type in your version of pack/unpack (for handling Unicode characters).

So, given that limitation, my suggestion would be to try to determine whether the UTF16 files always start with a byte-order mark. On a Windows (little-endian) box, the UTF16 will doubtless be little-endian, and the byte-order mark, if present, will be a 16-bit unsigned integer with the value 0xfeff (stored on disk as the two bytes FF FE).
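To make the byte order concrete, here is a minimal sketch of what the BOM value 0xfeff looks like as raw bytes in each byte order. It relies only on the 'v' (little-endian) and 'n' (big-endian) pack formats, which I believe older perls support as well; the variable names are just for illustration.

```perl
use strict;

# The BOM is the 16-bit value 0xfeff; its byte order on disk
# tells you the byte order of the rest of the file.
my $bom_le = pack 'v', 0xfeff;    # little-endian: bytes FF FE
my $bom_be = pack 'n', 0xfeff;    # big-endian:    bytes FE FF

printf "UTF-16LE BOM: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bom_le;
printf "UTF-16BE BOM: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bom_be;

# Reading a little-endian BOM with the big-endian format
# yields the byte-swapped value 0xfffe instead of 0xfeff.
printf "LE BOM read as big-endian: 0x%04x\n", unpack 'n', $bom_le;
```

So if the first two bytes unpack to 0xfffe rather than 0xfeff, the file is most likely UTF-16 in the opposite byte order from the one you assumed.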

Any UTF16_LE file that starts with a byte-order mark will pass the following test:

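    my $bom;
    open( IN, $filename ) or die "can't open $filename: $!";
    binmode IN;    # raw bytes, no CRLF translation on Windows
    read IN, $bom, 2;
    # 'v' = 16-bit little-endian, whatever the host's byte order
    my $bomval = length($bom) == 2 ? unpack( 'v', $bom ) : 0;
    if ( $bomval == 0xfeff ) {
        # this is bound to be a utf16 file --
        # you've already read the bom, so just move on and read the data
    }
    close IN;
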
If your data files don't start with a BOM, then maybe you can determine whether they contain any UTF16 characters in the ASCII range (these will have a high-byte of zero). In plain ASCII files and UTF8 files, you virtually never see null bytes; but to the extent that UTF16 files contain characters in the ASCII range, every other byte is null. So lacking a BOM, count null bytes:
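
    my $size = -s $filename;
    $size = 128 if $size > 128;
    my $test;
    open( IN, $filename ) or die "can't open $filename: $!";
    binmode IN;    # raw bytes, no CRLF translation on Windows
    read IN, $test, $size;
    seek IN, 0, 0;    # rewind so the real read starts at the top
    my @bytes = unpack 'C*', $test;
    my $nullhibytes = 0;
    for ( my $i = 0; $i + 1 < @bytes; $i += 2 ) {
        # count pairs: printable ASCII low byte (0x20-0x7e), null high byte
        $nullhibytes++
            if ( $bytes[$i+1] == 0 and $bytes[$i] >= 0x20 and $bytes[$i] <= 0x7e );
    }
    if ( $nullhibytes > 8 ) {
        # this is probably a utf16 file (if it's text at all)
    }
    close IN;
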
As for handling the XML tags, well, I'm not sure I understand what you're doing. But if your log files don't really contain character data outside the ASCII range (i.e. half the bytes in each file are null), then I'd say just strip out the null bytes and use XML::Parser or XML::Simple in the normal way.
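
The null-stripping idea could be sketched roughly as follows, assuming the data is UTF-16LE and really holds nothing outside the ASCII range. The sub name utf16le_ascii_to_bytes is made up for illustration. Instead of deleting every null byte blindly, it walks the data in 2-byte units and dies on any unit above 0x7f, so it fails loudly rather than silently mangling non-ASCII text.

```perl
use strict;

# Hypothetical helper: turn UTF-16LE data that contains only
# ASCII-range characters into a plain single-byte string.
sub utf16le_ascii_to_bytes {
    my ($raw) = @_;
    $raw =~ s/^\xff\xfe//;    # drop a leading little-endian BOM, if any
    die "odd byte count, not UTF-16\n" if length($raw) % 2;
    my @units = unpack 'v*', $raw;    # 16-bit little-endian code units
    for my $u (@units) {
        die sprintf( "non-ASCII code unit 0x%04x\n", $u ) if $u > 0x7f;
    }
    return pack 'C*', @units;    # one byte per character
}

# Demo on an in-memory string rather than a real log file.
my $utf16 = "\xff\xfe" . pack( 'v*', map { ord($_) } split //, '<log>ok</log>' );
my $ascii = utf16le_ascii_to_bytes($utf16);
print "$ascii\n";
```

The resulting string can then be handed to the parser as ordinary text. One caveat: if the file's XML declaration says encoding="UTF-16", the parser may object once the data is single-byte, so that declaration might need adjusting as well.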

(Are you not able to install current versions of the XML modules, for the same reason you can't use perl 5.8.x? The XML::Parser version I have allows for reading UTF16 data straight from disk, just by setting an initial parameter for the parser object.)

Re^2: how do I check encoding before opening FILEHANDLE
by dbrock (Sexton) on Feb 18, 2005 at 20:31 UTC
    Thank you... I will try this... As for the XML tags (decoding UTF-16 to ASCII), I have attempted using XML::Parser, but I noticed that my extracted text is placed inside a %Hash... The rest of my script processes it from an @Array... DBrock...