
I'm not familiar with 5.6.x versions, but if you're stuck with 5.6.4, then I'm guessing that you don't have access to the PerlIO layers or the Encode module, which make it very easy to handle all forms of Unicode (and most legacy character sets as well). Maybe you don't even have the "U" data type in your version of pack/unpack (for handling Unicode characters).

So, given that limitation, my suggestion would be to try to determine whether the UTF16 files always start with a byte-order mark (BOM). On a Windows (little-endian) box, the UTF16 will doubtless be little-endian, and the byte-order mark, if present, will be a 16-bit unsigned integer with the value 0xfeff (stored on disk as the two bytes 0xff 0xfe).

Any UTF16_LE file that starts with a byte-order mark will pass the following test:

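my $bom;
# parenthesize open's arguments so that "or die" applies to open itself
open( IN, $filename ) or die "Can't open $filename: $!";
binmode IN;    # raw bytes: no CRLF translation on Windows
read IN, $bom, 2;

# 'v' unpacks a little-endian 16-bit value regardless of the host's byte order
my $bomval = ( length($bom) == 2 ) ? unpack( 'v', $bom ) : 0;

if ( $bomval == 0xfeff ) {
   # this is bound to be a utf16 file --
   # you've already read the bom, so just move on and read the data
}
close IN;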

If your data files don't start with a BOM, then maybe you can determine whether they contain any UTF16 characters in the ASCII range (these will have a high-byte of zero). In plain ASCII files and UTF8 files, you virtually never see null bytes; but to the extent that UTF16 files contain characters in the ASCII range, every other byte is null. So lacking a BOM, count null bytes:

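my $size = -s $filename;
$size = 128 if $size > 128;

my $test;
open( IN, $filename ) or die "Can't open $filename: $!";
binmode IN;    # raw bytes: no CRLF translation on Windows
read IN, $test, $size;
$size = length $test;    # number of bytes actually read
seek IN, 0, 0;

my @bytes = unpack 'C*', $test;
my $nullhibytes = 0;
# little-endian: the low byte comes first, followed by the (null) high byte
for ( my $i = 0; $i + 1 < $size; $i += 2 ) {
    # @bytes holds numbers, so compare numerically against printable ASCII
    $nullhibytes++
        if ( $bytes[$i+1] == 0 and $bytes[$i] >= 0x20 and $bytes[$i] <= 0x7e );
}

if ( $nullhibytes > 8 ) {
    # this is probably a utf16 file (if it's text at all)
}
close IN;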
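If you want both tests in one place, here is a rough sketch of how they might be combined into a single check on a buffer you have already read. The sub name looks_like_utf16le is made up for illustration, and the cutoff of 8 is just the same arbitrary threshold used above; it only uses pack/unpack templates ('v' and 'C') that should be available in 5.6.

```perl
use strict;

# Hypothetical helper (not part of any module): guess whether a buffer holding
# the first bytes of a file looks like little-endian UTF-16.
sub looks_like_utf16le {
    my ($buf) = @_;
    return 0 if length($buf) < 2;

    # A leading BOM (bytes 0xff 0xfe) settles it.
    return 1 if unpack( 'v', $buf ) == 0xfeff;

    # Otherwise count printable-ASCII low bytes followed by a null high byte.
    my @bytes = unpack 'C*', $buf;
    my $hits  = 0;
    for ( my $i = 0; $i + 1 < @bytes; $i += 2 ) {
        $hits++
            if $bytes[ $i + 1 ] == 0
            and $bytes[$i] >= 0x20
            and $bytes[$i] <= 0x7e;
    }
    return $hits > 8 ? 1 : 0;    # arbitrary cutoff, an assumption to tune
}

# Usage: build a small BOM-less UTF-16LE sample from an ASCII string.
my $sample = pack 'v*', unpack 'C*', 'hello, world';
print looks_like_utf16le($sample) ? "utf16\n" : "not utf16\n";
```

This is only a heuristic: a short file, or one with mostly non-ASCII text, can still slip past the null-byte count.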

As for handling the XML tags, well, I'm not sure I understand what you're doing. But if your log files don't really contain character data outside the ASCII range (i.e. half the bytes in each file are null), then I'd say just strip out the null bytes and use XML::Parser or XML::Simple in the normal way.
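Stripping the null bytes could look something like the sketch below. The sub name utf16le_ascii_to_bytes is hypothetical, and it assumes the data really is ASCII-only UTF16_LE; any character outside the ASCII range would be mangled. If the file carries an XML declaration that names UTF-16 as its encoding, that declaration would probably need adjusting as well before handing the result to a parser.

```perl
use strict;

# Hypothetical helper: turn ASCII-only UTF-16LE bytes into plain ASCII.
# Assumes every character is in the ASCII range; anything else gets mangled.
sub utf16le_ascii_to_bytes {
    my ($data) = @_;
    $data =~ s/^\xff\xfe//;    # drop a leading BOM, if there is one
    $data =~ tr/\0//d;         # delete the null high bytes
    return $data;
}

# Usage: a made-up log fragment encoded as UTF-16LE with a BOM.
my $utf16 = pack 'v*', 0xfeff, unpack 'C*', '<log>ok</log>';
my $xml   = utf16le_ascii_to_bytes($utf16);
print "$xml\n";
```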

(Are you not able to install current versions of the XML modules, for the same reason you can't use perl 5.8.x? The XML::Parser version I have allows for reading UTF16 data straight from disk, just by setting an initial parameter for the parser object.)

In reply to Re: how do I check encoding before opening FILEHANDLE by graff
in thread how do I check encoding before opening FILEHANDLE by dbrock
