how do I check encoding before opening FILEHANDLE

dbrock has asked for the wisdom of the Perl Monks concerning the following question:

Hello. To summarize: I have written a script that opens UTF-16 .xml files, strips out the XML tags, and places the remaining text into an @array for parsing. The XML files come from a backup software tool, which writes its XML logfiles in both UTF-16 and UTF-8. The files are in the same directory and use the same filename convention, and I need to skip the UTF-8 ones.

I can already manipulate the UTF-16 XML data as I want. However, when I process the source directory, the UTF-8 XML files are there too, and my open FILEHANDLE call fails with a :BOM error when it reaches one of them.

In the Chatter Box I asked how to discover the encoding of a file before opening the filehandle. I have tried next unless '-B $file'; but this does not work. I was pointed to a module (File::BOM), but this module is not supported on the Intel platform (perl 5.6.4 activestate).

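if ( $logfile =~ /.+\.xml$/ ) {
  # Attempted UTF-8 skip. The quoted string is always true,
  # so this test never skips a file.
  next unless '-B $logfile';
  open(XMLFILE, '<:encoding(utf16)', $logfile) or die "Can't open $logfile: $!";
  while (<XMLFILE>) {
    s/^.*(<.*>)//g;   # strip the XML tags
    s/\r//g;          # strip carriage returns
    s/^\s//g;         # strip one leading whitespace character
    push @txtfile, $_;
  } # while XML loop
  close(XMLFILE);     # close once, after the whole file has been read
} # if XML loop
print @txtfile; # for debug only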

My attempt, '-B $logfile';, evidently does not tell me the difference between UTF-16 and UTF-8. Since I only want to process the UTF-16 .xml files, I need help with the syntax to identify the UTF-8 files and skip them. I know that I could use XML::Simple or XML::Parser, but I am attempting to use regexes to accomplish this. This if statement will basically add functionality to an existing script without my writing a whole new one.

Thank you for any help that you may provide...

DBrock...

Re: how do I check encoding before opening FILEHANDLE
by gaal (Parson) on Feb 17, 2005 at 20:10 UTC

Also: you can change the encoding of an open filehandle using the three-arg form of binmode.

So if you know the algorithm to figure out the encoding, you can open in byte mode, check the data, and binmode with the appropriate encoding layer.
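
As a rough sketch of that open/check/binmode sequence, assuming the UTF-16 files always begin with a byte-order mark (the thread does not confirm this). The helper name open_if_utf16 is made up for illustration, and the ':raw' and ':encoding' layers need a perl with PerlIO layers (5.8 or later):

```perl
use strict;
use warnings;

# Hypothetical helper: returns a read handle with a UTF-16 layer applied,
# or nothing when the file does not start with a UTF-16 byte-order mark.
sub open_if_utf16 {
    my ($path) = @_;

    # Byte mode: no decoding and no CRLF translation.
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";

    # Peek at the first two bytes.
    my $got = read($fh, my $bom, 2);
    unless (defined $got && $got == 2
            && ($bom eq "\xFF\xFE" || $bom eq "\xFE\xFF")) {
        close($fh);
        return;
    }

    # Rewind so the encoding layer sees the BOM and picks the byte order.
    seek($fh, 0, 0) or die "Can't rewind $path: $!";
    binmode($fh, ':encoding(UTF-16)') or die "Can't set layer on $path: $!";
    return $fh;
}
```

In the original loop this could replace both the -B test and the open, along the lines of: my $fh = open_if_utf16($logfile) or next;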

Re^2: how do I check encoding before opening FILEHANDLE

by PodMaster (Abbot) on Feb 17, 2005 at 21:37 UTC

"So if you know the algorithm to figure out the encoding, you can open in byte mode, check the data, and binmode with the appropriate encoding layer."

Encode::Guess

<?xml version="1.0" encoding="ISO-8859-1"?>
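
A minimal sketch of how Encode::Guess might be applied here, assuming the Encode distribution is installed (it ships with perl 5.8 and later). guess_encoding() is the function the module exports; the helper names guessed_encoding_name and head_bytes are made up for illustration. As far as I understand the module, data that starts with a UTF-16 byte-order mark is reported as UTF-16, while anything it cannot pin down comes back as an error string rather than an object:

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: name of the guessed encoding for a chunk of raw
# bytes, or undef when Encode::Guess cannot settle on a single answer.
sub guessed_encoding_name {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes);   # default suspects plus BOM detection
    return ref($enc) ? $enc->name : undef;
}

# Hypothetical helper: the first 512 bytes of a file, undecoded.
sub head_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    read($fh, my $buf, 512);
    close($fh);
    return $buf;
}
```

A file could then be skipped with something like: next unless (guessed_encoding_name(head_bytes($logfile)) || '') =~ /^UTF-16/;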

Re^3: how do I check encoding before opening FILEHANDLE

by gaal (Parson) on Feb 18, 2005 at 08:41 UTC

Well, it's easy but not utterly trivial, because if you open the file in the wrong encoding and compare strings or try a regexp, the comparison will fail. (And if you claim utf8 and it isn't, you have to catch the exception that will be raised.)
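
To illustrate the exception-catching part, here is a small sketch that tries a strict UTF-8 decode inside eval. It assumes the Encode module is available; the helper name is_valid_utf8 is made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may modify its input when CHECK is set
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}
```

A UTF-16 file that starts with a byte-order mark should fail this test right away, since the bytes FF and FE never occur in valid UTF-8. Pure ASCII passes, so the check only says "not UTF-8"; it does not prove a file is UTF-16.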

Re: how do I check encoding before opening FILEHANDLE
by cowboy (Friar) on Feb 17, 2005 at 20:25 UTC

eval {
 open(XMLFILE, '<:encoding(utf16)', $logfile) or die $!;
};
if ($@) {
  # whatever will match your error message,
  # just skip the file.
  next if $@ =~ /^:BOM error/;

  # unexpected message, bail out.
  die $@;
}

Re^2: how do I check encoding before opening FILEHANDLE

by dbrock (Sexton) on Feb 17, 2005 at 21:10 UTC

I have tried this approach, but so far no luck. Thank you for your time. DBrock...

Re: how do I check encoding before opening FILEHANDLE
by graff (Chancellor) on Feb 18, 2005 at 08:45 UTC

So, given that limitation, my suggestion would be to try to determine whether the UTF-16 files always start with a byte-order mark. On a Windows (little-endian) box, the UTF-16 will doubtless be little-endian, and the byte-order mark, if present, will be a 16-bit unsigned integer with the value 0xfeff (stored on disk as the two bytes FF FE).

Any UTF-16LE file that starts with a byte-order mark will pass the following test:

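my $bom = '';
open IN, '<', $filename or die "Can't open $filename: $!";
binmode IN;    # read raw bytes, with no CRLF translation
read IN, $bom, 2;

# 'v' unpacks a little-endian 16-bit value regardless of the host's byte order
my $bomval = unpack 'v', $bom;

if ( defined $bomval and $bomval == 0xfeff ) {
   # this is bound to be a utf16 file --
   # you've already read the bom, so just move on and read the data
}
close IN;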

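# Heuristic: sample up to the first 128 bytes and count printable ASCII
# characters whose following (high) byte is null.
my $size = -s $filename;
$size = 128 if $size > 128;

my $test;
open IN, '<', $filename or die "Can't open $filename: $!";
binmode IN;    # read raw bytes, with no CRLF translation
read IN, $test, $size;
seek IN, 0, 0;

my @bytes = unpack 'C*', $test;
my $nullhibytes = 0;
for ( my $i = 0; $i < $#bytes; $i += 2 ) {
    # low byte is printable ASCII (space through tilde), high byte is zero
    $nullhibytes++ if ( $bytes[$i+1] == 0 and $bytes[$i] >= 0x20 and $bytes[$i] <= 0x7e );
}

if ( $nullhibytes > 8 ) {
    # this is probably a utf16 file (if it's text at all)
}
close IN;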

(Are you not able to install current versions of the XML modules, for the same reason you can't use perl 5.8.x? The XML::Parser version I have allows for reading UTF16 data straight from disk, just by setting an initial parameter for the parser object.)