dbrock has asked for the wisdom of the Perl Monks concerning the following question:

Hello. To summarize: I have written a script that opens UTF-16 .xml files, strips out the XML tags, and places the text from the XML into an @array for parsing. The XML files are logs from a backup software tool, which writes its XML logfiles in both UTF-16 and UTF-8 (I need to skip the UTF-8 files). Both kinds of file sit in the same directory and use the same filename convention. I can already manipulate the UTF-16 XML data as I want, but when I process the source directory, my open FILEHANDLE call fails with a :BOM error as soon as it tries to open one of the UTF-8 files. In the Chatter Box I asked: how can I discover the encoding of a file before opening the filehandle? I have tried next unless '-B $file'; but this does not work. I was pointed to a module (File::BOM), however this module is not supported on the intel platform (perl 5.6.4 activestate).
    if( $logfile =~ /.+\.xml/){
        next unless '-B $logfile';
        open(XMLFILE, '<:encoding(utf16)', $logfile)or die "Can't Open:$!";
        while(<XMLFILE>) {
            $_ =~ s/^.*(<.*>)//g;
            $_ =~ s/\r//g;
            $_ =~ s/^\s//g;
            push @txtfile,$_;
            close(XMLFILE);
        }# While XML loop
    }#if XML loop
    print @txtfile;#for debug only
My attempt, '-B $logfile';, evidently does not tell me the difference between UTF-16 and UTF-8. Since I only want to process the UTF-16 .xml files, I need help with the syntax to identify the UTF-8 files and skip them. I know that I could use XML::Simple or XML::Parser, but I am attempting to use regexes to accomplish this. This IF statement will basically be adding functionality to an existing script without writing a whole new one.

Thank you for any help that you may provide...

DBrock...

Replies are listed 'Best First'.
Re: how do I check encoding before opening FILEHANDLE
by gaal (Parson) on Feb 17, 2005 at 20:10 UTC
    You could peek at the source of the module to see how they do it.

    Also: you can change the encoding of an open filehandle using the three-arg form of binmode.

    So if you know the algorithm for figuring out the encoding, you can open in byte mode, check the data, and binmode with the appropriate encoding layer.
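The open-in-byte-mode, check, then binmode sequence could be sketched roughly as follows. This is a minimal sketch, assuming a perl with PerlIO layers (5.8 or later) and assuming that the UTF-16 files always begin with a byte-order mark; the helper name open_utf16_or_skip is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a filehandle that decodes UTF-16 if the
# file starts with a UTF-16 BOM, or undef otherwise (e.g. a UTF-8 file).
sub open_utf16_or_skip {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);                      # byte mode: no translation at all

    my $got = read($fh, my $head, 2);  # peek at the first two bytes
    die "Can't read $path: $!" unless defined $got;

    my $layer = $head eq "\xFF\xFE" ? ':encoding(UTF-16LE)'
              : $head eq "\xFE\xFF" ? ':encoding(UTF-16BE)'
              :                       undef;

    unless ($layer) {                  # no UTF-16 BOM: caller should skip
        close $fh;
        return;
    }

    # The BOM has already been consumed, so name the byte order explicitly.
    binmode($fh, $layer) or die "binmode failed on $path: $!";
    return $fh;
}
```

Inside the directory loop, something like `my $fh = open_utf16_or_skip($logfile) or next;` would then skip every file that does not start with a UTF-16 BOM.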

      Encode::Guess can be used to guess the actual encoding, but it shouldn't be necessary (if you're lucky), as the XML file should specify the encoding, as in <?xml version="1.0" encoding="ISO-8859-1"?>
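Reading the declared encoding out of the file could look something like the sketch below. It assumes the XML declaration sits at the very start of the file and within the first 256 bytes; the function name declared_encoding is hypothetical. Because a UTF-16 file stores the declaration itself in UTF-16, the sketch drops the BOM and all NUL bytes before matching, which is a shortcut for ASCII-range markup rather than a real decode.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lower-cased encoding name from the
# XML declaration, or undef if no declaration with an encoding is found.
sub declared_encoding {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);                        # look at raw bytes only
    my $got = read($fh, my $head, 256);
    die "Can't read $path: $!" unless defined $got;
    close $fh;

    # Strip a UTF-16LE, UTF-16BE or UTF-8 byte-order mark if present.
    $head =~ s/^(?:\xFF\xFE|\xFE\xFF|\xEF\xBB\xBF)//;
    # Crude shortcut: removing NULs turns ASCII-range UTF-16 into ASCII.
    $head =~ tr/\0//d;

    if ($head =~ /^<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/) {
        return lc($1);
    }
    return undef;
}
```

A caller could then skip a file with `next unless (declared_encoding($logfile) || '') =~ /^utf-?16/;`, on the assumption that the backup tool writes an accurate declaration.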


        Well, it's easy but not utterly trivial, because if you open the file in the wrong encoding and compare strings or try a regexp, the comparison will fail. (And if you claim utf8 and it isn't, you have to catch the exception that will be raised.)
Re: how do I check encoding before opening FILEHANDLE
by cowboy (Friar) on Feb 17, 2005 at 20:25 UTC
    One way to handle this, if the error is consistent, is to use an eval to trap it.

        eval {
            open(XMLFILE, '<:encoding(utf16)', $logfile) or die $!;
        };
        if ($@) {
            # whatever will match your error message,
            # just skip the file.
            next if $@ =~ /^:BOM error/;
            # unexpected message, bail out.
            die $@;
        }

    (untested code)

    UPDATE: cleaned up the code a little for formatting reasons.
      I have tried this approach, but so far no luck. Thank you for your time... DBrock...
Re: how do I check encoding before opening FILEHANDLE
by graff (Chancellor) on Feb 18, 2005 at 08:45 UTC
    I'm not familiar with 5.6.x versions, but if you're stuck with 5.6.4, then I'm guessing that you don't have access to the PerlIO layers or the Encode module, which make it very easy to handle all forms of unicode (and most legacy character sets as well). Maybe you don't even have the "U" data type in your version of pack/unpack (for handling Unicode characters).

    So, given that limitation, my suggestion would be to try to determine whether the UTF16 files always start with a byte-order mark; on a Windows (little-endian) box, the UTF16 will doubtless be little-endian, and the byte-order mark, if present, will be a 16-bit unsigned integer with the value 0xfeff.

    Any UTF16_LE file that starts with a byte-order mark will pass the following test:


        my $bom;
        open IN, $filename or die $!;
        binmode IN;                      # raw bytes, no CRLF translation
        read IN, $bom, 2;
        my $bomval = unpack 'v', $bom;   # 'v' = 16-bit little-endian on any machine
        if ( defined $bomval and $bomval == 0xfeff ) {
            # this is bound to be a utf16 file --
            # you've already read the bom, so just move on and read the data
        }
        close IN;

    If your data files don't start with a BOM, then maybe you can determine whether they contain any UTF16 characters in the ASCII range (these will have a high-byte of zero). In plain ASCII files and UTF8 files, you virtually never see null bytes; but to the extent that UTF16 files contain characters in the ASCII range, every other byte is null. So lacking a BOM, count null bytes:

        my $size = -s $filename;
        $size = 128 if $size > 128;
        my $test;
        open IN, $filename or die $!;
        binmode IN;
        read IN, $test, $size;
        seek IN, 0, 0;    # rewind so the data can be read from the start
        my @bytes = unpack 'C*', $test;
        my $nullhibytes = 0;
        for ( my $i = 0; $i + 1 < @bytes; $i += 2 ) {
            # little-endian: printable ASCII low byte followed by a null high byte
            $nullhibytes++
                if $bytes[$i+1] == 0 and $bytes[$i] >= 0x20 and $bytes[$i] <= 0x7e;
        }
        if ( $nullhibytes > 8 ) {
            # this is probably a utf16 file (if it's text at all)
        }
        close IN;

    As for handling the XML tags, well, I'm not sure I understand what you're doing. But if your log files don't really contain character data outside the ASCII range (i.e. half the bytes in each file are null), then I'd say just strip out the null bytes and use XML::Parser or XML::Simple in the normal way.
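Stripping the null bytes might be sketched like this. It relies on the assumption stated above, that the logs contain only ASCII-range characters; anything outside that range would be mangled. The helper name utf16_ascii_lines is hypothetical. It uses no PerlIO layers, so it should not depend on a recent perl.

```perl
use strict;
use warnings;

# Hypothetical helper: reads a UTF-16 file whose characters are all in
# the ASCII range and returns its lines as plain ASCII strings.
sub utf16_ascii_lines {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);                          # raw bytes
    local $/;                              # slurp the whole file
    my $data = <$fh>;
    close $fh;
    $data = '' unless defined $data;

    $data =~ s/^(?:\xFF\xFE|\xFE\xFF)//;   # drop a UTF-16 BOM if present
    $data =~ tr/\0//d;                     # drop the null high bytes
    $data =~ s/\r\n?/\n/g;                 # normalise line endings

    return split /^/m, $data;              # lines, newlines kept
}
```

The returned list can be pushed straight into an array, for example `push @txtfile, utf16_ascii_lines($logfile);`, and then run through the existing tag-stripping regexes or handed to an XML module.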

    (Are you not able to install current versions of the XML modules, for the same reason you can't use perl 5.8.x? The XML::Parser version I have allows for reading UTF16 data straight from disk, just by setting an initial parameter for the parser object.)

      Thank you... I will try this. As for the XML tags (decoding UTF-16 to ASCII), I have tried using XML::Parser, but I noticed that my extracted text is placed inside a %Hash, whereas the rest of my script processes it from an @Array... DBrock...