Calling the code you posted "successful" at stripping out the XML tags stretches the definition of "success" a bit, to cover results like "yeah, some of the data is missing too, but all the XML tags are gone, so that's success!"
In this line of your code:

    s/^.*(<.*>)//g;

the final "g" is superfluous -- it never has any effect -- because the pattern is anchored with "^" and both ".*" parts are greedy, so a single match removes everything from the start of the string through the last ">". So a line of data like this:

    <tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>

will need just one application of your regex to end up as an empty string (regardless of whether it's UTF-16 or whatever). Maybe you think you know your particular XML data well enough that your chosen heuristic will work okay. But someday you'll get some XML data that will break it. That's why you should use an XML module to handle XML data; code based on a real parser will work on any well-formed XML data.
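To see the greedy match in action, here is a small sketch that runs the substitution on the sample line. The variable names are mine, and the per-tag pattern is included only for comparison; it is still a heuristic, not a substitute for a parser.

```perl
use strict;
use warnings;

# Sample line from the discussion above.
my $line = '<tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>';

# The substitution from the original post: "^.*" is anchored and greedy,
# so one pass consumes everything through the last ">".
( my $greedy = $line ) =~ s/^.*(<.*>)//g;

# For comparison only: remove one tag at a time. This keeps the text here,
# but it can still go wrong on comments, CDATA sections, or ">" inside
# attribute values.
( my $per_tag = $line ) =~ s/<[^>]*>//g;

print "greedy:  [$greedy]\n";
print "per-tag: [$per_tag]\n";
```

The first line of output shows empty brackets: the data went out along with the tags.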
There might be better solutions than the XML::Simple::XMLin method suggested above; for example, you could use XML::Parser like this if you just want to strip off the tagging:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    die "Usage: $0 file.xml\n" unless ( @ARGV == 1 and -f $ARGV[0] );

    # Write output as UTF-8, so that any non-ascii text prints cleanly
    # (without "Wide character" warnings).
    binmode STDOUT, ':encoding(UTF-8)';

    my $parser = XML::Parser->new(
        Handlers         => { Char => \&print_chars },
        ProtocolEncoding => 'UTF-16',
    );
    $parser->parsefile( $ARGV[0] );

    # Char handler: called with ( $expat, $string ); print just the text.
    sub print_chars { print pop; }

That's all there is to it. Notice the part that says what sort of input character encoding to use. As the data file gets read in, it is converted internally from utf-16 to utf8, and will be printed as utf8 -- and if there happen to be no "wide" (non-ascii) characters in your data files, conversion to utf8 really means conversion to ascii, because ascii is a proper subset of the utf8 character set and every ascii character is encoded as the same single byte in utf8.
As for the difference between utf-16 and utf8, it's really a very simple matter for data that consists only of characters in the ascii range: for every single-byte ascii character, attach a null high byte (e.g. \x40 becomes \x{0040}) and voila! the result is the corresponding utf-16 character code.
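A quick way to check the "null high byte" point is to encode a short ascii string with the core Encode module and dump the bytes as hex. This is only an illustration; the sample string is arbitrary, and the explicit "BE"/"LE" encoding names are used so that no byte-order mark is added.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii = 'A@';    # \x41 \x40 as single-byte ascii

# Big-endian: the null high byte comes first in each 16-bit unit.
my $be = encode( 'UTF-16BE', $ascii );

# Little-endian (common on Windows): the null byte comes second.
my $le = encode( 'UTF-16LE', $ascii );

print unpack( 'H*', $be ), "\n";    # 00410040
print unpack( 'H*', $le ), "\n";    # 41004000
```

Which of the two byte orders you actually have depends on how the file was written, which is one more reason to let an encoding layer sort it out.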
The reason you were failing to get rid of the unwanted high-bytes, I think, is that you mistakenly thought these were spaces instead of null bytes. (I think the Windows "MS-DOS Prompt" window app normally replaces non-displayable character codes -- or at least null bytes -- with spaces.)
Anyway, it's better to use Perl's encoding tools for doing character conversions, just like it's better to use XML modules for parsing XML.
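For the plain character-conversion case, a minimal sketch with the core Encode module could look like this. The input bytes are made up here (little-endian with a byte-order mark, as many Windows tools write them); with a real file you would read the bytes from disk instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: a BOM followed by UTF-16LE data.
my $bytes = "\xFF\xFE" . encode( 'UTF-16LE', "plain text\n" );

# Plain 'UTF-16' (no LE/BE suffix) uses the BOM to pick the byte order
# and drops the BOM from the decoded string.
my $text = decode( 'UTF-16', $bytes );

# Print as UTF-8; for pure-ascii content that is the same as ascii.
binmode STDOUT, ':encoding(UTF-8)';
print $text;
```

When reading from a file, the same conversion can be had by opening it with an encoding layer, e.g. open( my $fh, '<:encoding(UTF-16)', $file ), so that every line you read is already decoded.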
In reply to Re: Decoding UTF-16 to ASCII
by graff
in thread Decoding UTF-16 to ASCII
by dbrock