
The previous two replies, taken together, provide the right answers for doing both XML tag removal and character conversion out of UTF-16. But I think it's important to draw attention to a couple more details, by way of explanation.

To say that the code you posted is "successful" at stripping out the XML tags is to stretch the definition of "success" a bit, to include results like "yeah, some of the data is missing too, but all the XML tags are gone, so that's success!"

In this line of your code:

    s/^.*(<.*>)//g;

the final "g" is superfluous -- it never has any effect -- because the pattern is anchored at the start of the string and both ".*" parts are greedy, so a single match removes everything from the beginning of the string through the last ">". So a line of data like this:

    <tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>

will need just one application of your regex to end up as an empty string (regardless of whether it's UTF-16 or whatever). Maybe you think you know your particular XML data well enough that your chosen heuristic will work okay. But someday you'll get some XML data that will break it. That's why you should use an XML module to handle XML data; code based on a real XML parser will work on any well-formed XML.
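To see the effect for yourself, here is a minimal sketch that applies the posted regex to the sample line above and reports what is left. The variable names are mine, chosen only for illustration.

```perl
use strict;
use warnings;

# The sample line from the discussion above.
my $line = '<tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>';

# Apply the regex from the original post to a copy of the line.
# s///g returns the number of substitutions it performed.
my $stripped = $line;
my $count    = ( $stripped =~ s/^.*(<.*>)//g );

print "substitutions made: $count\n";
print "characters left:    ", length($stripped), "\n";
```

The data between the tags disappears along with the tags, and the "g" modifier never gets a second match to work on.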

There might be better solutions than the XML::Simple::XMLin method suggested above; for example, you could use XML::Parser like this if you just want to strip off the tagging:

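#!/usr/bin/perl

use strict;
use warnings;
use XML::Parser;

die "Usage: $0 file.xml\n" unless ( @ARGV == 1 and -f $ARGV[0] );

# Print the parsed text as UTF-8, without "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';

my $parser = XML::Parser->new( Handlers => { Char => \&print_chars },
                               ProtocolEncoding => 'UTF-16',
                             );

$parser->parsefile( $ARGV[0] );

# Char handler: called with the parser object and a chunk of character data.
sub print_chars
{
    my ( $expat, $text ) = @_;
    print $text;
}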

That's all there is to it. Notice the ProtocolEncoding parameter, which says what input character encoding to expect. As the data file gets read in, it is converted internally from UTF-16 to UTF-8, and will be printed as UTF-8. If there happen to be no "wide" (non-ASCII) characters in your data files, conversion to UTF-8 really means conversion to ASCII, because every ASCII character is encoded in UTF-8 as the same single byte.

As for the difference between UTF-16 and UTF-8, it's really a very simple matter for data that consists only of characters in the ASCII range: for every single-byte ASCII character, attach a null high byte (e.g. \x40 becomes \x{0040}) and voila! the result is the corresponding UTF-16 character code. (Whether the null byte is stored before or after the ASCII byte depends on the byte order: before in big-endian UTF-16, after in little-endian.)

The reason you were failing to get rid of the unwanted high-bytes, I think, is that you mistakenly thought these were spaces instead of null bytes. (I think the Windows "MS-DOS Prompt" window app normally replaces non-displayable character codes -- or at least null bytes -- with spaces.)
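A small sketch with the core Encode module makes the byte layout visible. It assumes nothing about your data; the "@" character and the string "abc" are just stand-ins, and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = '@';    # ASCII code \x40

# unpack 'H*' renders a byte string as hex digits.
my $utf8    = unpack 'H*', encode( 'UTF-8',    $char );
my $utf16be = unpack 'H*', encode( 'UTF-16BE', $char );
my $utf16le = unpack 'H*', encode( 'UTF-16LE', $char );

print "UTF-8: $utf8  UTF-16BE: $utf16be  UTF-16LE: $utf16le\n";

# Deleting spaces (\x20) does nothing to UTF-16 text, because the
# extra bytes are nulls (\x00), not spaces.
my $bytes = encode( 'UTF-16LE', 'abc' );
( my $no_spaces = $bytes ) =~ tr/ //d;
( my $no_nulls  = $bytes ) =~ tr/\0//d;

print "after removing spaces: ", length($no_spaces), " bytes\n";
print "after removing nulls:  ", length($no_nulls),  " bytes\n";
```

Deleting the nulls by hand only happens to work for pure ASCII content, which is one more reason to let an encoding layer do the conversion.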

Anyway, it's better to use Perl's encoding tools for doing character conversions, just like it's better to use XML modules for parsing XML.
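For the character conversion on its own (no XML involved), something like the following sketch would do. It uses the core Encode module, builds its UTF-16 input in memory as a stand-in for the contents of a real file, and assumes the data starts with a byte-order mark, which plain 'UTF-16' decoding requires.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the raw bytes of a UTF-16 file. Encode's plain 'UTF-16'
# writes a byte-order mark followed by big-endian code units.
my $utf16_bytes = encode( 'UTF-16', "plain ascii text\n" );

# decode() reads the byte-order mark and returns a Perl character string.
my $text = decode( 'UTF-16', $utf16_bytes );

# Re-encode for output. FB_CROAK makes encode() die if a character has
# no ASCII equivalent; LEAVE_SRC keeps $text unmodified.
my $ascii = encode( 'ascii', $text,
                    Encode::FB_CROAK() | Encode::LEAVE_SRC() );

print $ascii;
printf "%d bytes in, %d bytes out\n", length $utf16_bytes, length $ascii;
```

If your data might contain non-ASCII characters, encoding to 'UTF-8' instead of 'ascii' avoids the die.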

In reply to Re: Decoding UTF-16 to ASCII by graff
in thread Decoding UTF-16 to ASCII by dbrock
