
Dear Monks,

As part of a larger app, I am fetching a web page, parsing it with HTML::TreeBuilder, and saving both the original and the parsed data in an XML file as CDATA. Before passing the data to TreeBuilder, I convert it to Perl's internal encoding. When exporting the 'original' data, I use an "encoding" attribute on its XML container element to record the encoding of that data.

My problem is that when I try to load the data with LibXML, in some cases (e.g. http://www.w3.org/Press/1998/XSL-WD.html.ja), the parser seems to interpret some foreign character as "END CDATA", and then I get an incredible slew of parser errors. The following code illustrates this by way of example:

#!/usr/bin/perl -w

use strict;

use LWP::UserAgent;
use XML::Generator;
use XML::LibXML;


###############################
## PHASE 1: generate XML file

my $url = 'http://www.w3.org/Press/1998/XSL-WD.html.ja';
my $file = 'test';
my $ua = LWP::UserAgent->new();

my ($success, $response, $content_type, $charset, %encoding_opts);
$success = ($response = $ua->get($url))->is_success();

die "Couldn't fetch URL: '$url'" unless $success;

$content_type = $response->header('Content-Type');
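# Capture in list context so $charset is undef when there is no match
# (reading $1 after a skipped or failed match could pick up a stale value)
($charset) = $content_type =~ /charset\s*=\s*([A-Za-z0-9_\-]+)/i if $content_type;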
 
# HTTP::Message doesn't always seem to recognize Content-Type correctly, override
if ($charset) {
  $encoding_opts{charset} = $charset;
}
 
my $decoded = $response->decoded_content(%encoding_opts);
die "Cannot decode content: ". $@ unless defined $decoded;

my $gen = XML::Generator->new(pretty => 2, conformance => 1);

my $xml = $gen->xml(
  $gen->parsed($gen->xmlcdata($decoded)), 
  $gen->original({encoding => $charset}, $gen->xmlcdata($response->content()))
);


open(my $fh, '>:utf8', $file) or die "Couldn't write to file '$file': $!";
print {$fh} $xml;
close($fh) or die "Couldn't close file '$file': $!";


###############################
## PHASE 2: load it

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);

When I execute, I get the following errors:

test:129: parser error : CData section not finished
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transi
    <ACRONYM lang="en" title="eXtensible Style Language">XSL</ACRONYM> 1.0 ESC$B$N

... 600+ more lines omitted ...

(My terminal can't display Japanese; the 'ESC' above is the character the parser complains about.)
I'm sure I'm missing something obvious, and that it must be entirely possible to include any chunk of data in any encoding in an XML file (even though that may not always be wise). I would be grateful for any help.
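
One possible explanation, offered as a sketch rather than a confirmed diagnosis: ESC$B is the ISO-2022-JP escape sequence, and ESC (0x1B) is a control character that XML 1.0 does not allow anywhere in a document, CDATA sections included. If that is the cause here, the raw bytes cannot be stored literally, and one workaround would be to Base64-encode them before wrapping them in XML. The snippet below uses a made-up byte string in place of $response->content(); the variable names are illustrative only.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical stand-in for $response->content(): a few ISO-2022-JP bytes,
# including the ESC (0x1B) that starts and ends the Japanese run.
my $raw = "XSL 1.0 \x1b\$B\$N\x1b(B";

# XML 1.0 allows only TAB, LF and CR below 0x20; collect any other control bytes.
my @illegal = $raw =~ /([\x00-\x08\x0B\x0C\x0E-\x1F])/g;
printf "illegal control bytes found: %d\n", scalar @illegal;

# Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', all legal in XML.
# The empty second argument suppresses the line breaks encode_base64 adds by default.
my $payload = encode_base64($raw, '');
print "payload: $payload\n";

# A reader of the XML file would reverse the step to recover the exact bytes.
my $restored = decode_base64($payload);
print "round trip ok\n" if $restored eq $raw;
```

With this approach the payload could be stored as ordinary element text instead of CDATA, and the "encoding" attribute would still record the charset of the decoded bytes. It would also keep the original bytes from being altered by the ':utf8' output layer, which re-encodes every byte above 0x7F in a byte string as if it were a Latin-1 character.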

-- telcontar

In reply to Exporting HTML in an XML document by telcontar
