Exporting HTML in an XML document

telcontar has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

As part of a larger app, I am fetching a web page, parsing it with HTML::TreeBuilder, and saving both the original and the parsed data in an XML file as CDATA. Before passing the data to TreeBuilder, I convert it to perl's internal encoding. When exporting the 'original' data, I use the "encoding" attribute on an XML container to specify the encoding of the data.

My problem is that when I try to load the data with LibXML, in some cases (e.g. http://www.w3.org/Press/1998/XSL-WD.html.ja), the parser seems to interpret some foreign character as "END CDATA", and then I get an incredible slew of parser errors. The following code illustrates this by way of example:

#!/usr/bin/perl -w

use strict;

use LWP::UserAgent;
use XML::Generator;
use XML::LibXML;


###############################
## PHASE 1: generate XML file

my $url = 'http://www.w3.org/Press/1998/XSL-WD.html.ja';
my $file = 'test';
my $ua = LWP::UserAgent->new();

my ($success, $response, $content_type, $charset, %encoding_opts);
$success = ($response = $ua->get($url))->is_success();

die "Couldn't fetch URL: '$url'" unless $success;

$content_type = $response->header('Content-Type');
# Capture in list context so a stale $1 from an earlier match is never used
($charset) = $content_type =~ /charset\s*=\s*([A-Za-z0-9_\-]+)/i if $content_type;
 
# HTTP::Message doesn't always seem to recognize Content-Type correctly, override
if ($charset) {
  $encoding_opts{charset} = $charset;
}
 
my $decoded = $response->decoded_content(%encoding_opts);
die "Cannot decode content: ". $@ unless $decoded;

my $gen = XML::Generator->new(pretty => 2, conformance => 1);

my $xml = $gen->xml(
  $gen->parsed($gen->xmlcdata($decoded)), 
  $gen->original({encoding => $charset}, $gen->xmlcdata($response->content()))
);


open(FH, '>:utf8', $file) or die "Couldn't write to file: '$file'";
print FH $xml;
close(FH);


###############################
## PHASE 2: load it

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);

When I execute, I get the following errors:

test:129: parser error : CData section not finished
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transi
    <ACRONYM lang="en" title="eXtensible Style Language">XSL</ACRONYM> 1.0 ESC$B$N

... 600+ more lines omitted ...

(My terminal can't display Japanese, and the 'ESC' above is the character the parser complains about.)
I'm sure I'm missing something obvious, and that it must be entirely possible to include any chunk of data in any encoding in an XML file (even though that may not always be wise). I would be grateful for any help.
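
One thing that could be checked here, as a rough sketch: the XML 1.0 "Char" production does not allow most control characters, ESC (0x1B) among them, to appear literally, not even inside a CDATA section. ISO-2022-JP switches character sets with escape sequences such as ESC $ B, so raw bytes in that charset would contain such a character. The helper name illegal_xml_chars and the sample string below are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Characters permitted by the XML 1.0 "Char" production. Anything else,
# including ESC (0x1B), may not appear literally, even inside CDATA.
my $illegal_xml = qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: returns the code points of all illegal characters found.
sub illegal_xml_chars {
    my ($text) = @_;
    return map { ord($_) } $text =~ /($illegal_xml)/g;
}

# Made-up sample resembling the text in the error output (ESC $ B $ N).
my $sample = "XSL 1.0 \x1B\$B\$N";
my @bad    = illegal_xml_chars($sample);
printf "illegal code point: U+%04X\n", $_ for @bad;
```

If a check like this reports U+001B for the data going into the <original> element, the raw bytes cannot be stored as-is and need to be transcoded or encoded first.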

-- telcontar

Re: Exporting HTML in an XML document by lukeyboy1 (Beadle) on Nov 19, 2007 at 10:00 UTC
This sounds like it's an XML encoding issue. I've been using this line at the top of my XML: <?xml version="1.0" encoding="ISO-8859-1"?>. I imagine that this could be set in the constructor, e.g. "encoding" => "ISO-8859-1".
Re^2: Exporting HTML in an XML document by telcontar (Beadle) on Nov 19, 2007 at 10:36 UTC
The native / default encoding for XML is UTF-8. If you look at my code, you'll see that it attempts to determine the charset of the HTML code, and that when it exports the "original" code in XML, the "encoding" attribute is set to that character set in the `<original>` element. I've ended up decoding that chunk of HTML into UTF-8 and exporting it that way as well. All my attempts to do this with arbitrary data in non-UTF-8 charsets have failed. --telcontar
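
A minimal sketch of that workaround, assuming the charset name is one that the core Encode module recognizes. The helper name to_utf8_bytes and the sample bytes are invented for illustration and are not taken from the script in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw bytes in a known charset into UTF-8
# bytes that can be written to an XML file declared as UTF-8.
sub to_utf8_bytes {
    my ($raw_bytes, $charset) = @_;
    my $chars = decode($charset, $raw_bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# Made-up ISO-2022-JP sample: ESC $ B, one JIS X 0208 character, ESC ( B.
my $raw  = "\x1B\$B0l\x1B(B";
my $utf8 = to_utf8_bytes($raw, 'iso-2022-jp');

print index($utf8, "\x1B") < 0 ? "no ESC left\n" : "ESC still present\n";
```

After decoding, the escape sequences are gone, so the result no longer contains the ESC character that the parser rejected.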
Re^2: Exporting HTML in an XML document by telcontar (Beadle) on Nov 19, 2007 at 10:43 UTC
If anyone else reads this and has run into a similar problem, I suppose another way of getting around this would be to Base64-encode the data. That'd solve all charset and encoding problems right there. Unfortunately, it also adds 33% space overhead (don't care) and decoding overhead when the file's loaded (more of a problem). -- telcontar
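
As a rough illustration of the Base64 idea, here is a sketch using the core MIME::Base64 module. The sample bytes are made up, and how the encoded string is placed into the XML (element name, attributes) is left open.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Made-up sample bytes containing the ESC character that upset the parser.
my $raw = "XSL 1.0 \x1B\$B\$N\x1B(B";

# The second argument '' suppresses the line breaks that encode_base64
# would otherwise insert after every 76 characters of output.
my $b64  = encode_base64($raw, '');
my $back = decode_base64($b64);

print "round trip ok\n" if $back eq $raw;
print "encoded ", length($raw), " bytes as ", length($b64), " characters\n";
```

The encoded string consists only of ASCII letters, digits, '+', '/' and '=', so it can go into element content regardless of the original charset, at the cost of the size and decoding overhead mentioned above.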