telcontar has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
As part of a larger app, I am fetching a web page, parsing it with HTML::TreeBuilder, and saving both the original and the parsed data in an XML file as CDATA. Before passing the data to TreeBuilder, I convert it to Perl's internal encoding. When exporting the 'original' data, I use the "encoding" attribute on an XML container to specify the encoding of the data.
My problem is that when I try to load the data with LibXML, in some cases (e.g. http://www.w3.org/Press/1998/XSL-WD.html.ja), the parser seems to interpret some foreign character as "END CDATA", and then I get an incredible slew of parser errors. The following code illustrates this by way of example:
    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use XML::Generator;
    use XML::LibXML;

    ###############################
    ## PHASE 1: generate XML file
    my $url  = 'http://www.w3.org/Press/1998/XSL-WD.html.ja';
    my $file = 'test';
    my $ua   = LWP::UserAgent->new();
    my ($success, $response, $content_type, $charset, %encoding_opts);

    $success = ($response = $ua->get($url))->is_success();
    die "Couldn't fetch URL: '$url'" unless $success;

    $content_type = $response->header('Content-Type');
    $content_type =~ /charset\s*=\s*([A-Za-z0-9_\-]+)/io if $content_type;
    $charset = $1 || undef;

    # HTTP::Message doesn't always seem to recognize Content-Type correctly, override
    if ($charset) {
        $encoding_opts{charset} = $charset;
    }

    my $decoded = $response->decoded_content(%encoding_opts);
    die "Cannot decode content: ". $@ unless $decoded;

    my $gen = XML::Generator->new(pretty => 2, conformance => 1);
    my $xml = $gen->xml(
        $gen->parsed($gen->xmlcdata($decoded)),
        $gen->original({encoding => $charset}, $gen->xmlcdata($response->content()))
    );

    open(FH, '>:utf8', $file) or die "Couldn't write to file: '$file'";
    print FH $xml;
    close(FH);

    ###############################
    ## PHASE 2: load it
    my $parser = XML::LibXML->new();
    my $doc = $parser->parse_file($file);
(My terminal can't display Japanese, and the 'ESC' below is the character the parser complains about.)

    test:129: parser error : CData section not finished
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transi
    <ACRONYM lang="en" title="eXtensible Style Language">XSL</ACRONYM> 1.0 ESC$B$N
    ... 600+ more lines omitted ...
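A possible way to narrow this down, offered as a guess rather than a confirmed diagnosis: XML 1.0 does not allow U+001B (ESC) anywhere in a document, CDATA sections included, and ISO-2022-JP switches character sets with ESC sequences, which would fit the 'ESC' in the parser output. The sketch below scans a string for characters outside the XML 1.0 `Char` production and shows that Base64-encoding the raw bytes before export leaves nothing for the parser to reject. `find_illegal_xml_chars` is a hypothetical helper, and the sample bytes are an assumed ISO-2022-JP fragment, not data taken from the page.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helper: return the distinct code points in $text that fall
# outside the XML 1.0 "Char" production (tab, LF, CR, U+0020-U+D7FF,
# U+E000-U+FFFD, U+10000-U+10FFFF).
sub find_illegal_xml_chars {
    my ($text) = @_;
    my %seen;
    while ($text =~ /([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g) {
        $seen{ ord $1 }++;
    }
    return sort { $a <=> $b } keys %seen;
}

# Assumed sample: ESC $ B ... ESC ( B, the kind of escape sequence
# ISO-2022-JP uses to switch character sets.
my $raw = "\x1B" . '$B$N' . "\x1B" . '(B';

# Report what an XML 1.0 parser would be entitled to reject.
printf "illegal in XML 1.0: U+%04X\n", $_ for find_illegal_xml_chars($raw);

# Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so it is
# always safe as element content; the second argument '' suppresses
# the line breaks encode_base64 would otherwise insert.
my $safe = encode_base64($raw, '');
my @still_bad = find_illegal_xml_chars($safe);
print "illegal characters after Base64: ", scalar(@still_bad), "\n";

# The original bytes come back unchanged on the reading side.
my $restored = decode_base64($safe);
print "round trip ok\n" if $restored eq $raw;
```

If this is indeed the cause, the "encoding" attribute on the container would then describe the bytes obtained after Base64-decoding, not the element text itself.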
Replies are listed 'Best First'.
Re: Exporting HTML in an XML document
by lukeyboy1 (Beadle) on Nov 19, 2007 at 10:00 UTC
by telcontar (Beadle) on Nov 19, 2007 at 10:36 UTC
by telcontar (Beadle) on Nov 19, 2007 at 10:43 UTC