XML::Twig doctype and entity handling

AZed has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser chokes on in my current test data.

An example of the type of non-HTML I'm dealing with:

    <html><head><metadata> <dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/"> <dc:title>An Example Title</dc:title> <dc:creator role="aut">Firstname Lastname</dc:creator> <dc:publisher>PerlMonks</dc:publisher> <dc:rights>Copyright &copy; 2008 AZed</dc:rights> <dc:description>Science/Technical. 2 words long. First published at PerlMonks, September 2008</dc:description> <dc:language id="en-us">English</dc:language> <dc:type>Short Story</dc:type> <dc:format>text/xml</dc:format> </dc-metadata> </metadata>  <metadata filepos="0000031431" href="xyz_metadata.htm"></metadata></head><body><p>Hi there!</p></body></html>

Could someone please provide an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?) My current workaround slurps the input file and uses a regular expression to split the metadata out into a temporary file with a proper XML and doctype declaration prepended. It's something of an evil hack: I have to read the result back into XML::Twig anyway, and these files can be several megabytes in size, which makes slurping very costly.

The XML::Twig documentation implies that subelement extraction of this nature should be fairly low-cost, so I'm hoping that someone can work this out. If that isn't possible, I may simply have to experiment with using sysread to chop the file up into manageable chunks -- the metadata is always at the beginning of the file, and in theory should never exceed 10k in size.

My last attempt at getting XML::Twig to read this looks like this:

    $mobihtmltwig = XML::Twig->new( 
        load_DTD => 1, 
        twig_roots => { 'metadata' => 1 }, 
        twig_handlers => { 'metadata' => \&twig_cut_metadata }, 
        output_encoding => 'utf8', 
        pretty_print => 'indented', 
        twig_print_outside_roots => 'HTML' 
        ); 
 
    $mobihtmltwig->set_doctype( 
        'package', 
        "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd", 
        "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN"); 
 
    $mobihtmltwig->entity_list->add_new_ent(copy => "&#169;"); 
 
    print $mobihtmltwig->entity_names,"\n"; 
 
    $mobihtmltwig->parsefile($mobihtmlfile);

It dies at the parsefile command with:

undefined entity at line 1, column 306, byte 306 at /usr/lib/perl5/XML/Parser.pm line 187

Byte 306 is the first &copy;. This is despite 'copy' being present in the entity list and showing up when printing $mobihtmltwig->entity_names.

Thanks for any help.


Replies are listed 'Best First'.
Re: XML::Twig doctype and entity handling by mirod (Canon) on Sep 08, 2008 at 13:41 UTC
The problem is that your XML is not well-formed. So the parser dies. That's what it is supposed to do. Setting the entity or the doctype in XML::Twig doesn't work, because the parser (expat) is at a lower level. You should include the DTD declaration in your documents, starting them with `<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">`. You can do this on-the-fly BTW by opening the file through a pipe (`open( my $package_fh, '-|', 'cat dtd_declaration real_file.xml') or die $!; $twig->parse( $package_fh);`). In fact it doesn't even matter whether the DTD is available or not, as expat will gladly ignore it (as a result of course the entities will not be expanded).
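
A minimal sketch of the prepend-a-declaration idea, done in memory rather than through a `cat` pipe. It is a variation on the advice above, not the same thing: instead of pointing at the OEB DTD, it declares the single entity the sample needs in an internal subset. The root name `metadata`, the shortened fragment, and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Internal DTD subset declaring the one entity the sample data uses.
# Assumption: the fragment's root element is <metadata>.
my $prolog = <<'DTD';
<!DOCTYPE metadata [
  <!ENTITY copy "&#169;">
]>
DTD

# A well-formed fragment modelled on the sample in the question.
my $fragment =
      '<metadata><dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core">'
    . '<dc:rights>Copyright &copy; 2008 AZed</dc:rights>'
    . '</dc-metadata></metadata>';

my $twig = XML::Twig->new;

# Parsing $fragment alone dies with "undefined entity";
# with the prolog in front, expat knows what &copy; refers to.
$twig->parse( $prolog . $fragment );

# Namespaces are not processed by default, so the tag is literally 'dc:rights'.
my $rights = $twig->first_elt('dc:rights');
print $rights->text, "\n";
```

Whether the printed text shows the expanded character or the entity reference depends on the twig's entity options, so no output is claimed here. This only helps when everything handed to the parser is well-formed.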
Re^2: XML::Twig doctype and entity handling by AZed (Monk) on Sep 08, 2008 at 18:40 UTC
Ah, open-as-pipe does take care of the problem, thanks. I had been hoping that setting a doctype via set_doctype() would have prepended the assigned doctype declaration to the input, but if it doesn't, it doesn't. Unfortunately, it looks like the parser will try to handle tags outside of the twig roots anyway, meaning that even though the `<metadata>...</metadata>` clump that I want to work with is well-formed, the parser will still die before the twig I need is returned because the junk surrounding it is not. Amusingly, this technique does work to split out the HTML without the `<metadata>` elements, because `twig_print_outside_roots` will finish before the parser dies from mismatched tags as the text ends. Sysread it is, then. Thanks, again.
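
The sysread route mentioned above could look roughly like this. It is a sketch under stated assumptions, not the code the poster ended up with: the function name `extract_metadata`, the 10 KB default (taken from the "never exceed 10k" estimate in the question), and the choice to return only the first `<metadata>` block are all hypothetical. It uses core Perl only.

```perl
use strict;
use warnings;

# Read at most $limit bytes from the start of $path and return the first
# <metadata>...</metadata> block found there, prefixed with a DOCTYPE that
# declares &copy;. Returns nothing if no block is found.
sub extract_metadata {
    my ($path, $limit) = @_;
    $limit ||= 10 * 1024;    # assumed upper bound on the metadata size

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buffer = '';

    # sysread may return fewer bytes than requested, so loop until the
    # buffer is full or the file ends.
    while ( length($buffer) < $limit ) {
        my $got = sysread $fh, $buffer, $limit - length($buffer), length($buffer);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;    # end of file
    }
    close $fh;

    # Non-greedy match for the first block; /s lets . span newlines.
    my ($block) = $buffer =~ m{(<metadata(?:\s[^>]*)?>.*?</metadata>)}s;
    return unless defined $block;

    # Internal subset so the parser accepts &copy; in the fragment.
    return qq{<!DOCTYPE metadata [\n  <!ENTITY copy "&#169;">\n]>\n} . $block;
}
```

The returned string is a standalone document that can be passed to `$twig->parse(...)`, so the malformed HTML around the metadata never reaches expat and the multi-megabyte body is never read.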
Re: XML::Twig doctype and entity handling by Anonymous Monk on Sep 07, 2008 at 21:32 UTC
Maybe you need to use option expand_external_ents/set_expand_external_entities
Re^2: XML::Twig doctype and entity handling by AZed (Monk) on Sep 07, 2008 at 22:02 UTC
You mean, setting `expand_external_ents => -1,`? Interesting thought, though I'm not sure what would happen if it ever got to the stage of being able to print the output. Unfortunately, however, that remains true even after trying it -- the parser still dies at the same spot even with it set to -1. As an aside, `keep_encoding => 1` doesn't help, either.