AZed has asked for the wisdom of the Perl Monks concerning the following question:
I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)
The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser will choke on in my current test data.
An example of the type of non-HTML I'm dealing with:
<html><head><metadata> <dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/"> <dc:title>An Example Title</dc:title> <dc:creator role="aut">Firstname Lastname</dc:creator> <dc:publisher>PerlMonks</dc:publisher> <dc:rights>Copyright &copy; 2008 AZed</dc:rights> <dc:description>Science/Technical. 2 words long. First published at PerlMonks, September 2008</dc:description> <dc:language id="en-us">English</dc:language> <dc:type>Short Story</dc:type> <dc:format>text/xml</dc:format> </dc-metadata> </metadata> <metadata filepos="0000031431" href="xyz_metadata.htm"></metadata></head><body><p>Hi there!</p></body></html>
Could someone please provide an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?) My current workaround slurps the input file and uses a regular expression to split the metadata out into a temporary file with a proper XML declaration and doctype declaration prepended. It's something of an evil hack, given that I have to read the result back into XML::Twig anyway, and these files can be several megabytes in size, which makes slurping very costly.
The XML::Twig documentation implies that subelement extraction of this nature should be fairly low-cost, so I'm hoping that someone can work this out. If it isn't, I may simply have to experiment with using sysread to chop it up into manageable chunks -- the metadata is always at the beginning of the file, and in theory should never exceed 10k in size.
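For illustration, the chunked-read idea might look roughly like the sketch below. It uses only core Perl. The function name read_metadata_block and the 16 KB default limit are assumptions made for this example (the limit is just headroom over the 10k estimate), and the regular expression assumes the first metadata element contains no nested metadata element.

```perl
use strict;
use warnings;

# Hypothetical helper: read at most $limit bytes from the start of $path
# and return the first <metadata>...</metadata> block found in that chunk,
# or undef if no complete block is present.
sub read_metadata_block {
    my ($path, $limit) = @_;
    $limit //= 16 * 1024;    # assumed default, not a documented Mobipocket limit

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $head = '';
    # read() may return fewer bytes than requested, so keep appending
    # until the limit is reached or the file ends.
    while (length($head) < $limit) {
        my $got = read($fh, $head, $limit - length($head), length($head));
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;
    }
    close($fh);

    # Non-greedy match, so only the first metadata element is captured.
    return $head =~ m{(<metadata\b[^>]*>.*?</metadata>)}s ? $1 : undef;
}
```

Called as read_metadata_block($mobihtmlfile), this touches only the head of the file regardless of its total size. If the closing tag falls beyond the limit, the function returns undef rather than a truncated block.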
My last attempt at getting XML::Twig to read this looks like this:
$mobihtmltwig = XML::Twig->new(
    load_DTD => 1,
    twig_roots => { 'metadata' => 1 },
    twig_handlers => { 'metadata' => \&twig_cut_metadata },
    output_encoding => 'utf8',
    pretty_print => 'indented',
    twig_print_outside_roots => 'HTML'
);
$mobihtmltwig->set_doctype(
    'package',
    "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
    "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN"
);
$mobihtmltwig->entity_list->add_new_ent(copy => "©");
print $mobihtmltwig->entity_names,"\n";
$mobihtmltwig->parsefile($mobihtmlfile);
It dies at the parsefile command with:
undefined entity at line 1, column 306, byte 306 at /usr/lib/perl5/XML/Parser.pm line 187
Byte 306 is the first &copy;. This is despite 'copy' being present in the entity list and showing up when printing $mobihtmltwig->entity_names.
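Some background that may explain the error: XML predefines only five named entities (lt, gt, amp, apos, quot), and any other entity reference must be declared in a DTD that the parser actually sees in the input. A possible way to satisfy that, sketched below in core Perl, is to prepend a DOCTYPE with an internal subset to an extracted metadata fragment. The function name with_entity_doctype and the small entity table are assumptions made for this example, and the table covers only a few of the XHTML entities.

```perl
use strict;
use warnings;

# Illustrative subset of the XHTML named entities, mapped to their
# Unicode code points. Extend as needed; this is not the full list.
my %ENTITY_CODEPOINT = (
    copy  => 169,
    reg   => 174,
    nbsp  => 160,
    mdash => 8212,
);

# Hypothetical helper: prepend a DOCTYPE whose internal subset declares
# each entity above as a numeric character reference. The fragment's
# root element is assumed to be named by $root.
sub with_entity_doctype {
    my ($fragment, $root) = @_;
    my $decls = join '',
        map { qq{  <!ENTITY $_ "&#$ENTITY_CODEPOINT{$_};">\n} }
        sort keys %ENTITY_CODEPOINT;
    return "<!DOCTYPE $root [\n$decls]>\n" . $fragment;
}
```

The returned string starts with the declarations and is followed by the untouched fragment, so it could be passed to a parser as a string instead of going through a temporary file. Whether XML::Twig then expands the entity or keeps it as a reference is something to check against its documentation.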
Thanks for any help.
Re: XML::Twig doctype and entity handling
by mirod (Canon) on Sep 08, 2008 at 13:41 UTC
by AZed (Monk) on Sep 08, 2008 at 18:40 UTC

Re: XML::Twig doctype and entity handling
by Anonymous Monk on Sep 07, 2008 at 21:32 UTC
by AZed (Monk) on Sep 07, 2008 at 22:02 UTC