
I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser chokes on in my current test data.

An example of the type of non-HTML I'm dealing with:

<html><head><metadata> <dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/"> <dc:title>An Example Title</dc:title> <dc:creator role="aut">Firstname Lastname</dc:creator> <dc:publisher>PerlMonks</dc:publisher> <dc:rights>Copyright &copy; 2008 AZed</dc:rights> <dc:description>Science/Technical. 2 words long. First published at PerlMonks, September 2008</dc:description> <dc:language id="en-us">English</dc:language> <dc:type>Short Story</dc:type> <dc:format>text/xml</dc:format> </dc-metadata> </metadata>  <metadata filepos="0000031431" href="xyz_metadata.htm"></metadata></head><body><p>Hi there!</p></body></html>

Could someone please provide me with an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?) My current workaround is to slurp the input file, use a regular expression to split the metadata out, and write it to a temporary file with a proper XML declaration and doctype declaration prepended. It's something of an evil hack, given that I have to read the results back into XML::Twig anyway. These files can also be several megabytes in size, which makes slurping a very costly technique.

The XML::Twig documentation implies that subelement extraction of this nature should be fairly low-cost, so I'm hoping that someone can work this out. If it can't be done that way, I may simply have to experiment with using sysread to chop the file up into manageable chunks -- the metadata is always at the beginning of the file, and in theory should never exceed 10k in size.
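For reference, the chunked-read fallback could be sketched roughly as follows. This is only a sketch under stated assumptions, not code from the converter: the helper names `read_leading_metadata` and `wrap_with_entities` are hypothetical, the 10k default limit is just the upper bound mentioned above, and the wrapper assumes the fragment is ASCII or UTF-8, since no encoding is declared. It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: read at most $limit bytes from the start of $path and
# return the first <metadata>...</metadata> block, or undef if none is found.
sub read_leading_metadata {
    my ($path, $limit) = @_;
    $limit //= 10 * 1024;    # assumed upper bound on the metadata size

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $chunk = '';
    # read() may return fewer bytes than requested, so loop until limit or EOF.
    while (length($chunk) < $limit) {
        my $got = read($fh, $chunk, $limit - length($chunk), length($chunk));
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;
    }
    close($fh);

    # Non-greedy match, so the second (empty) <metadata ...> element is ignored.
    return $chunk =~ m{(<metadata\b[^>]*>.*?</metadata>)}s ? $1 : undef;
}

# Hypothetical helper: prepend an XML declaration and a DOCTYPE whose internal
# subset declares the entities the fragment is known to use.
sub wrap_with_entities {
    my ($fragment, %entities) = @_;
    my $decls = join '',
        map { '  <!ENTITY ' . $_ . ' "' . $entities{$_} . '">' . "\n" }
        sort keys %entities;
    return qq{<?xml version="1.0"?>\n}
         . "<!DOCTYPE metadata [\n" . $decls . "]>\n"
         . $fragment . "\n";
}
```

Something like `wrap_with_entities(read_leading_metadata($mobihtmlfile), copy => '&#169;')` would then yield a string that could be handed to the twig's `parse` method instead of going through a temporary file. Whether declaring the entities in an internal subset is enough to keep the parser happy for every entity in real files is untested.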

My last attempt at getting XML::Twig to read this looks like this:

    $mobihtmltwig = XML::Twig->new( 
        load_DTD => 1, 
        twig_roots => { 'metadata' => 1 }, 
        twig_handlers => { 'metadata' => \&twig_cut_metadata }, 
        output_encoding => 'utf8', 
        pretty_print => 'indented', 
        twig_print_outside_roots => 'HTML' 
        ); 
 
    $mobihtmltwig->set_doctype( 
        'package', 
        "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd", 
        "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN"); 
 
    $mobihtmltwig->entity_list->add_new_ent(copy => "&#169;"); 
 
    print $mobihtmltwig->entity_names,"\n"; 
 
    $mobihtmltwig->parsefile($mobihtmlfile);

It dies at the parsefile command with:

undefined entity at line 1, column 306, byte 306 at /usr/lib/perl5/XML/Parser.pm line 187

Byte 306 is the first &copy;. This is despite 'copy' being present in the entity list and showing up when printing $mobihtmltwig->entity_names.

Thanks for any help.

In reply to XML::Twig doctype and entity handling by AZed
