Hey Clint,
The answer to your first question is that XML::LibXML will not parse tag soup, not even in recover mode. HTML::Parser is therefore the better choice for dodgy HTML.
The idea of XML::LibXML's recover mode is to recover malformed XML documents up to the point where non-XML code starts. This is usually a closing tag in a location where the parser does not expect it. In recover mode XML::LibXML returns the part of the document that was parsed successfully instead of throwing an error. Anything that comes after the error is ignored.
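To make that concrete, here is a minimal sketch of recover mode. The input string is a made-up example with a stray closing tag, and where exactly the parser cuts off may depend on the libxml2 version installed, so treat the printed result as something to inspect rather than rely on.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the stray </b> is where well-formedness breaks.
my $broken = '<root><a>ok</a></b><c>lost</c></root>';

# recover => 2 turns on recover mode and suppresses the parser warnings
# (recover => 1 would print them to STDERR instead).
my $parser = XML::LibXML->new( recover => 2 );
my $doc    = $parser->parse_string($broken);

# Print whatever the parser managed to build before it gave up.
print $doc->documentElement->toString, "\n";
```

Without the recover option the same parse_string call dies with a parser error, so you would have to wrap it in eval.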
Your second question has several answers and is quite complex.
DocumentFragments cannot exist without an owning document when using XML::LibXML.
That means that you have to create a temporary document in order to create a document fragment. The following two lines show how.
my $tempdoc = XML::LibXML->createDocument($version,$encoding);
my $docfrag = $tempdoc->createDocumentFragment();
You should also use the DOM functions to create element nodes, text nodes, etc., instead of
my $element = XML::LibXML::Element->new( $tag );
The following line is the most generic way following the DOM paradigm.
my $element = $self->{ _HXS_parent }->ownerDocument->createElement($tag);
If $tempdoc is available, the statement can be simplified to the following line.
my $element = $tempdoc->createElement($tag);
As a result of this process either the result document or the document fragment will be returned.
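Put together, the DOM route might look like the sketch below. The version and encoding values and the element names are only illustrative; they are not taken from your code.

```perl
use strict;
use warnings;
use XML::LibXML;

# Temporary owner document; '1.0' and 'UTF-8' are illustrative values.
my $tempdoc = XML::LibXML->createDocument( '1.0', 'UTF-8' );
my $docfrag = $tempdoc->createDocumentFragment();

# Create every node through the owning document, not via Element->new.
my $para = $tempdoc->createElement('p');
$para->appendChild( $tempdoc->createTextNode('Hello ') );

my $bold = $tempdoc->createElement('b');
$bold->appendChild( $tempdoc->createTextNode('world') );
$para->appendChild($bold);

# The fragment collects the finished subtree.
$docfrag->appendChild($para);

print $docfrag->toString, "\n";
```

Because all nodes come from the same document, nothing has to be imported or adopted when the fragment is later appended to the result document.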
Although this is the answer to your second question, I would suggest a different approach. Instead of using DOM functions to build the result data, use a SAX pipeline to handle the HTML tokens. In that case all the tricky DOM building is left to the good old XML::LibXML::SAX::Builder, which comes with XML::LibXML (or to any other SAX handler, such as XML::SAX::Writer).
The idea behind SAX is that any token is defined by two boundaries (which are the tags in XML). A SAX parser indicates these boundaries by sending "start_element" and "end_element" events to a SAX pipeline. When a SAX handler in the pipeline receives these events, the handler translates them into whatever is appropriate (e.g. create DOM nodes or write XML data to an IO stream). Although this sounds complex, for the parser side XML::SAX::Base does all the abstraction, so token handling can be reduced (more or less) to the following code:
package MYCLASS;
use XML::SAX::Base;
use XML::LibXML::SAX::Builder;
@MYCLASS::ISA = qw(XML::SAX::Base);

sub static_handler {
    my $self = shift;
    $self->set_handler( XML::LibXML::SAX::Builder->new() );
}

sub result {
    my $self = shift;
    return $self->get_handler()->result();    # will return a DOM structure
}

sub found_token {
    my ($self, $tag, $attributes) = @_;
    $self->check_active_tokens();    # and end them if required
    my $e = _element($tag);
    for my $key ( keys %$attributes ) {
        _add_attrib( $e, $key, $attributes->{$key} );
    }
    # assumption: remember the open token so it can be ended later
    push @{ $self->{_TOKENS_} }, $tag;
    $self->start_element($e);
}

sub check_active_tokens {
    my $self = shift;
    # this allows no nested tags.
    # tag soup logic should end up in this function
    foreach my $tag ( reverse @{ $self->{_TOKENS_} || [] } ) {
        $self->end_element( _element($tag, 1) );
    }
    $self->{_TOKENS_} = [];    # everything is closed now
}
# the following functions I stole from Matt's code ;)
# it simplifies the handling of SAX data structures for
# those cases that don't require namespace handling.
sub _element {
    my ($name, $end) = @_;
    return {
        Name         => $name,
        LocalName    => $name,
        $end ? () : (Attributes => {}),
        NamespaceURI => '',
        Prefix       => '',
    };
}

sub _add_attrib {
    my ($el, $name, $value) = @_;
    $el->{Attributes}{"{}$name"} = {
        Name         => $name,
        LocalName    => $name,
        Prefix       => '',
        NamespaceURI => '',
        Value        => $value,
    };
    return $el;
}
Of course the application logic of the final code will be more complex. But this outline shows the principle.
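If you want to see the event flow in isolation, the following stand-alone sketch may help. It is not the class from above: it uses a plain XML::SAX::Base object as the driver, a hypothetical helper called sax_element, and it takes the DOM from the return value of end_document (which XML::SAX::Base passes through from the handler) rather than calling result().

```perl
use strict;
use warnings;
use XML::SAX::Base;
use XML::LibXML;
use XML::LibXML::SAX::Builder;

# Hypothetical helper: builds the hash a SAX element event expects,
# without any namespace handling.
sub sax_element {
    my ( $name, %attrs ) = @_;
    my %sax_attrs;
    for my $key ( keys %attrs ) {
        $sax_attrs{"{}$key"} = {
            Name => $key, LocalName => $key, Prefix => '',
            NamespaceURI => '', Value => $attrs{$key},
        };
    }
    return {
        Name => $name, LocalName => $name, Prefix => '',
        NamespaceURI => '', Attributes => \%sax_attrs,
    };
}

# A plain XML::SAX::Base object forwards every event to its handler.
my $driver = XML::SAX::Base->new(
    Handler => XML::LibXML::SAX::Builder->new(),
);

$driver->start_document( {} );
$driver->start_element( sax_element( 'p', class => 'intro' ) );
$driver->characters( { Data => 'Hello ' } );
$driver->start_element( sax_element('b') );
$driver->characters( { Data => 'world' } );
$driver->end_element( sax_element('b') );
$driver->end_element( sax_element('p') );

# end_document returns whatever the handler returns; for the Builder
# that is the finished DOM document.
my $dom = $driver->end_document( {} );

print $dom->documentElement->toString, "\n";
```

Note that the driver never touches a DOM node itself. Swapping the Builder for another SAX handler changes the output format without changing any of the event calls.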
The advantage of this approach is that you can avoid most recursions that are necessary for DOM building from semistructured data. For instance, I use this approach to parse WIKI code into DOM structures.
Christian
--
PHISH @ CPAN