AlexTape has asked for the wisdom of the Perl Monks concerning the following question:

Dear omniscient monks,

I have some HTML/TEI-like data that I want to convert to XML. It works pretty well for some files, but not for all of them. Here is my code:
# pragma
use strict;
use warnings;

# modules
use XML::Simple;
use XML::Tidy;
use Data::Dumper;
use Data::Diver qw( Dive DiveRef DiveError );
use HTML::TreeBuilder;
use XML::Tidy::Tiny;

# little helper
use constant false => 0;
use constant true  => 1;

...

# get instance of treebuilder
my $root = HTML::TreeBuilder->new();

# configure treebuilder
$root->ignore_unknown( false );

# dump data to the treebuilder
$root->parse( $fileData );

# get name for target file
my $target = $file;
$target =~ s/$fileExtension$/xml/;

# open output filehandle
open( $FH, '>', $target );

# configure output
binmode $FH, ":utf8";

# ERROR HERE 208:
my $data = $root->guts()->as_XML();
print $FH xml_tidy( $data );
close $FH;

...
caption has an invalid attribute name 'n' at script.pl line 208
I substituted every 'n' in the file, but still got the same error, so the 'n' itself does not seem to be the cause. I don't know what is going on here.
$root->guts()
is okay; the problem is all in ->as_XML() :-((

kindly, perlig

$perlig =~ s/pec/cep/g if 'errors expected';

Replies are listed 'Best First'.
Re: HTML::TreeBuilder, HTML::Element, as_XML()
by Jenda (Abbot) on May 23, 2013 at 15:16 UTC

    IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR*$/ );
    not
    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR+$/ );

    The XML specs say that

    Name ::= NameStartChar (NameChar)*
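
The difference between the two quantifiers can be checked in isolation. The following is a minimal sketch, not the module's actual code: `$START_CHAR` and `$NAME_CHAR` here are simplified ASCII-only stand-ins (an assumption), and `valid_name_plus` / `valid_name_star` are hypothetical helper names introduced for illustration.

```perl
use strict;
use warnings;

# Simplified ASCII-only character classes (assumption). The XML spec's
# NameStartChar and NameChar productions also cover many Unicode ranges.
my $START_CHAR = qr/[A-Za-z_:]/;
my $NAME_CHAR  = qr/[A-Za-z0-9_:.\-]/;

# '+' requires at least one NameChar after the start character,
# so every one-character name is rejected.
sub valid_name_plus {
    my ($name) = @_;
    return $name =~ /^$START_CHAR$NAME_CHAR+$/ ? 1 : 0;
}

# '*' allows zero or more, matching: Name ::= NameStartChar (NameChar)*
sub valid_name_star {
    my ($name) = @_;
    return $name =~ /^$START_CHAR$NAME_CHAR*$/ ? 1 : 0;
}

for my $name ( 'n', 'id', 'xml:lang', '1abc' ) {
    printf "%-8s plus=%d star=%d\n",
        $name, valid_name_plus($name), valid_name_star($name);
}
```

With these stand-in classes, a one-character name such as 'n' passes only the `*` form, which would be consistent with the error in the original post, and would explain why removing the letter 'n' from the text of the file changes nothing: the complaint is about the length of the attribute name, not about the letter.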

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

      I wouldn't go that far; the OP provides no data.

        The OP doesn't need to provide data, the code doesn't match the specs linked five lines above the code in question.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

Re: HTML::TreeBuilder, HTML::Element, as_XML()
by ambrus (Abbot) on May 24, 2013 at 13:07 UTC

    Look in the implementation of XML::Twig for the workarounds it uses when HTML::Tree's as_XML method dies.
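
Independently of what XML::Twig does internally (which is not reproduced here), a script can at least keep running and report the offending file when as_XML dies. A minimal sketch of that generic pattern, assuming only that the object passed in has an as_XML method; `try_as_xml` is a hypothetical helper name, not part of HTML::Tree or XML::Twig:

```perl
use strict;
use warnings;

# Call as_XML on any object that provides it and trap a fatal error.
# Returns ( $xml, undef ) on success, or ( undef, $error ) on failure.
sub try_as_xml {
    my ($element) = @_;
    my $xml = eval { $element->as_XML() };
    return ( undef, $@ || 'unknown error' ) unless defined $xml;
    return ( $xml, undef );
}

# Possible use in the script from the original post (not executed here):
#   my ( $data, $err ) = try_as_xml( $root->guts() );
#   warn "skipping $file: $err" unless defined $data;
```

This does not fix the invalid-name check; it only turns a fatal error into a value the caller can inspect, so the remaining files in a batch can still be processed.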