Re^4: How to Truncate Corrupt Document.xml Files?

I read that a SAX parser is not well suited to rebuilding the XML document, which is what I want to do, unless I use two parser instances: one SAX parser to analyze the document.xml file, and a second one with XML::Parser to actually add the missing end tags and rebuild document.xml.

However, is there any real benefit to using SAX here? Can't I just define, say, a start handler with XML::Parser that adds non-self-closing tags to an array, and an end handler that removes tags from the same array? Then at the end of parsing, all that would be left in the array would be the tags the end handler never found, and these could be added to the end of the XML file in reverse order, last in first out.

Re^5: How to Truncate Corrupt Document.xml Files? by educated_foo (Vicar) on Feb 16, 2012 at 04:36 UTC
> Can't I just define, say, a start handler with XML::Parser that adds non-self-closing tags to an array, and an end handler that removes tags from the same array? Then at the end of parsing, all that would be left in the array would be the tags the end handler never found, and these could be added to the end of the XML file in reverse order, last in first out.

That's basically what I was trying to suggest. SAX is one common stream-based parser that people coming from a non-Perl background might know. XML::Parser is another stream-based parser which is, IMHO, easier to use.
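The stack idea quoted above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the helper names `open_elements` and `closing_tags` are made up for the example. It relies on XML::Parser dying at the first well-formedness error, so the `parse` call is wrapped in `eval` and whatever is left on the stack is treated as the set of elements that were open at that point.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Return the names of the elements that were still open when parsing
# stopped, outermost first. (Hypothetical helper, for illustration only.)
sub open_elements {
    my ($xml) = @_;
    my @stack;
    my $parser = XML::Parser->new(
        Handlers => {
            # Start fires for every element, self-closing ones included ...
            Start => sub { my ($expat, $name) = @_; push @stack, $name },
            # ... and End fires for them too, so pushes and pops stay paired.
            End   => sub { pop @stack },
        },
    );
    # parse() dies on the first well-formedness error. The eval keeps the
    # script alive, and @stack still holds what was open at that point.
    eval { $parser->parse($xml); 1 };
    return @stack;
}

# Build the missing closing tags in last-in, first-out order.
sub closing_tags {
    my ($xml) = @_;
    return join '', map { "</$_>" } reverse open_elements($xml);
}

print closing_tags('<doc><body><p>text'), "\n";
```

Because the End handler is only ever called for the innermost open element, a plain `pop` is enough here; there is no need to search the array for a matching name.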
Re^6: How to Truncate Corrupt Document.xml Files? by socrtwo (Sexton) on Feb 16, 2012 at 18:08 UTC
I constructed the beginnings of a script that is supposed to keep a running total of non-ended tags with XML::Parser. The problem is that XML::Parser errors out when the XML is defective, which is exactly when I want the rest of the script to work. So I'm assuming that I have to switch to SAX so the script will run as a stream and add to and subtract from the array until it hits XML corruption, as I expect you were originally suggesting.

So here's the script with XML::Parser that doesn't run when validation problems exist. When they don't exist it returns nothing for the @tags array, which should be correct.

    #!/usr/bin/perl
    use XML::Parser;
    use strict;

    my $xml_file = $ARGV[0];
    my $parser = new XML::Parser;
    $parser->setHandlers(
        Start => \&start_tag_handler,
        End   => \&end_tag_handler,
    );
    $parser->parsefile($xml_file);

    my @tags;

    sub start_tag_handler {
        my $p = shift;
        my $element = shift;
        my $parent = $p->current_element;
        my $realtag = "$parent::$element";
        push(@tags, $realtag);
    }

    sub end_tag_handler {
        my $p2 = shift;
        my $element2 = shift;
        my $parent2 = $p2->current_element;
        my $realtag2 = "$parent2::$element2";
        my $index = 0;
        $index++ until $tags[$index] eq "$realtag2";
        splice(@tags, $index, 1);
    }

    open (MYFILE, '>data.txt');
    print MYFILE "Tags in the array are @tags\n";
    close (MYFILE);

Update: On another subject that is crucial for me: why are arrays initialized outside a subroutine, like @tags above, available outside the subroutine in a script but not in a module like the one below?

    package truncator;
    require 5.005_62;
    use strict;
    use XML::SAX::Base;
    our @ISA = ('XML::SAX::Base');
    our $VERSION = '0.01';

    my @tags;

    sub new {
        my ($type) = @_;
        return bless {}, $type;
    }

    my $current_element = '';

    sub start_element {
        my ($self, $element) = @_;
        $current_element = $element->{Name};
        push(@tags, $current_element);
    }

    print @tags;
    1;

The print @tags line doesn't return anything when outside the subroutine, but it would if it were in a script.
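On the module question, one likely explanation is timing rather than scope: statements at the top level of a module run once, when the module is loaded, which is before the parser has called any handler. The sketch below imitates that in a single file. The package name `Truncator`, the `tags` accessor and `$count_at_load` are hypothetical names for the illustration, and the handler is called by hand instead of by a SAX driver.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package Truncator;          # hypothetical stand-in for the module above

    my @tags;                   # file-scoped lexical, visible to the subs below

    sub start_element {
        my ($self, $element) = @_;
        push @tags, $element->{Name};
    }

    # Accessor: lets the caller read the array after parsing has happened.
    sub tags { return @tags }

    # Top-level statement: runs once, at load time, before any handler call.
    our $count_at_load = scalar @tags;
}

# Simulate the parser calling the handler twice.
Truncator->start_element({ Name => 'w:document' });
Truncator->start_element({ Name => 'w:body' });

print "at load time: $Truncator::count_at_load tag(s)\n";
print "after the handler calls: ", join(' ', Truncator::tags()), "\n";
```

In the script version the print comes after `parsefile`, so the handlers have already run by then. In a module, an accessor like `tags` (or an `end_document` handler) is one way to read the array once parsing is done.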
Update: It looks like I was reinventing the wheel. xmllint will reliably put the correct ending tags on corrupt XML with the `--recover` option. I did find a case, though, where its truncation and ending-tag solution didn't suit MS Word. So what I want to do now is figure out how to truncate an XML file a configurable number of characters before the first error, and then run `xmllint --recover` on the result.
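A rough sketch of that truncate-then-recover step, under several assumptions: the position of the first error is taken from the "byte N" part of the message XML::Parser dies with, the margin is counted in bytes rather than characters (so a multi-byte UTF-8 character could be split at the cut), and the helper names and the `.cut` / `.fixed` file suffixes are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Byte offset of the first well-formedness error, or undef if the input
# parses cleanly. Assumes the die message contains "byte N".
sub first_error_byte {
    my ($xml) = @_;
    return if eval { XML::Parser->new->parse($xml); 1 };
    return $@ =~ /byte (\d+)/ ? $1 : undef;
}

# Keep everything up to $margin bytes before the first error.
sub truncate_before_error {
    my ($xml, $margin) = @_;
    my $byte = first_error_byte($xml);
    return $xml unless defined $byte;
    my $keep = $byte - $margin;
    $keep = 0 if $keep < 0;
    return substr($xml, 0, $keep);
}

# Usage (hypothetical script name): perl truncate.pl document.xml 50
if (@ARGV) {
    my ($file, $margin) = @ARGV;
    $margin //= 0;

    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $xml = do { local $/; <$in> };
    close $in;

    open my $out, '>:raw', "$file.cut" or die "Cannot write $file.cut: $!";
    print {$out} truncate_before_error($xml, $margin);
    close $out or die "Cannot close $file.cut: $!";

    # Let xmllint add the missing end tags to the truncated copy.
    system('xmllint', '--recover', '--output', "$file.fixed", "$file.cut");
}
```

The cut can land in the middle of a tag, so the margin may need tuning per document before the recovered file opens cleanly in Word.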

