in reply to Repair malformed XML

XML::LibXML is probably the way to go here, but here is an attempt using XML::Parser. The idea is just to automate the cycle "run parser - see it die - fix error" until the document passes. So the code runs XML::Parser, traps the error message, fixes the original document and retries, until either there is no error message or the last error message is repeated, in which case it came across an error that it could not fix. Every fix re-parses and rewrites the whole file, so this is probably too slow to process an 80M file missing a lot of tags, but it is correct, as in "no XML weirdness is going to trip it", and it could be extended to fix other types of errors.

#!/usr/bin/perl -w
use strict;
use XML::Parser;

my $file= 'crap.xml';
my $fixes=0;

my @tags; # stack of tags used to figure out the last non closed tag

my $p= XML::Parser->new( Handlers => { Start => sub { push @tags, $_[1]; },
                                       End   => sub { pop @tags; },
                                     },
                         ErrorContext => 1,
                       );

my( $error, $last_error);
do
  { $last_error= $error||'';
    @tags= ();   # start each pass with an empty stack
    eval{ $p->parsefile( $file); };
    $error= $@;  # keep the message so the loop test can compare it with the previous one
    #warn "error: $error => close $tags[-1]\n" if( $error && ($error ne $last_error));
    if( $error=~ m{^\s*mismatched tag at line (\d+), column (\d+)})
      { close_tag( $file, $tags[-1], $1, $2);
        $fixes++;
      }
    # you could add other types of fixes below
  } until( !$error || ($error eq $last_error));

if( $error) { print "could not fix the file: $error\n"; }
else        { print "success! ($fixes tags fixed)\n";   }

sub close_tag
  { my( $file, $tag, $line, $column)= @_;
    my $temp= "crap.new";
    open( my $in,  '<', $file) or die "cannot open file (r) '$file': $!\n";
    open( my $out, '>', $temp) or die "cannot open file (w) '$temp': $!\n";

    # print the beginning of the file (untouched)
    for (1..$line-1) { print {$out} scalar <$in>; }

    # close the tag
    my $faulty_line=<$in>;
    # the reported column seems to be off by 3, but I suspect this might
    # vary depending on the xml prefix, so searching backwards for the '<'
    # that starts the offending closing tag looks safer
    my $real_column= rindex( $faulty_line, '<', $column);
    $real_column= 0 if( $real_column < 0);  # no '<' found: insert at the start of the line
    substr( $faulty_line, $real_column, 0)= "</$tag>\n";
    print {$out} $faulty_line;

    # finish printing
    while( <$in>) { print {$out} $_; }

    close $in;
    close $out;
    rename $temp, $file or die "cannot replace file '$file' by new version in '$temp': $!\n";
  }
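
The trickiest step is deciding where the missing closing tag goes on the faulty line, so here is a minimal sketch of just that step on a plain string. insert_close_tag is a hypothetical helper, not part of the script above, and it assumes the reported column is a 0-based offset that falls at or after the '<' of the offending closing tag, which is what I seem to get from the parser but have not checked for every case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: insert "</$tag>" in front of the closing tag the
# parser complained about. $column is assumed to be a 0-based offset at or
# after that closing tag's '<'.
sub insert_close_tag
  { my( $line, $tag, $column)= @_;
    my $pos= rindex( $line, '<', $column);  # last '<' at or before $column
    $pos= 0 if( $pos < 0);                  # no '<' found: insert at line start
    substr( $line, $pos, 0)= "</$tag>\n";   # zero-length replacement == insertion
    return $line;
  }

# <i> is never closed; column 17 points at the 'b' of '</b>'
print insert_close_tag( "<b>bold <i>both</b>\n", 'i', 17);
```

The inserted tag ends with a newline, so the offending closing tag moves to the next line and the next parser pass reports a different position. That is what lets the main loop tell "made progress" apart from "same error again".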