XML::LibXML is probably the way to go here, but here is an attempt using XML::Parser. The idea is just to automate the cycle "run parser - see it die - fix error" until the document passes. So the code runs XML::Parser, traps the error message, fixes the original document and retries, until either there is no error message or the same error message comes back, in which case it has come across an error that it cannot fix. This is probably too slow to process an 80M file missing a lot of tags (the whole file is re-parsed and rewritten for every fix), but it is correct, as in "no XML weirdness is going to trip it", and it could be extended to fix other types of errors.

#!/usr/bin/perl -w

use strict;
use XML::Parser;

my $file= 'crap.xml';
my $fixes=0;

my @tags; # stack of tags used to figure out the last non closed tag

my $p= XML::Parser->new( Handlers => { Start => sub { push @tags, $_[1]; },
                                       End   => sub { pop  @tags;        },
                                     },
                         ErrorContext => 1,
                       );

my( $error, $last_error);

do
  { $last_error= $error||'';
    @tags= ();   # start each parse with an empty tag stack
    undef $@;
    eval{ $p->parsefile( $file); };
    $error= $@;  # remember the message so that a repeated error ends the loop
    #warn "error: $error => close $tags[-1]\n" if( $error && ($error ne $last_error));
    if( $error=~ m{^\s*mismatched tag at line (\d+), column (\d+)})
      { close_tag( $file, $tags[-1], $1, $2); $fixes++; }
    # you could add other types of fixes below
  } until( !$error || ($error eq $last_error));

if( $@)
  { print "could not fix the file: $@\n"; }
else
  { print "success! ($fixes tags fixed)\n"; }
  
sub close_tag
  { my( $file, $tag, $line, $column)= @_;
    my $temp= "crap.new";
    open( my $in,  '<', $file) or die "cannot open file (r) '$file': $!\n";
    open( my $out, '>', $temp) or die "cannot open file (w) '$temp': $!\n";
    # print the beginning of the file (untouched)
    for (1..$line-1) { print {$out} scalar <$in>; }
    # close the tag
    my $faulty_line=<$in>;
    # the reported column seems to be off by 3, but I suspect this might
    # vary depending on the xml prefix, so this looks safer: search backwards
    # from the reported column for the '<' of the mismatched end tag and
    # insert the missing close tag right in front of it
    my $real_column= rindex( $faulty_line, '<', $column);
    $real_column= 0 if( $real_column < 0); # no '<' found: insert at line start
    substr( $faulty_line, $real_column, 0)= "</$tag>\n";
    print {$out} $faulty_line;
    # finish printing
    while( <$in>) { print {$out} $_; }
    close $in; close $out;
    rename $temp, $file or die "cannot replace file '$file' by new version in '$temp': $!\n";
  }
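
To make the insertion step easier to follow on its own, here is a minimal sketch of just the rindex/substr part, working on a string instead of a file. The helper name insert_close_tag, the sample line and the column value 12 are made up for illustration, and the sketch assumes the reported column falls somewhere inside the mismatched end tag. It inserts the closing tag immediately in front of the '<' it finds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (illustration only): insert "</$tag>\n" in front of
# the last '<' found at or before $column in $line.
sub insert_close_tag
  { my( $line, $tag, $column)= @_;
    my $pos= rindex( $line, '<', $column);  # search backwards from $column
    $pos= 0 if( $pos < 0);                  # no '<' found: insert at line start
    substr( $line, $pos, 0)= "</$tag>\n";   # zero-length lvalue substr inserts
    return $line;
  }

# <c> is still open when </b> shows up; column 12 is assumed to point
# inside the "</b>" end tag
my $fixed= insert_close_tag( "<b><c>text</b>\n", 'c', 12);
print $fixed;
```

Searching backwards for the '<' rather than trusting the column exactly means the fix does not depend on where inside the end tag the parser says the error is.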

In reply to Re: Repair malformed XML by mirod
in thread Repair malformed XML by spoulson