in reply to Repair malformed XML
So here's how I would do it: read through the file one tag at a time (set the input-record-separator to ">"), maintain a stack of the open tags as they occur, pop them off the stack as their corresponding close-tags appear, and fill in missing close-tags where necessary.
(This way of reading might be noticeably inefficient on large data sets, and there are ways to chunk through the data with fewer iterations on the diamond operator; but see if it works well enough before thinking about optimizing it.)
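For what it's worth, here is a rough sketch of one way that chunking could look: read fixed-size blocks, cut each block at its last ">", and hand the complete records to a callback. The names `each_record`, `$callback` and `$blocksize` are made up for illustration and are not part of the script in this post. I have not benchmarked it, so treat any speed-up as an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: call $callback once for each '>'-terminated record,
# refilling the buffer from $fh in blocks of $blocksize bytes.
sub each_record {
    my ( $fh, $callback, $blocksize ) = @_;
    $blocksize ||= 65536;
    my $buffer = '';
    while (1) {
        # append the next block to whatever was left over last time
        my $got = read( $fh, $buffer, $blocksize, length $buffer );
        die "read failed: $!" unless defined $got;
        last if $got == 0;                       # end of input
        my $last = rindex( $buffer, '>' );
        next if $last < 0;                       # no complete tag yet
        # remove everything up to the last '>' from the buffer...
        my $complete = substr( $buffer, 0, $last + 1, '' );
        # ...and split it into the records that $/ = '>' would give
        $callback->($_) for $complete =~ /[^>]*>/g;
    }
    # trailing text after the final '>' (the "must be at eof" case)
    $callback->($buffer) if length $buffer;
    return;
}

# Demo on an in-memory filehandle, with a tiny block size so that
# records straddle block boundaries.
my $sample = "<a><b>text</a>\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @records;
each_record( $fh, sub { push @records, $_[0] }, 4 );
print scalar(@records), " records read\n";
```

The callback receives the same strings the `while (<>)` loop would see in `$_`, so the tag-stack logic could move into it more or less unchanged.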
I tested this on the examples you gave -- the good ones came out unaltered, and the bad one had the close-tag added where needed.

    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = '>';  # read up to the end of one tag at a time
    my $lineno = 0;
    my @tagstack;

    while (<>) {
        $lineno += tr/\n//;
        unless ( />$/ ) {  # must be at eof;
            print;
            next;
        }
        my ( $pre, $tag ) = ( /([^<]*)(<[^>]+?)>/ );
        if ( !defined( $tag )) {
            warn "'>' without prior '<' at or near line $lineno\n";
            print;
        }
        elsif ( $tag =~ m{^</} ) {  # close tag: look for its open tag on the stack
            unless ( @tagstack ) {
                warn "extra close tag '$tag' at or near line $lineno\n";
                print;
                next;
            }
            $tag = substr $tag, 2;
            my $stackindx = $#tagstack;
            while ( $stackindx >= 0 and $tagstack[$stackindx] ne $tag ) {
                $stackindx--;
            }
            if ( $stackindx < 0 ) {
                warn "close tag '$tag' lacks open tag at or near line $lineno\n";
                print;
                next;
            }
            print $pre;
            if ( $stackindx != $#tagstack ) {  # add close tags as needed
                while ( $stackindx < $#tagstack ) {
                    warn "added '</$tagstack[$#tagstack]>' at line $lineno\n";
                    printf "</%s>\n", pop @tagstack;
                    $lineno++;
                }
            }
            print "</$tag>";
            pop @tagstack;
        }
        elsif ( $tag =~ m{^<!} or $tag =~ m{/$} ) {  # "comment" or empty tag
            print;
        }
        else {  # this must be an open tag -- push it on the stack
            $tag =~ s/<(\S+).*/$1/s;
            push @tagstack, $tag;
            print;
        }
    }