
Perl Monks, I have a 13GB XML file that I'm trying to parse using XML::Twig. About halfway through the file (line 88,144,211 of 172,881,183), parsing stopped with this error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54

I used sed to pull the not well-formed line from the XML file, along with the preceding 11 lines, which I've pasted here prefixed with the line numbers. I think it's actually line 88144210, which is cut off after "<es", that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>

The problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to ignore the not well-formed lines and have my script continue parsing the XML file? Here is my script:

use strict;
use warnings;
use XML::Twig;

# Elements to extract from the input file
my @moList = qw(
  es:vsDataExternalENodeBFunction
  es:vsDataTermPointToENB
  es:vsDataEUtranCellFDD
  es:vsDataEUtranFreqRelation
  es:vsDataEUtranCellRelation
  es:vsDataENodeBFunction
  es:vsDataExternalEUtranCellFDD
);

# Subroutine declarations
sub handle_mo; sub usage;

# Parameter hash, filled from key=value command-line arguments
my %param;
foreach my $item (@ARGV){
  # Limit of 2 keeps any further "=" inside the value
  my($key,$value) = split /=/, $item, 2;
  $param{$key} = $value;
}
# Required parameters
if (!defined($param{"path"})){
  usage;
  die "No path defined\n";
}
if (!defined($param{"file"})){
  usage;
  die "No xml file defined\n";
}

my $path2xml = $param{"path"};
my $filename = $param{"file"};

# Call handle_mo for each listed element; everything else is skipped
my %handlers = map {$_ => \&handle_mo} @moList;
my $twig = XML::Twig->new( twig_roots => \%handlers);

$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
  print "Usage:\n  xml_parser path=<directory> file=<xml_file>\n";
}
sub handle_mo {
  my ( $t, $elt) = @_;
  # Print the element to STDOUT, then a newline
  $elt->print;
  print "\n";
  # Free the memory used by everything parsed so far
  $t->purge;
}

Which I call using: ./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large XML files on a daily basis, and each one takes roughly 4 hours to parse on my current system. I never know when one of these malformed lines will appear, and they are rare. Since there is only one bad line in a 172,881,183-line XML file, is there a way for my parser to ignore such lines rather than throwing the error?
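
One workaround I can imagine, since an XML parser cannot carry on past a well-formedness error, is to drop the suspicious lines before the data ever reaches the parser. Below is a minimal sketch of that idea, not a tested solution for my real files. The names has_unclosed_tag and filter_stream are made up for illustration, and it assumes that every tag, comment, and CDATA section in the file starts and ends on the same line, as in the excerpt above. It would only help when the damage is confined to the dropped line; if the truncation also swallowed a closing tag, the parser would still stop with a mismatched-tag error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 when a '<' is left over after every complete '<...>' tag
# has been removed from the line, i.e. a tag was opened but never closed.
# Assumption: no tag, comment, or CDATA section spans multiple lines.
sub has_unclosed_tag {
    my ($line) = @_;
    (my $rest = $line) =~ s/<[^<>]*>//g;
    return $rest =~ /</ ? 1 : 0;
}

# Copy $in to $out, skipping suspicious lines and reporting them on STDERR.
# Returns the number of dropped lines.
sub filter_stream {
    my ($in, $out) = @_;
    my $dropped = 0;
    while (my $line = <$in>) {
        if (has_unclosed_tag($line)) {
            warn "Dropping line $.: $line";
            $dropped++;
            next;
        }
        print {$out} $line;
    }
    return $dropped;
}

# Demonstration on three lines taken from the excerpt above,
# using in-memory filehandles instead of the real 13GB file.
my $sample = join "\n",
    '<es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>',
    '<es',
    '<es:lbQciProfileHandling>1</es:lbQciProfileHandling>',
    '';
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my $kept = '';
open my $out, '>', \$kept or die "Cannot open in-memory output: $!";
my $dropped = filter_stream($in, $out);
close $out;
close $in;

print $kept;
print STDERR "Dropped $dropped line(s)\n";
```

For the real file, the in-memory handles would be replaced with \*STDIN and \*STDOUT so the filter can sit in a pipe in front of the parser. As far as I understand, the parse method that XML::Twig inherits from XML::Parser accepts an open filehandle, so the script could call $twig->parse(\*STDIN) instead of parsefile and read the filtered stream without a second 13GB copy on disk.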

In reply to Ignoring not well-formed (invalid token) errors by brettski
