Perl Monks, I have a 13GB XML file that I'm trying to parse with XML::Twig. After getting about halfway through the file (88,144,211 lines of 172,881,183), the parser stopped and threw this error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54

I used sed to pull the offending line from the XML file, along with the 11 lines preceding it, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>
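To check whether other lines are cut off the same way as line 88144210, a rough scan along these lines could help. This is only a sketch: `find_unclosed_tags` is a name made up for this example, and the pattern assumes every tag opens and closes on the same line, as in the excerpt above. A tag that legitimately spans several lines would be reported too.

```shell
# Hypothetical helper (not part of any tool mentioned in the post).
# Prints "LINE_NUMBER:line" for every line whose last "<" is never
# followed by a closing ">" on that same line.
find_unclosed_tags() {
    grep -n '<[^>]*$' "$1"
}
```

Running `find_unclosed_tags huge_xml_file.xml` would list each suspect line prefixed with its line number, in the same spirit as the sed excerpt.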

The problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to ignore the not well-formed lines and have my script continue parsing the rest of the file? Here is my script.

use strict;
use XML::Twig;

my @moList = qw(
    es:vsDataExternalENodeBFunction
    es:vsDataTermPointToENB
    es:vsDataEUtranCellFDD
    es:vsDataEUtranFreqRelation
    es:vsDataEUtranCellRelation
    es:vsDataENodeBFunction
    es:vsDataExternalEUtranCellFDD
);

# Subroutine declarations
sub handle_mo;
sub usage;

# Parameter hash
my($key,$value,%param);
foreach my $item (@ARGV){
    my($key,$value) = split /=/, $item;
    $param{$key} = $value;
}

# Required parameters
if (!defined($param{"path"})){
    usage;
    die "No path defined\n";
}
if (!defined($param{"file"})){
    usage;
    die "No xml file defined\n";
}

my $path2xml = $param{"path"};
my $filename = $param{"file"};

my %handlers = map {$_ => \&handle_mo} @moList;
my $twig = new XML::Twig( twig_roots => \%handlers);

$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
    print "Usage:\n xml_parser path=<directory> file=<xml_file>\n";
}

sub handle_mo {
    my ( $t, $elt) = @_;
    print $elt->print, "\n";
    $t->purge;
}

I call the script like this:

./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large XML files daily, and each one takes roughly 4 hours to parse on my current system. I never know when one of these malformed lines will appear, though they are rare: there is only one bad line in this 172,881,183-line file. Is there a way for my parser to skip such lines rather than throwing the error?
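One possible workaround is to filter the file before it reaches the parser. As far as I know, expat-based parsers such as XML::Parser stop at the first well-formedness error, so the bad lines would have to be removed up front. The sketch below makes several assumptions: `drop_unclosed_tag_lines` is a hypothetical helper, every tag is expected to open and close on one line, and `/dev/stderr` must be available to awk. Dropping a line only helps if the truncated element was a self-contained leaf; otherwise the parser would still fail further on.

```shell
# Hypothetical filter: copies stdin to stdout, skipping any line whose
# last "<" is never closed by ">" on that line. The number of each
# skipped line is reported on stderr so the loss is visible.
drop_unclosed_tag_lines() {
    awk '
        /<[^>]*$/ { print "dropped line " NR > "/dev/stderr"; next }
        { print }
    '
}
```

A run might then look like `drop_unclosed_tag_lines < /tmp/huge_xml_file.xml > /tmp/huge_xml_file.filtered.xml`, followed by the usual call with `file=huge_xml_file.filtered.xml`. That costs an extra pass over the file and temporary disk space for a second copy.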


In reply to Ignoring not well-formed (invalid token) errors by brettski
