Parsing with SAX an XML Document with no Root Node

arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Hi... I'm working with the XML::Parser module to implement a SAX handler. My XML snippet is:

<comp id="GWBC18827">
<nam type="preferred">NAD(P)H</nam>
<nam type="alias">NAD(P) H</nam>
<specific>GWBC41</specific>
<specific>GWBC43</specific>
</comp>
<comp id="GWBC43">
<nam type="preferred">NADPH</nam>
<nam type="alias">TPNH</nam>
<cas>2646-71-1</cas>
<for>C21H30N7O17P3</for>
<gen>NAD(P)H</gen>
<smi>[C@H]1(O[C@@H]([C@@H](O)[C@H]1OP(=O)([O-])[O-])COP(=O)(OP(=O)(OC[C@H]2[C@@H](O)[C@@H](O)[C@@H](O2)N3C=C(CC=C3)C(=O)N)[O-])[O-])[N+]4=C5C(=NC4)C(=NC=N5)N</smi>
<smi>NC(=O)C1=CN(C=CC1)[C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@H]([C@H](OP(=O)(O)O)[C@@H]3O)n4cnc5c(N)ncnc45)[C@@H](O)[C@H]2O</smi>
<ref name="EMP">C139</ref>
<ref name="Klotho">KLM0000287</ref>
<ref name="Kegg">C00005</ref>
<ref name="brenda">no ref</ref>
</comp>
<comp id="GWBC41">
<nam type="preferred">NADH</nam>
<nam type="alias">DPNH</nam>
<for>C21H29N7O14P2</for>
<gen>NAD(P)H</gen>
<smi>[C@H]1(O[C@@H]([C@@H](O)[C@H]1O)COP(=O)(OP(=O)(OC[C@H]2[C@@H](O)[C@@H](O)[C@@H](O2)N3C=CCC(=C3)C(=O)N)[O-])[O-])[N+]4=C5C(=NC4)C(=NC=N5)N</smi>
<ref name="EMP">C136</ref>
<ref name="Klotho">KLM0000285</ref>
<ref name="Kegg">C00004</ref>
</comp>

... and the code I wrote to parse it is here:

use strict;
use warnings;
use XML::Parser;

# Name of the element currently being processed
my $currentTag = "";

# Create new parser and set callbacks
my $xmlp = XML::Parser->new;
$xmlp->setHandlers(
                    Start => \&start,
                    End   => \&end,
                    Char  => \&cdata
                  );

# Parse the file
$xmlp->parsefile("../thes_ex.xml");


# Called when a start tag is seen
sub start
{
    # extract variables
    my ($parser, $name, %attr) = @_;
    $currentTag = lc($name);

    print "start $currentTag\n";
}


# Called when an end tag is seen
sub end
{
    # extract variables
    my ($parser, $name) = @_;
    $currentTag = lc($name);

    print "ended $currentTag\n";

    # clear value of current tag
    $currentTag = "";
}


# Called for character data (the text between tags); the parser
# may call this several times for a single run of text
sub cdata
{
    my ($parser, $data) = @_;

    print "cdata: $data\n";
}

Given the XML above, the parser reports the following error _after_ the first <comp>...</comp> block: "junk after document element, line 7".

Clearly this refers to the lack of a single root node, and as I don't generate the XML I have little control over that. Does anyone know how I can stop this error? I have tried adding a root tag that encloses the whole file, and this fixes the problem, but it is not an option in my case.

Best wishes, Arun


Re: Parsing with SAX an XML Document with no Root Node by mirod (Canon) on May 09, 2002 at 11:29 UTC
Why is adding a root tag around the data (which is not XML, BTW, as it does not parse) not an option? You could use a pipe or create a temp file to hold the real XML: `open( XML, qq{echo "<doc>"; cat $file; echo "</doc>" |}) or die "could not open file: $!"; $p->parse( \*XML);` BTW, XML::Parser is not a SAX parser; see XML::SAX::Intro for more details.
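For readers who would rather not shell out, the same wrapping idea can be sketched entirely in Perl. This is only an illustration, not code from the reply: `wrap_in_root` and `element_names` are hypothetical helper names, the `<doc>` root is arbitrary, and the sketch assumes the input has no XML declaration or DOCTYPE of its own (either one would end up inside the synthetic root and break the parse).

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: wrap rootless markup in a synthetic root element.
# Assumes the text carries no <?xml ...?> declaration or DOCTYPE.
sub wrap_in_root {
    my ($fragments, $root) = @_;
    $root = 'doc' unless defined $root;
    return "<$root>\n$fragments\n</$root>\n";
}

# Hypothetical helper: parse a string and return the element names
# in the order their start tags are seen.
sub element_names {
    my ($xml) = @_;
    my @names;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($p, $name) = @_; push @names, $name },
        },
    );
    $parser->parse($xml);    # dies if the text is not well-formed
    return @names;
}

# Two sibling <comp> elements with no common root, as in the question.
my $sample = qq{<comp id="GWBC41"><nam type="preferred">NADH</nam></comp>\n}
           . qq{<comp id="GWBC43"><nam type="preferred">NADPH</nam></comp>};

print join(' ', element_names(wrap_in_root($sample))), "\n";
# prints: doc comp nam comp nam
```

The handlers see one extra element (the synthetic root) and can simply ignore it. Because a line is added in front of the data, line numbers in any parser error messages are shifted by one relative to the original file.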
Re: Parsing with SAX an XML Document with no Root Node by ajt (Prior) on May 09, 2002 at 11:38 UTC
Your problem is that your input files are not XML: XML files MUST have exactly one root node. All XML parsers are required to terminate if the input is not well-formed, and validating parsers also fail if the file is invalid. Your file is not well-formed, and no XML parser will parse it, ever! The only way to make this work is to cut your file up into logical chunks, each in its own `<comp>` node, or, as you suggest, to put the whole file into its own single root node. Whichever way you do it, you must do it before you feed the data into the parser. Why can't you load the file, wrap it in a single root node and parse that? It is easy to do. Is there any reason why you can't load the file and send each `<comp>` to the parser one at a time? You may also wish to look at Tidy and xmllint, which are tools that try to fix bad XML/HTML mark-up.
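The "one `<comp>` at a time" suggestion could look roughly like the sketch below. It is an illustration under stated assumptions rather than a tested recipe for the real file: `split_comps` and `parse_each_comp` are hypothetical names, the splitting uses a plain regex, and it assumes that `<comp>` elements never nest, are never self-closing, and that the string `</comp>` does not occur inside comments or CDATA sections.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: cut rootless text into one string per <comp> element.
# Assumes <comp> elements do not nest and are not self-closing.
sub split_comps {
    my ($text) = @_;
    return $text =~ m{(<comp\b.*?</comp\s*>)}gs;
}

# Hypothetical helper: parse every chunk as its own small document
# and collect the id attribute of each <comp>.
sub parse_each_comp {
    my ($text) = @_;
    my @ids;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($p, $name, %attr) = @_;
                push @ids, $attr{id} if $name eq 'comp';
            },
        },
    );
    # Each chunk has exactly one root element, so each parse succeeds.
    $parser->parse($_) for split_comps($text);
    return @ids;
}

my $sample = qq{<comp id="GWBC41"><nam type="preferred">NADH</nam></comp>\n}
           . qq{<comp id="GWBC43"><nam type="preferred">NADPH</nam></comp>\n};

print join(' ', parse_each_comp($sample)), "\n";
# prints: GWBC41 GWBC43
```

Compared with wrapping the whole file in a root node, this keeps one bad record from aborting the entire run if each `parse` call is placed in an `eval`, at the cost of relying on a regex to find the record boundaries.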