PerlMonks  

XML::Parser and multiple results

by Ineffectual (Scribe)
on Feb 19, 2004 at 00:57 UTC ( [id://330102]=perlquestion )

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that contains multiple XML output results, with the typical format of:

<output>
    <program></program>
    <version></version>
    <information></information>
    <data></data>
</output>
<output 2>
etc..

XML::Parser dies with the error "junk after document element at line X <output 2> at XML/Parser.pm line 186".
Is there a way to get XML::Parser to continue down the file or do I need to split the file into multiple files? (That would be a pain because there are roughly 46k of these results.) Any other ideas of what I can do to not have to split this into 46k different files to parse through it? I'm fairly attached to XML::Parser, so a solution with that would be best. Thanks for the help.
Ineff

Replies are listed 'Best First'.
Re: XML::Parser and multiple results
by leriksen (Curate) on Feb 19, 2004 at 01:28 UTC
    Try wrapping them in a <results> tag, and either add id attributes or remove the numbers:
    <results>
        <output>
            <program></program>
            <version></version>
            <information></information>
            <data></data>
        </output>
        <output id="2"> ... </output>
    </results>
    You're only allowed one root element in XML, IIRC.

    +++++++++++++++++
    #!/usr/bin/perl
    use warnings;use strict;use brain;
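
A minimal sketch of the wrapping idea, done in memory so the file on disk stays as it is. The sample data, the element values and the @programs array are made up for illustration. It also assumes the individual results carry no <?xml ...?> declaration, because a declaration in the middle of the wrapped document would be an error.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample standing in for the real file: two complete
# documents back to back, which XML::Parser rejects as one document.
my $multi = join '',
    "<output><program>blast</program><version>1.0</version></output>\n",
    "<output><program>fasta</program><version>2.1</version></output>\n";

my @programs;          # text of every <program> element, in document order
my $in_program = 0;    # true while we are inside a <program> element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'program') {
                $in_program = 1;
                push @programs, '';
            }
        },
        End => sub {
            my ($expat, $element) = @_;
            $in_program = 0 if $element eq 'program';
        },
        Char => sub {
            my ($expat, $text) = @_;
            # Char may be called several times per element, so append.
            $programs[-1] .= $text if $in_program;
        },
    },
);

# Supply the single root element around the concatenated results.
$parser->parse("<results>\n$multi</results>\n");

print "$_\n" for @programs;
```

For a real file you would slurp its contents into $multi first; whether that is comfortable depends on how large 46k results turn out to be.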

Re: XML::Parser and multiple results
by borisz (Canon) on Feb 19, 2004 at 01:08 UTC
    Maybe I have this wrong, but if you have
    <output> ... </output> <output 2> </output 2>
    that is not XML. What you can try is to transform your ill-formed XML into valid XML and parse it again.
    <output> ... </output> <output> </output>
    This would be fine. Or, if you need the number:
    <output> ... </output> <output id="2"> </output>
    Boris
Re: XML::Parser and multiple results
by kvale (Monsignor) on Feb 19, 2004 at 01:12 UTC
    Hmm, I don't think <output 2> is a valid element in XML. Whitespace separates the element name from its attributes, and 2 is not a valid attribute.

    XML elements must follow these naming rules:

    Names can contain letters, numbers, and other characters
    Names must not start with a number or punctuation character
    Names must not start with the letters xml (or XML or Xml ..)
    Names cannot contain spaces

    So it would be best to fix the XML first.

    -Mark

      I was giving example elements and not real elements. The XML is correctly formed other than the fact that there are multiple results in the same file. Each Output on its own is a valid file. I don't really want to pollute my filesystem with 46k -> 100k individual files that are clunky to move around, so I'm trying to figure out if there's a way to parse one file instead of splitting each output into its own file. From what I can see from the responses, the only way would be to have perl split the big file into a hash or array and feed that into XML::Parser as individual "files"... Does this seem to be correct?

      Ineff

        Does this XML need to validate against a schema? If not, could you prepend an opening tag and append a closing tag, wrapping multiple nodes in one parent node?

        From what I can see from the responses, the only way would be to have perl split the big file into a hash or array and feed that into XML::Parser as individual "files"... Does this seem to be correct?
        No, it is perfectly valid to have any number of tags there, provided your output is valid. But XML::Parser stops, so I doubt it is valid.
        Boris
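
For what it's worth, the "split it in Perl and feed the pieces to XML::Parser" idea asked about above does not need 46k files on disk. A rough sketch, assuming every result ends with a literal </output> and that this string never appears nested, in a comment, or in CDATA. The sample data is invented, and the in-memory filehandle only stands in for the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical stand-in for the big file of concatenated results.
my $multi = join '',
    "<output><program>blast</program></output>\n",
    "<output><program>fasta</program></output>\n";

open my $fh, '<', \$multi or die "cannot open in-memory file: $!";

# Tree style makes parse() return a nested array: [ tag, content ].
my $parser = XML::Parser->new(Style => 'Tree');
my @trees;

{
    # Read one record per result: everything up to and including
    # the closing tag of the root element.
    local $/ = '</output>';
    while (my $chunk = <$fh>) {
        next unless $chunk =~ /\S/;    # whitespace left after the last result
        push @trees, $parser->parse($chunk);
    }
}
close $fh;

printf "parsed %d documents, first root is <%s>\n",
    scalar @trees, $trees[0][0];
```

Only one result is held as a string at any time, so memory use depends mostly on what you keep from each parse rather than on the size of the file.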
Re: XML::Parser and multiple results
by mirod (Canon) on Feb 19, 2004 at 08:39 UTC

    An XML document can only have 1 (one) root. Hence your XML is not valid.

    Fortunately XML::Parser has a Stream_Delimiter option:

    * Stream_Delimiter
                   This is an Expat option. It takes a string value. When this
                   string is found alone on a line while parsing from a stream,
                   then the parse is ended as if it saw an end of file. The
                   intended use is with a stream of xml documents in a MIME
                   multipart format. The string should not contain a trailing
                   newline.
    
      I'm not really sure what this would do. If I put a stream delimiter at the end of my really big file, will it parse all the way through it? If so, could I use that to load the entire file into a really big hash? Thanks. Ineff

        You don't have 1 XML document; as it is, you have a number of them in a single file. You could insert stream delimiters between those documents to get XML::Parser to understand that they should be treated as separate XML documents.

        You could also wrap them all in a root tag, either in the main file, or by using an entity that includes it, see Re: XML log files.

        Oh, and I found a FAQ about it in the XML::Twig FAQ :--)

        You really have to understand that at this point you don't have XML. If it doesn't parse, then it is not XML. If you want to have XML you have to get your data to be XML, or to use the mildly hacky stream delimiter option.
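
A sketch of how the Stream_Delimiter route might look once a delimiter line has been inserted after each document. The delimiter string, the sample data and the in-memory filehandle are all made up for illustration; the delimiter has to sit alone on its own line and must not occur in the data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical delimiter: any string that never occurs in the data.
my $delim = '--END-OF-DOCUMENT--';

# Hypothetical stand-in for the big file, with a delimiter line
# after each document.
my $stream = join '',
    "<output><program>blast</program></output>\n",
    "$delim\n",
    "<output><program>fasta</program></output>\n",
    "$delim\n";

open my $fh, '<', \$stream or die "cannot open in-memory file: $!";

my $parser = XML::Parser->new(
    Style            => 'Tree',    # parse() returns [ tag, content ]
    Stream_Delimiter => $delim,
);

# Each parse() call consumes one document and stops at the delimiter,
# leaving the filehandle positioned at the start of the next one.
my @trees;
until (eof $fh) {
    push @trees, $parser->parse($fh);
}
close $fh;

printf "parsed %d documents\n", scalar @trees;
```

The eof check matters: calling parse on an exhausted filehandle hands the parser an empty document, which it reports as an error.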
