ecuguru has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got badly formatted XML files that don't have an over-arching root tag:
<record> <name>NAME1</name> </record> <record> <name>NAME2</name> </record>

NOTE-> That's the whole file. There isn't any top most or bottom most tag to encapsulate the above XML.

The problem is, when I go to parse it, the parser treats everything after the first record as junk. For XML parsing, I was thinking I could write a new temp file with a starting tag, write the original XML file into the temp file, and then append the closing tag. Parse it, then erase it.

I'm not sure if that's the best approach or not; if there is a better way to do it, I'm all ears. I'm less worried about getting the code itself than about whether this is the best approach.
Thanks!

Replies are listed 'Best First'.
Re: XML Cleanup
by mirod (Canon) on May 23, 2008 at 10:08 UTC

    There are many ways to do this, but the one I usually prefer is to take advantage of XML entities: what I parse is <!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$doc_file">]><doc>&real_doc;</doc>. The XML string references the real file through the entity.

    This way you don't need a temporary file, and you don't touch your original data. Plus you get XML cred, amaze your friends, impress your boss... why would you do it any other way?

    You can even check that it works by running this code, which tests it with both XML::Parser and XML::LibXML:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use XML::Parser;
    use XML::LibXML;

    my $doc_file = shift @ARGV;

    # wrapper document: pulls in the real file through an external entity
    my $xml = qq{<!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$doc_file">]><doc>&real_doc;</doc>};

    {
        print "XML::Parser:\n";
        my $t = XML::Parser->new( Style => 'Stream' )->parse($xml);
    }

    {
        print "XML::LibXML:\n";
        my $parser = XML::LibXML->new();
        my $doc    = $parser->parse_string($xml);
        print $doc->toString;
    }

      Neat trick!

Re: XML Cleanup
by moritz (Cardinal) on May 23, 2008 at 09:02 UTC
    For XML parsing, I was thinking I could write a new temp file with a starting tag, write the original xml file into the temp file, and then concat the closing line. Parse it, then erase it.

    To me that sounds as sane as you can be with such broken files.

    Be sure to also include an <?xml ... ?> declaration with the right encoding information; without it the parser assumes UTF-8, and non-ASCII chars in any other encoding will surely break your program.
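
The temp-file approach with an explicit declaration could be sketched roughly as below. This is only an illustration, not tested against the real files: the sub name wrap_in_tempfile, the root element name 'records' and the encoding are all assumptions, and it only works if the source file uses an ASCII-compatible encoding and does not already carry its own XML declaration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();

# Copy a rootless XML file into a temp file, adding an XML declaration
# and a wrapping root element. $root and $encoding are assumptions about
# the data (hypothetical values), the parser cannot guess them for you.
sub wrap_in_tempfile {
    my ( $src_file, $root, $encoding ) = @_;

    # UNLINK => 1 removes the file once $tmp goes out of scope
    my $tmp = File::Temp->new( SUFFIX => '.xml', UNLINK => 1 );

    open my $in, '<:raw', $src_file or die "Cannot read $src_file: $!";
    binmode $tmp, ':raw';    # copy bytes untouched; the declaration names the encoding

    print {$tmp} qq{<?xml version="1.0" encoding="$encoding"?>\n<$root>\n};
    print {$tmp} $_ while <$in>;
    print {$tmp} "\n</$root>\n";

    close $in;
    $tmp->flush;    # make sure everything is written before a parser opens the file
    return $tmp;
}

# Demo with a throwaway input file standing in for the real data
my $src = File::Temp->new( SUFFIX => '.xml' );
print {$src} "<record><name>NAME1</name></record>\n<record><name>NAME2</name></record>\n";
$src->flush;

my $wrapped = wrap_in_tempfile( $src->filename, 'records', 'UTF-8' );
print "Wrapped copy written to ", $wrapped->filename, "\n";
```

The wrapped file name can then be handed to whichever parser you use; keeping $wrapped in scope until parsing is done takes care of the "erase it" step.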

Re: XML Cleanup
by psini (Deacon) on May 23, 2008 at 09:23 UTC

    If your XML file is not huge, you could consider reading it into a variable, prepending/appending the opening/closing root tags, and then parsing it. Most (if not all) XML parsers allow parsing from a string variable.
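
A minimal sketch of the in-memory variant, assuming XML::LibXML is installed. The sub names slurp and parse_rootless and the root name 'records' are made up for illustration, and the input is assumed not to contain its own XML declaration or DOCTYPE (either would end up inside the wrapper and break the parse).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Read a whole file into a scalar (fine as long as the file is not huge).
sub slurp {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot read $file: $!";
    local $/;    # undef record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Wrap rootless XML text in a root element and parse the result.
# The root name is arbitrary; 'records' below is only an example.
sub parse_rootless {
    my ( $xml_text, $root ) = @_;
    return XML::LibXML->load_xml( string => "<$root>$xml_text</$root>" );
}

# Demo on the sample from the question; for a real file use
# parse_rootless( slurp($path), 'records' ) instead.
my $sample = '<record> <name>NAME1</name> </record> <record> <name>NAME2</name> </record>';
my $doc    = parse_rootless( $sample, 'records' );
print $_->textContent, "\n" for $doc->findnodes('/records/record/name');
```

With no declaration in the string the parser treats the bytes as UTF-8, so the encoding caveat from the other reply applies here as well.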

    Rule One: Do not act incautiously when confronting a little bald wrinkly smiling man.