Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to implement an XML file parser in Perl. However, Perl seems to read the file as if it were a .TXT file, where one line in the .XML file is being seen as several lines in the .TXT file. When the XML file is viewed using Internet Explorer the line appears as one line, but if I open the same file using Notepad the one line appears as several lines. So my parser seems to be reading the XML file as a text file, and I'm not getting the whole input line to be parsed. Any ideas?

Re: Reading a .XML file
by Joost (Canon) on Aug 12, 2002 at 10:32 UTC
    I am trying to implement an XML file parser in Perl.
    You do know there are already several XML modules available for you?
    However, Perl seems to read the file as if it were a .TXT file
    On Windows, Perl reads a file in either text mode (the default, where CRLF line endings are translated to "\n") or binary mode (after calling binmode); on Unix there is no difference between the two. There are some changes in 5.8 (the new PerlIO layer), but you are probably not using that.
    where one line in the .XML file is being seen as several lines in the .TXT file. When the XML file is viewed using Internet Explorer the line appears as one line, but if I open the same file using Notepad the one line appears as several lines
    Internet Explorer probably sticks several lines together when showing an XML file. There ARE several lines in the file, and that's how Perl reads it. Don't be fooled by IE.
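    If you want to see this for yourself, here is a minimal sketch using only core Perl. The sample string is made up for illustration, and the in-memory filehandle assumes Perl 5.8 or later; with a real file you would open its name instead. It reads the same data once line by line and once in one go by locally undefining the input record separator $/.

```perl
use strict;
use warnings;

# Made-up sample: one XML element whose attributes span three physical lines.
my $xml = "<weird a='1'\n       b='2'\n       c='3' />\n";

# Line-by-line read: every physical line comes back as a separate record.
open my $fh, '<', \$xml or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

# Slurp: with $/ undefined, a single read returns the whole file.
open $fh, '<', \$xml or die "Cannot open in-memory file: $!";
my $whole = do { local $/; <$fh> };
close $fh;

printf "read %d separate lines\n", scalar @lines;
print "slurped text matches the original\n" if $whole eq $xml;
```

    Slurping only gets you the whole text in one string. You still need a real parser to make sense of it.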
    Any ideas?
    Parsing XML right is more difficult than you may think. I would suggest using XML::Simple for small files (that you can keep in memory), and XML::Parser and relatives for bigger ones.
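    A minimal sketch of the XML::Simple route, assuming the module has been installed from CPAN (it is not part of core Perl). The sample document is made up for illustration. The module is loaded at run time so that a missing install produces a readable message instead of a compile error.

```perl
use strict;
use warnings;

# Made-up sample document; the attributes of <weird> span two lines.
my $xml = <<'XML';
<foo bar="baz">
  <weird a='1'
         b='2' />
  <nested>text</nested>
</foo>
XML

# XMLin accepts either a filename or a string containing XML.
my $data = eval {
    require XML::Simple;
    XML::Simple->new->XMLin($xml);
};

if (defined $data) {
    # By default the root element is dropped and its contents become a hash.
    print "bar     = $data->{bar}\n";
    print "nested  = $data->{nested}\n";
    print "weird.b = $data->{weird}{b}\n";
} else {
    print "XML::Simple not available or parse failed: $@\n";
}
```

    Note that the line break inside the <weird> tag makes no difference to the result.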

    Hope this helps,
    Joost.

    -- Joost downtime n. The period during which a system is error-free and immune from user input.
Re: Reading a .XML file
by demerphq (Chancellor) on Aug 12, 2002 at 10:38 UTC
    Well, there are already a number of XML parsers available for Perl.

    XML::Parser, XML::TreeBuilder, XML::Simple; in fact, when I search CPAN for modules that match /^XML/ I get no fewer than 341 matches!

    And given the nature of your question, I suspect (no offense intended) that you aren't going to come up with something superior.

    To answer your question, however: XML files _are_ text files. That's part of their charm. When they are displayed in IE it renders them in a relatively intuitive and simple format, but the way it renders them may be subtly different from what is actually contained in the file. This includes showing the attributes of a tag on one line. This has nothing to do with how they are stored in a file, nor does it have anything to do with how you open it. An example:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <foo bar="baz">
      <weird a='1'
             b='2'
             c='3' />
      <nested >
        text
      </nested>
      <empty_nest>
      </empty_nest>
    </foo>

    Renders like this in IE:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    - <foo bar="baz">
        <weird a="1" b="2" c="3" />
        <nested>text</nested>
        <empty_nest />
      </foo>

    Note that the <empty_nest></empty_nest> pair has been converted to an "endless" (empty-element) tag, <empty_nest />, so what you see in IE is only an abstract representation of what is in the file.

    Most of this follows from the very nature of XML and markup languages in general. Normally they aren't line oriented but rather stream oriented, where the stream is composed of tags and data. And to be honest, because of this flexibility, writing correct parsers for them is nontrivial.
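    To illustrate what "stream oriented" means, here is a deliberately naive sketch in core Perl that walks a made-up sample as a sequence of tags and text and ignores line breaks entirely. It is not a correct XML parser: comments, CDATA sections, entities and a '>' inside an attribute value would all break it, which is exactly why the CPAN modules exist.

```perl
use strict;
use warnings;

# Made-up sample with tags and text spread over several physical lines.
my $xml = <<'XML';
<foo bar="baz">
  <weird a='1'
         b='2' />
  <nested >
    text
  </nested>
</foo>
XML

my @tokens;
# \G continues from where the previous match ended; each step consumes
# either one tag (<...>) or one run of text up to the next '<'.
while ($xml =~ /\G(?:(<[^>]*>)|([^<]+))/gc) {
    if (defined $1) {
        (my $tag = $1) =~ s/\s+/ /g;      # collapse newlines inside the tag
        push @tokens, [ tag => $tag ];
    } else {
        my $text = $2;
        next unless $text =~ /\S/;        # skip whitespace-only text
        $text =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        push @tokens, [ text => $text ];
    }
}

print "$_->[0]: $_->[1]\n" for @tokens;
```

    The <weird> tag comes out as a single token even though it occupies two lines in the input, because the loop never looks at line boundaries.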

    If this is a project for fun or learning, then you have much research to do. If you are doing this because you didn't know there were already excellent XML parsers, then I would say have a trawl through CPAN and don't waste your time.

    HTH

    Yves / DeMerphq
    ---
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)