It depends whether you're giving or taking.

If you are taking at the output from one single program emitting its own brew of XML, you will usually find that it is always emitted in exactly the same way, often pretty-printed with indented nested elements, or hard wrapped against column zero all the way down.

It is extremely rare (in my experience) to encounter XML emitted by a program that is neatly word-wrapped at or before column 72. After all, that takes a lot more work, and most sane programmers have better things to do with their time. Once you figure out empirically how a given program emits its XML, you can count on it being invariant.

So, as much as it may shock the purists, you can quite easily get away with picking out what you want from a big XML file with a regexp or two, especially if you don't have to worry about context. By that I mean, for example, extracting the contents of element <HG>, if the parent is <BAR> except when the grand parent is <ZONK>

You just need a good test-suite to cover your a.. code, to ensure that things don't break when the source program is upgraded.

You cannot adopt this approach when it is you who has written the XML specification and you're dealing with how people give you their information according to your spec. Everyone will do it differently and you will indeed have to parse it. Update: or you're taking the information from a web service and thus don't have any control or forewarning when the originating program may be upgraded.

That's been my rule so far in dealing with SGML and XML for over 15 years and it has served me well so far.

• another intruder with the mooring in the heart of the Perl


In reply to Re: XML parsing vs Regular expressions by grinder
in thread XML parsing vs Regular expressions by karpatov

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.