Regular expressions have their uses. SGML parsing is not one of them. You've already found one of the situations where it simply doesn't work (for another, try embedded tables). It's even worse when you try to deal with badly formatted HTML, and there's a whole lot of it out there, thanks to incorrectly written WYSIWYG editors and 'webmasters' who have no idea what HTML is.
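
To make the embedded-table case concrete, here is a minimal sketch. It's in Python rather than Perl purely for illustration, and the sample markup and variable names are made up for this example, not taken from the original question. It shows the two usual failure modes: a non-greedy match stops at the inner table's closing tag, and a greedy match swallows everything between two unrelated tables.

```python
import re

# Hypothetical sample markup: one table nested inside another.
nested = "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>"

# Non-greedy: the match ends at the FIRST </table>, which closes the inner
# table, so the "outer table" we captured is cut off mid-structure.
non_greedy = re.search(r"<table>(.*?)</table>", nested, re.S).group(0)

# Hypothetical sample markup: two sibling tables with a paragraph between them.
siblings = "<table>a</table><p>between</p><table>b</table>"

# Greedy: the match runs to the LAST </table>, so both tables and the
# paragraph between them come back as a single "table".
greedy = re.search(r"<table>(.*)</table>", siblings, re.S).group(0)

print(non_greedy)
print(greedy)
```

Neither quantifier can count how many tables are currently open, which is the piece of state a parser keeps and a plain pattern like this doesn't.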

Would you care to explain your reasons for not wanting to use existing parsers? There may be other ways to solve your problem.

(Personally, I'd try to build a tree if I knew I was always going to be working with well-formed SGML, but you haven't even mentioned why you're trying to do this.)
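
For what building a tree might look like, here is a rough sketch, again in Python for illustration only. It leans on the standard library's `html.parser` tokenizer (an HTML tokenizer, not a real SGML parser), and the `Node`, `TreeBuilder`, `build_tree`, and `depth` names are hypothetical helpers invented for this example. It assumes well-formed input where every start tag has a matching end tag, and it raises an error rather than guessing when that assumption fails.

```python
from html.parser import HTMLParser


class Node:
    """One element in the tree; children holds Node objects and text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree by tracking the currently open element."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Open a new element under the current one and descend into it.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the end tag must close the
        # element that is currently open. Anything else is an error.
        if self.current.tag != tag:
            raise ValueError(f"unexpected </{tag}> inside <{self.current.tag}>")
        self.current = self.current.parent

    def handle_data(self, data):
        # Keep non-whitespace text as a leaf of the current element.
        if data.strip():
            self.current.children.append(data)


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def depth(node):
    """Number of levels from this node down to its deepest element."""
    kids = [c for c in node.children if isinstance(c, Node)]
    return 1 + max((depth(k) for k in kids), default=0)


tree = build_tree(
    "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>"
)
outer = tree.children[0]
print(outer.tag, depth(tree))
```

With a tree in hand, the inner table is just a descendant of the outer one, so nesting stops being a special case. Note the well-formedness assumption is doing a lot of work: unclosed tags such as a bare `<br>` would trip the check above.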


In reply to Re: regexp text parsing issue. by jhourcle
in thread regexp text parsing issue. by Anonymous Monk
