Here is another argument for your case:

A regex is a graph. HTML::TreeBuilder and Mojo::DOM produce something very similar but much less complex: a directed, acyclic graph, i.e. the HTML tree, the DOM. Each HTML token/node in that tree is matched by its own separate regex and can conveniently be treated as a black box, to be put aside or switched off like a separate sub(), so to speak. Somebody parsing with a single regex is actually smashing all the black boxes and building everything at the character level: both the identification of the HTML tokens and the HTML syntax tree. That's two different sets of rules put into one logic unit. What's more, the second set of rules distinguishes between tags, attributes, values and content, so it is much higher-level than the first one, and inside one big regex it is much more difficult to retain the meaning of "tag" and re-use it. This is a task of huge complexity. Sooner or later, whoever follows the regex method will either re-discover HTML::TreeBuilder (directly, or indirectly via regex embedded code) or die trying.
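To make the "two sets of rules" point concrete, here is a minimal sketch, assuming Mojo::DOM (from the Mojolicious distribution) is installed. The HTML snippet and the naive regex are made up for illustration. The regex works at the character level and knows nothing about comments, so it reports a link that is not really in the document; the parser has already applied the token-level rules and only sees the real element.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical input: one link is commented out, one is real.
my $html = <<'HTML';
<!-- <a href="/old">old link</a> -->
<a href="/new">new link</a>
HTML

# Character-level approach: a naive regex that has no notion of comments.
my @by_regex = $html =~ /<a\s[^>]*href="([^"]*)"/g;

# Tag-level approach: the tokenizer has already dealt with the comment.
my @by_dom = Mojo::DOM->new($html)->find('a[href]')->map(attr => 'href')->each;

print "regex: @by_regex\n";
print "dom:   @by_dom\n";
```

You can of course extend the regex to skip comments, then CDATA, then script blocks, and so on, which is exactly the road to re-discovering a parser.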

Then, once you have the DOM tree, you can query it as many times as you like, and quite efficiently too, because you are using the right tool: a tree data structure operating at the tag level. Whereas with a regex (correct me if I am wrong here) you must re-parse the same HTML content, at the character level, for each query.
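A small sketch of "parse once, query many times", again assuming Mojo::DOM is available; the document and the selectors are invented for the example. HTML::TreeBuilder would do the same job with its look_down method.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical document, just for illustration.
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body>
<p class="intro">Hello</p>
<p>World</p>
<a href="/one">1</a> <a href="/two">2</a>
</body></html>
HTML

# The character-level work happens once, here.
my $dom = Mojo::DOM->new($html);

# Every query below walks the already-built tree, not the raw text.
my $title   = $dom->at('title')->text;
my $intro   = $dom->at('p.intro')->text;
my $n_paras = $dom->find('p')->size;
my @hrefs   = $dom->find('a[href]')->map(attr => 'href')->each;

print "title=$title intro=$intro paragraphs=$n_paras links=@hrefs\n";
```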

Plus, the TreeBuilder method is easier to reuse, being higher level: the tree can be serialised, saved, reloaded, and passed to a function by reference.
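A rough sketch of that reuse, assuming Mojo::DOM; the count_links helper is a hypothetical name of my own. Here "serialising" simply means rendering the tree back to markup with to_string and parsing that string again later, which is only one of several ways to save a tree.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: receives the DOM object (a reference), so nothing is re-parsed.
sub count_links {
    my ($dom) = @_;
    return $dom->find('a[href]')->size;
}

my $dom = Mojo::DOM->new(
    '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
);

# Serialise the tree to a string; this could be written to a file or a cache.
my $saved = $dom->to_string;

# Later: reload from the saved string and hand the tree to the same helper.
my $reloaded = Mojo::DOM->new($saved);

my $before = count_links($dom);
my $after  = count_links($reloaded);

print "links before: $before, after reload: $after\n";
```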

p.s. something to visualise the herculean task of a regex-engine: https://regexper.com/

bw, bliako


In reply to Re: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks by bliako
in thread Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks by haukex
