Note that parsing HTML/XML using regexen is generally a really bad idea.

The reason that it often works (for some definition of "works") is that few dynamic sites actually build and serialize a DOM tree, instead simply inserting details into (textual) templates. Regexen can match the parts of the output that come from the template, thereby selecting the insertions and extracting the desired information.
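As a rough illustration of matching the template "islands", here is a minimal sketch. The page fragment, the `class` names, and the `extract_items` helper are all hypothetical, invented for this example rather than taken from any real site or from the thread.

```perl
use strict;
use warnings;

# Hypothetical page fragment: the <td class="..."> markup comes from the
# site's template; the text between the tags is the inserted data.
my $html = <<'HTML';
<tr><td class="name">Widget</td><td class="price">$4.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">$12.50</td></tr>
HTML

# Match the fixed template text (the "islands") and capture whatever the
# site inserted between them. extract_items is an illustrative name only.
sub extract_items {
    my ($page) = @_;
    my @items;
    while ($page =~ m{<td class="name">([^<]*)</td><td class="price">\$([0-9.]+)</td>}g) {
        push @items, { name => $1, price => $2 };
    }
    return @items;
}

printf "%s costs %s\n", $_->{name}, $_->{price} for extract_items($html);
```

If the template were changed so that `class="price"` became, say, `class="cost"`, the loop would silently match nothing, which is the kind of breakage to expect from this approach.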

The resulting parsers tend to be fragile, because any change to the template can invalidate the "islands" on which the regex-based scraper relies. They can still be suitable for tools that are needed quickly and only for the short term, or where the inconvenience of adapting the tool when the site changes is acceptable. The upside is that regex-based parsers are relatively easy to write from inspecting the HTML page source, without requiring knowledge of DOM structure and handling, which gives them a lower "barrier to entry" for programmers unfamiliar with SGML/XML/DOM concepts.


In reply to Re^2: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason by jcb
in thread I match a pattern in regex, yet I don't get the group I wanted to extract for some reason by SergioQ
