Fellow monks, this is a meditation on something I've noticed many people ask for, and do, when they have to solve a specific text-processing task in Perl...
They use regexes!

Now, many of you may be wondering, what is *his* problem? Regexes are great, right? Sure they are... but perhaps the statement I made above should be amended to read, "they use regexes without a thought for any of the other, easier-to-use options available to them"... and this is where this meditation (some would be justified in calling it a rant) starts...

Consider the mantra at PerlMonks: use CGI, use CGI, use CGI... don't ever think about rolling your own code for parsing the input parameters. The reasons behind that advice apply equally to any number of other tasks... specifically, the one I wish to address is parsing/munging/extracting elements from HTML...

Just today, I saw someone ask for a regex to extract HREF blocks from an HTML file... and I wondered, why? Is it necessary to use a regex for something that can just as easily be handed off to a module built for the task? The answer is, of course, an emphatic NO!

Consider a recent node about why it's not acceptable to avoid the use of CGI.pm... can't the same be said for this task? Of course it can... so, my new mantra for any/most who ask for a quickie regex is: use a module, use a module, use a module...

CPAN is packed with modules for parsing HTML, my favourite being HTML::TokeParser. Others that are definitely worth looking at include HTML::Parser and, as mentioned here, HTML::Filter. Any or all of those modules can be used directly for token recognition and munging of HTML in general, and they *can* have significant advantages over a first-pass regex written by an average Perl user: they're pretty fast, they're less error prone, they catch the edge cases that most regex authors don't immediately think of handling, and for the most part they have been "eyeballed" by countless others, so the code you end up relying on has already been partially validated... not so with a freshly written regex...
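
As a rough illustration of the HREF question mentioned earlier, here is a minimal sketch using HTML::TokeParser (it ships with the HTML-Parser distribution on CPAN, so it has to be installed first). The sample document and the helper name extract_hrefs are made up for this example; it is one way to do it, not the definitive one.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Hypothetical sample document, invented for this illustration.
my $html = <<'HTML';
<html><body>
<a href="http://www.perlmonks.org/">PerlMonks</a>
<A HREF='/faq.html' class="nav">FAQ</A>
<a name="top">no href here</a>
<!-- <a href="commented-out.html">ignored</a> -->
</body></html>
HTML

# extract_hrefs is a hypothetical helper: it returns the href value
# of every <a> start tag that has one.
sub extract_hrefs {
    my ($doc) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser: $!";

    my @hrefs;

    # get_tag('a') skips ahead to the next <a> start tag and returns
    # [ $tag, \%attr, \@attrseq, $text ], or undef at end of document.
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};    # attribute names arrive lowercased
        push @hrefs, $href if defined $href;
    }
    return @hrefs;
}

print "$_\n" for extract_hrefs($html);
```

Note the cases a quick regex tends to trip over: the upper-case tag, the single-quoted attribute value, the anchor with no href at all, and the link sitting inside a comment. The parser deals with each of them without any extra code on your part.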

So, the next time you think of doing something complicated with HTML munging, head over to CPAN and take a look around... then (if you must) think about rolling up your sleeves and writing a regex. The time spent at CPAN is time well spent.

feels much better after getting that off his chest... thanks for reading.


In reply to Picking the best way.... by tinman
