I've tested it. For my application a regex is often 100 times faster. The difference is that HTML::Parser must examine (and copy out) every single tag in the document. For a typical document, that is ~1000 matches and ~1000 callback invocations (start/end). A single regex (or even several) that looks only at a small part of the document is much faster. C may be faster, but HTML::Parser does far more work than I need. I'm not trying to re-implement all of HTML::Parser in Perl.
Obviously, if you want to grab lots of different data out of the HTML file, my suggested approach would be inappropriate, and HTML::Parser would be better. But say you just want to write a quick script to grab the temperature in Seattle, WA and put it on your home page. Or say you want to mine weather.com for the temperature in various cities, and each temperature appears on a different page. Running HTML::Parser on each page would be painfully slow.
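To make the "quick script" case concrete, here is a minimal sketch of a single targeted regex pulling one value out of a page. The page fragment, its markup, and the seattle_temp() helper are all made up for illustration; a real weather page will look different and the pattern would have to be adjusted to it.

```perl
use strict;
use warnings;

# Hypothetical page fragment (an assumption); real markup will differ.
my $page = <<'HTML';
<html><body>
<h1>Weather for Seattle, WA</h1>
<table><tr><td>Temperature:</td><td><b>54&deg;F</b></td></tr></table>
</body></html>
HTML

# Hypothetical helper: one targeted match, the rest of the document is ignored.
sub seattle_temp {
    my ($html) = @_;
    return $html =~ m{Temperature:</td>\s*<td>\s*<b>(-?\d+)&deg;F</b>}i ? $1 : undef;
}

my $temp = seattle_temp($page);
print defined $temp ? "Seattle: $temp F\n" : "no temperature found\n";
```

The pattern is brittle by design: it only knows about the one spot of interest, which is exactly the trade-off being argued for here.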
But this debate over the speed of HTML::Parser has gotten away from my original intent. This HTML-matching idea could be implemented on top of HTML::Parser and would still provide a mechanism for matching HTML that is easier and faster to write, and more readable, than a new HTML::Parser script. So I put the idea to all of you: is this interesting and/or useful?
BTW, for FilterProxy I use index/rindex to traverse the document and find the interesting portion, then use pos() and m/\G/g from there, which is much faster than running a single regex over the entire document. But again, implementation is not what I'm interested in at this point.
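A rough sketch of that index/rindex plus pos() and \G approach, for anyone who hasn't used it. This is not FilterProxy's actual code; the sample document, the begin/end comment markers, and the extract_temps() helper are assumptions invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample page; the markers and markup are invented for illustration.
my $html = join '',
    '<html><body><div id="nav"><b>skip me</b></div>',
    '<!-- begin forecast -->',
    '<td class="city">Seattle</td><td class="temp">54</td>',
    '<td class="city">Portland</td><td class="temp">58</td>',
    '<!-- end forecast -->',
    '<div id="footer"><b>skip me too</b></div></body></html>';

# Hypothetical helper: returns a hashref of city => temperature.
sub extract_temps {
    my ($doc) = @_;

    # Find the interesting portion with plain substring searches.
    my $start = index($doc, '<!-- begin forecast -->');
    my $end   = rindex($doc, '<!-- end forecast -->');
    return {} if $start < 0 || $end < $start;

    # Begin matching at $start instead of at the top of the document.
    pos($doc) = $start;

    my %temps;
    # \G anchors each attempt where the previous match left off.
    while ($doc =~ m{\G.*?<td class="city">([^<]+)</td><td class="temp">(-?\d+)</td>}gcs) {
        last if pos($doc) > $end;    # ignore anything past the end marker
        $temps{$1} = $2;
    }
    return \%temps;
}

my $temps = extract_temps($html);
print "$_: $temps->{$_}\n" for sort keys %$temps;
```

The substring searches are cheap, and the regex engine never has to look at the markup before the start marker; the pos() check keeps it from using anything after the end marker.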