PerlMonks  

A grammar for HTML matching

by mcelrath (Novice)
on Nov 01, 2000 at 09:33 UTC ( [id://39379]=perlquestion: print w/replies, xml ) Need Help??

mcelrath has asked for the wisdom of the Perl Monks concerning the following question:

I wish to match (in a regular-expression sense) data in an HTML file. Consider an application wishing to extract data from some web database like IMDB, CDDB, Yahoo TV, or Perl Monks. In order to do this, one must match the tidbit of HTML containing the data one seeks. The data itself is rarely contained in the tags, but by specifying the tag structure, one could in principle write a matcher to find said tidbit.

For a concrete example, consider the following snippet of HTML:

<table> <th> <td>Title</td> <td>Subject</td> <td>ISBN</td> <td>Publisher</td> <td>Synopsis</td> </th> <tr> <td>Ender's Game</td> <td>Science Fiction</td> <td>12375856996</td> <td>Macmillan</td> <td>a good book</td> </tr> </table>
If I wished to extract data from this table I might construct a "matcher" using the following (made-up) syntax:
tagblock <td> following tag <tr> inside { tagblock <table> containing { tagblock <th> { containing regex /Subject/ AND containing regex /ISBN/ AND containing regex /Publisher/ } } }
This would match <td>Ender's Game</td>, successfully extracting the title of the book. Such a matching scheme is relatively impervious to changes in layout, or even changes in the number of table rows or columns.
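For illustration only, here is a minimal Python sketch of the matching logic such a "tagblock ... containing ..." rule might compile down to, specialised to the book table above (the class and function names are invented; nothing like this exists yet):

```python
# Illustrative sketch: find the first data cell of the table whose header
# row contains all of the given words. Uses only the standard library.
from html.parser import HTMLParser

class TableMatcher(HTMLParser):
    """Collect table rows as lists of cell texts (a <th> or <tr> starts a row)."""
    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows; each row is a list of cell strings
        self.in_row = False
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "th"):
            self.in_row = True
            self.rows.append([])
        elif tag == "td" and self.in_row:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("tr", "th"):
            self.in_row = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

def first_cell_after_header(html, *header_words):
    """Return the first <td> of the row following a header row that
    contains every word in header_words, or None."""
    p = TableMatcher()
    p.feed(html)
    for header, row in zip(p.rows, p.rows[1:]):
        if all(any(w in cell for cell in header) for w in header_words):
            return row[0] if row else None
    return None
```

Fed the table snippet above, `first_cell_after_header(html, "Subject", "ISBN", "Publisher")` returns `"Ender's Game"`, and it does not care how many rows or columns the table has.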

Consider a second example to find a doubleclick.net advertisement:

tagblock <table width=468 height=60> containing { (tag <img> AND tag <a href =~ /doubleclick.net/>) OR tag <a href =~ /flycast.com/> OR tag <!--/Begin Ad/--> } regex /doubleclick.net/ inside { attrib </img|script|i?frame|i?layer/ src> OR attrib <a href> } inside tag <a|img|script|i?frame|i?layer>

These examples are verbose, and describe something that doesn't exist, but I hope the syntax demonstrates what I'm trying to do. This could be done using HTML::Parser or HTML::TokeParser and a good bit of work, but it would be very slow (consider the second example: finding doubleclick.net with a regex is much faster than examining every tag with HTML::Parser), and it would be fragile with respect to changes in the HTML document's structure, layout, and content. Of course, once the grammar is written and parsed (Parse::RecDescent), one could implement the matcher using HTML::Parser. By creating a grammar for pattern matching on HTML, in a manner that obeys document structure, one could have fast matching with a relatively intuitive matching language. Obeying document structure also allows one to modify the matched tidbit without destroying the document's structure by leaving unclosed tags.
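The "finding doubleclick.net with a regex is much faster" point amounts to a two-phase scan: a cheap string search locates the interesting bytes, and only that small slice gets looked at structurally. A rough Python sketch (the function and the sample page are made up for illustration, and it deliberately ignores edge cases like '<' inside attribute values):

```python
# Illustrative two-phase scan: locate a needle cheaply, then inspect only
# the tag that encloses it, never parsing the rest of the document.
def enclosing_tag(html, needle):
    hit = html.find(needle)
    if hit == -1:
        return None
    start = html.rfind("<", 0, hit)   # back up to the nearest opening '<'
    end = html.find(">", hit)         # scan forward to the closing '>'
    if start == -1 or end == -1:
        return None
    return html[start:end + 1]
```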

The question is: has this been done? Has anything similar been attempted? Is anyone working on such a thing? Does anyone want to work on such a thing? Do other people besides me have a use for such a thing?

Replies are listed 'Best First'.
Re: A grammar for HTML matching
by merlyn (Sage) on Nov 01, 2000 at 11:09 UTC
    Keep in mind that the guts of HTML::Parser is written in C. Very fast C. If I recall, the speed of the C version was something like 10 times the speed of the regex version, so I doubt you can hand-roll a regex parser even for a subset of the problem that can beat the C version now.

    -- Randal L. Schwartz, Perl hacker

      I've tested it. For my application a regex is often 100 times faster. The difference is that HTML::Parser must examine (and copy out) every single tag in the document. For a typical document, that is ~1000 matches and 1000 function calls (start/end). A single regex (or even several regexes) that examines only a small part of the document is much faster. C may be faster, but HTML::Parser does far more work than I need. I'm not trying to re-implement all of HTML::Parser in Perl.

      Obviously, if you want to grab lots of different data out of the HTML file, my suggested approach would be inappropriate, and HTML::Parser would be better. But say you just want to write a quick script to grab the temperature in Seattle, WA and put it on your home page. Or say you want to mine weather.com for the temperature in various cities, and each temperature appears on a different page. HTML::Parser on each page would be painfully slow.

      But this debate over the speed of HTML::Parser has gotten away from my original intent. This HTML-matching idea could be implemented using HTML::Parser, and would then provide an easier, faster-to-write, and more readable mechanism for matching HTML than writing a new HTML::Parser script each time. And I submit this idea to all of you... is this interesting and/or useful?

      BTW, for FilterProxy I use index/rindex to traverse the document and find the interesting portion, then use pos() and m/\G.../g, which is much faster than applying a single regex to the entire document. But again, implementation is not what I'm interested in at this point.
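      For anyone unfamiliar with that idiom, the same trick looks roughly like this in Python (the marker and the temperature pattern are hypothetical): find a cheap landmark first, then resume regex matching from that offset instead of rescanning the whole document.

```python
# Illustrative sketch of the find-then-resume idiom (like index() + pos()/\G).
import re

TEMP = re.compile(r"(\d+)&deg;F")        # hypothetical temperature pattern

def temp_after_marker(html, marker):
    i = html.find(marker)                # cheap landmark scan, like index()
    if i == -1:
        return None
    m = TEMP.search(html, i)             # resume matching at that offset
    return int(m.group(1)) if m else None
```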

        Ehm, no.
        You can tell the parser to give back everything up to the next occurrence of a specified token (tag, text, ...).
        There's no faster way of doing this, short of matching the whole document against your regex, which can't be faster. :-)
        So the easiest approach seems to be to tell the parser to look for <tr>, then the next <td>, and so on.
        See the docs for HTML::Parser and HTML::TokeParser.

        Have a nice day
        All decision is left to your taste
RE: A grammar for HTML matching
by extremely (Priest) on Nov 01, 2000 at 09:48 UTC
    I honestly don't see much gain in it. HTML that is regular enough to be worth doing this to would be better served by XML. HTML that isn't so regular still wouldn't match well, even with your theoretical grammar.

    HTML is too irregular to treat in a regular manner, and the benefit is too low. Thus the people who most wanted this sort of thing (like /.'s slashboxes and such) tended to encourage more regular standards, in order to get what they wanted without ad-hoc parsing that needs to be tweaked every week or month.

    Still, if you take a stab at it, I'll tinker with it. It's an interesting example of what Larry Wall has been calling "little languages". Yah never know, you might open up some flood gates and set a buncha people off in a direction you never expected.

    --
    $you = new YOU;
    honk() if $you->love(perl)

RE: A grammar for HTML matching
by dchetlin (Friar) on Nov 01, 2000 at 09:56 UTC
    It's not clear to me exactly what problem you're trying to solve, but the idea is interesting. It's true that HTML::Parser and friends can be slow. However, I'm sure you're aware of the difficulty of parsing markup correctly. Basically, for something like this there's a tradeoff between speed and generality; my guess is that you could put together something much faster that served your particular purpose here, but wouldn't scale or solve much else.

    In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

    Finally, if you haven't seen HTML::TreeBuilder, take a look at that. My guess is that you know about it and would consider it too slow as well, but just in case, there it is.

    -dlc

      In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

      Well, both. The idea is that I only care about a small part of the total document. I don't want to have to examine all the irrelevant parts of the document just to get to the part I'm interested in. The benefits of this are speed and invariance to document layout. If you know the summary for the book follows <p>Summary you can ignore the rest of the document. I want to respect document structure within the segment I'm interested in, but disregard the rest.

      HTML::TreeBuilder is a subclass of HTML::Parser, and while this idea could be implemented using it, the idea is that it doesn't have to be.

      The applications I have in mind for this are:

      1. Ad-filtering by stripping selected portions of HTML for my pet project FilterProxy. As I've developed this, my mechanism for specifying how to find the piece of HTML containing the ad has gone through many revisions and has been pretty convoluted (but I'm evolving toward the syntax in my original message). This "HTML matcher" idea would be perfect.
      2. Scripts which extract data from web sites without using the entire web page. For instance, ShowTimes, an app for the Palm Pilot, downloads movie theatres, showtimes, and plot summaries from several websites (yahoo.com for movies, imdb for summaries). But every time the site(s) go through minor revisions, the script breaks. I'm sure there are others who have a custom script to grab a specific piece of data from a web page. Wouldn't it be convenient to just specify a "matcher" like I've outlined?
        I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

        It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas of how this approach will be implemented, I encourage you to do it. Perhaps I'd understand once I see an actual example or some code.

        -dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

Re: A grammar for HTML matching
by brainpan (Monk) on Nov 01, 2000 at 14:34 UTC
    So far it sounds like people aren't interested in this idea. FWIW, I would make good use of something like this; as it stands now I have (hangs his head) a few shell scripts doing something similar for me (it was before I knew that 'Perl Syntax' wasn't an oxymoron; I had been led to believe that random ASCII strings were invariably valid perl code). Three days ago one of the sites changed their site design (they seem to have intentionally broken the strict hierarchy I was counting on), and I have yet to get around to figuring out what their new layout is. Having some standardized syntax for "find FOO, then parse out everything until BAZ" would solve this, and if done right would even survive all but the most severe site redesigns.

    While I see the objections to departing from HTML::Parser, I agree with mcelrath that using a regex to skip past 85K to the 4K of text that you actually want (in many cases a simple grep is all that's needed) would be a Good Thing. If HTML::Parser ends up being part of the solution (if only after the desired portion of the document is reached), then so be it.

    Having voiced his support for mcelrath's idea, brainpan steps down from his soapbox.
Re: A grammar for HTML matching
by mitd (Curate) on Nov 01, 2000 at 11:43 UTC
    I have had good luck with web data scavenging using HTML::TableExtract and HTML::Parser. Generalizing this approach is more problematic. Locating the target data is, more often than not, a huge intra-document navigation problem. The solutions usually demand individual customization.

    But when the target data is tabular and formatted in HTML tables, you can't beat HTML::TableExtract. Its ability to handle table structure both symbolically and by index, nested or standalone, is very powerful.

    mitd-Made in the Dark
    'My favourite colour appears to be grey.'

RE: A grammar for HTML matching
by clemburg (Curate) on Nov 01, 2000 at 23:48 UTC

    I think your approach will be worth following, but I have not seen anything like it until now.

    The big difference between your proposal and the existing modules, if I understand you correctly, is that you want to create (e.g., with this "tagblock" syntax) a special-purpose regex-based parser that ignores nearly all of the document and just dives into the parts that match the input specification. This could of course be faster than a full parse of the whole document tree.

    The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

    From a performance point of view, the result of the compiled specification should almost surely not be one big regex (lots of alternation will make it slow, and the "Little Engine that Couldn't" problem looms, too), but a closure that performs the needed logical tests in concert with the generated regexes.
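    Such a compiled closure could look roughly like this (an illustrative Python sketch; the combinator names are invented):

```python
# Illustrative sketch: compile spec clauses to predicates, then combine
# them with AND/OR into a single callable test.
import re

def containing(pattern):
    rx = re.compile(pattern)
    return lambda text: rx.search(text) is not None

def all_of(*preds):
    return lambda text: all(p(text) for p in preds)

def any_of(*preds):
    return lambda text: any(p(text) for p in preds)

# e.g. the header test from the book-table example in the root node:
header_test = all_of(containing("Subject"),
                     containing("ISBN"),
                     containing("Publisher"))
```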

    The main point here is of course that with your approach, people who would not be able to use regexes at all will be able to do some quite sophisticated regex matching. Maybe that warrants some research, too - are there any "user-friendly" regex specification packages out there?

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

      The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

      I'm working on this. Having never written a BNF-style grammar, I'll need a while to put it all together. I'm playing with Parse::RecDescent to do this. I'm having conceptual problems coming up with a syntax that encompasses all the features I want, and trying to find ideas for features I haven't thought of. For instance, say I want to grab everything between two comments. Or say I want to expand the match to include tags that open immediately preceding the match and close immediately following it; i.e., for <b><center>...match...</center></b> I want to suck in the surrounding <center> and <b> tags.
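      The two features above can be sketched with plain regexes (an illustrative Python sketch; the comment markers and the expansion logic are assumptions, not a worked-out design):

```python
# Illustrative sketches of "grab everything between two comments" and
# "expand a match to swallow tag pairs that immediately wrap it".
import re

def between_comments(html, begin, end):
    """Everything between <!-- begin --> and <!-- end -->."""
    m = re.search(r"<!--\s*%s\s*-->(.*?)<!--\s*%s\s*-->"
                  % (re.escape(begin), re.escape(end)), html, re.S)
    return m.group(1) if m else None

def expand_wrappers(html, start, stop):
    """Grow a match [start:stop) over tag pairs that immediately wrap it,
    e.g. <b><center>...match...</center></b> swallows both pairs."""
    while True:
        m = re.search(r"<(\w+)[^>]*>\s*$", html[:start])      # tag ending at start
        if not m:
            break
        m2 = re.match(r"\s*</%s>" % m.group(1), html[stop:])  # its close after stop
        if not m2:
            break
        start, stop = m.start(), stop + m2.end()
    return start, stop
```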

      Ideas anyone? What do you want to match?

        <cite>What do you want to match?</cite>

        Well, certainly not too complicated things. The point is probably to express the possible relationships between tags (e.g., contained in, preceding, following), and not the tags in themselves. Obviously this is not too trivial because of all these nifty exceptions that are allowed in HTML. Maybe it would be good to divide the parsing phase into a "candidate recognition" phase (purely regex based) and a "HTML parsing" phase, where you would expand the snippet found to canonical HTML syntax.

        Christian Lemburg
        Brainbench MVP for Perl
        http://www.brainbench.com

See also...
by Anonymous Monk on Nov 04, 2000 at 16:11 UTC
    Hello. Bastian (author of webcleaner), bob (filterproxy) and I (aifilter) are currently discussing these things via email. I invite everyone to have a look at bloodgate.com/aifilter (and especially at the documents inside doc/performance of the downloadable source) to see how I solved the problem with HTML::Parser. The parser is fast, but if you need it to re-parse the document 10 times it starts to get really slow. Of course, bob is right: regexps can be faster than the parser. Anyway, send suggestions, opinions, comments etc. to this site, or via email. Thanx in advance, Tels (bloodgate.com)
Re: A grammar for HTML matching
by elwarren (Priest) on Nov 02, 2000 at 00:45 UTC
    I would use it. I believe what he wants to do is match a code block and extract the content. I have written several small scripts that run via cron to collect some information and file it away. I ended up using two different methods to grab what I wanted.

    The first was just to scan the html looking for a comment line and grabbing most everything after it. That was the easy one.

    The second site was more complicated: the data I was trying to extract was in a large table that changed size depending on what they were displaying. I didn't feel like learning HTML::Parser at the time, and I hadn't found HTML::TableExtract either. I cheated and piped the page through lynx and grabbed what I wanted from the parsed text output.

    So neither of those methods would help you :-) but if you put something like this together I would use it. I still have to take a look at HTML::TableExtract, but I'll get around to it.

    I've seen a few packages on freshmeat.net that will snag comic strips off the web and put them somewhere for you. They might have some good techniques for extracting that stuff.

    HTH
RE: A grammar for HTML matching
by AgentM (Curate) on Nov 01, 2000 at 20:56 UTC
    Have you considered filtering out exactly what you DON'T need and are sure of (say, the first x bytes, or everything up to a special tag) and then using HTML::Parser on the remaining data? That way, you don't need to match any formats in your important search. In general, if you know the bounds of the useful information, you should be able to throw away the rest and perform the searching on the relevant data. I certainly agree with you that your method would be most efficient, albeit not very flexible. I really see two steps here. Of course, benchmarks may prove me wrong, since HTML::Parser isn't even Perl.
    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
RE: A grammar for HTML matching
by runrig (Abbot) on Nov 01, 2000 at 09:34 UTC
    Update: You are right. That'll teach me to not pay attention ;-)
      You obviously didn't read the entire message.
