PerlMonks  

A grammar for HTML matching

by mcelrath (Novice)
on Nov 01, 2000 at 09:33 UTC ( [id://39379]=perlquestion: print w/replies, xml ) Need Help??

mcelrath has asked for the wisdom of the Perl Monks concerning the following question:

I wish to match (in a regular-expression sense) data in an HTML file. Consider an application wishing to extract data from some web database like IMDB, CDDB, Yahoo TV, or Perl Monks. In order to do this, one must match the tidbit of HTML containing the data one seeks. The data itself is rarely contained in the tags, but by specifying the tag structure, one could in principle write a matcher to find said tidbit.

For a concrete example, consider the following snippet of HTML:

<table> <th> <td>Title</td> <td>Subject</td> <td>ISBN</td> <td>Publisher</td> <td>Synopsis</td> </th> <tr> <td>Ender's Game</td> <td>Science Fiction</td> <td>12375856996</td> <td>Macmillan</td> <td>a good book</td> </tr> </table>
If I wished to extract data from this table I might construct a "matcher" using the following (made-up) syntax:
tagblock <td> following tag <tr> inside { tagblock <table> containing { tagblock <th> { containing regex /Subject/ AND containing regex /ISBN/ AND containing regex /Publisher/ } } }
This would match <td>Ender's Game</td>, successfully extracting the title of the book. Such a matching scheme is relatively impervious to changes in layout, or even changes in the number of table rows or columns.
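For illustration only, here is a minimal Python sketch of the matching logic such a "tagblock ... containing ..." rule might compile down to, specialised to the book table above (the class and function names are invented; nothing like this exists yet):

```python
# Illustrative sketch: find the first data cell of the table whose header
# row contains all of the given words. Uses only the standard library.
from html.parser import HTMLParser

class TableMatcher(HTMLParser):
    """Collect table rows as lists of cell texts (a <th> or <tr> starts a row)."""
    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows; each row is a list of cell strings
        self.in_row = False
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "th"):
            self.in_row = True
            self.rows.append([])
        elif tag == "td" and self.in_row:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("tr", "th"):
            self.in_row = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

def first_cell_after_header(html, *header_words):
    """Return the first <td> of the row following a header row that
    contains every word in header_words, or None."""
    p = TableMatcher()
    p.feed(html)
    for header, row in zip(p.rows, p.rows[1:]):
        if all(any(w in cell for cell in header) for w in header_words):
            return row[0] if row else None
    return None
```

Fed the table snippet above, `first_cell_after_header(html, "Subject", "ISBN", "Publisher")` returns `"Ender's Game"`, and it does not care how many rows or columns the table has.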

Consider a second example to find a doubleclick.net advertisement:

tagblock <table width=468 height=60> containing { (tag <img> AND tag <a href =~ /doubleclick.net/>) OR tag <a href =~ /flycast.com/> OR tag <!--/Begin Ad/--> } regex /doubleclick.net/ inside { attrib </img|script|i?frame|i?layer/ src> OR attrib <a href> } inside tag <a|img|script|i?frame|i?layer>

These examples are verbose, and describe something that doesn't exist, but I hope the syntax demonstrates what I'm trying to do. This could be done using HTML::Parser or HTML::TokeParser and a good bit of work, but it would be very slow (consider the second example: finding doubleclick.net with a regex is much faster than examining every tag with HTML::Parser), and it would be fragile with respect to changes in the HTML document's structure, layout, and content. Of course, once the grammar is written and parsed (Parse::RecDescent), one could implement the matcher using HTML::Parser. By creating a grammar for pattern matching on HTML, in a manner that obeys document structure, one could have fast matching with a relatively intuitive matching language. Obeying document structure also allows one to modify the matched tidbit without destroying the document's structure by leaving unclosed tags.
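The "finding doubleclick.net with a regex is much faster" point amounts to a two-phase scan: a cheap string search locates the interesting bytes, and only that small slice gets looked at structurally. A rough Python sketch (the function and the sample page are made up for illustration, and it deliberately ignores edge cases like '<' inside attribute values):

```python
# Illustrative two-phase scan: locate a needle cheaply, then inspect only
# the tag that encloses it, never parsing the rest of the document.
def enclosing_tag(html, needle):
    hit = html.find(needle)
    if hit == -1:
        return None
    start = html.rfind("<", 0, hit)   # back up to the nearest opening '<'
    end = html.find(">", hit)         # scan forward to the closing '>'
    if start == -1 or end == -1:
        return None
    return html[start:end + 1]
```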

The question is: has this been done? Has anything similar been attempted? Is anyone working on such a thing? Does anyone want to work on such a thing? Do other people besides me have a use for such a thing?

Replies are listed 'Best First'.
Re: A grammar for HTML matching
by merlyn (Sage) on Nov 01, 2000 at 11:09 UTC
    Keep in mind that the guts of HTML::Parser is written in C. Very fast C. If I recall, the speed of the C version was something like 10 times the speed of the regex version, so I doubt you can hand-roll a regex parser even for a subset of the problem that can beat the C version now.

    -- Randal L. Schwartz, Perl hacker

      I've tested it. For my application a regex is often 100 times faster. The difference is that HTML::Parser must examine (and copy out) every single tag in the document. For a typical document, that is ~1000 matches and 1000 function calls (start/end). A single regex (or even several regexes) that examines only a small part of the document is much faster. C may be faster, but HTML::Parser does far more work than I need. I'm not trying to re-implement all of HTML::Parser in Perl.

      Obviously, if you want to grab lots of different data out of the HTML file, my suggested approach would be inappropriate, and HTML::Parser would be better. But say you just want to write a quick script to grab the temperature in Seattle, WA and put it on your home page. Or say you want to mine weather.com for the temperature in various cities, and each temperature appears on a different page. HTML::Parser on each page would be painfully slow.

      But this debate over the speed of HTML::Parser has gotten away from my original intent. This HTML-matching idea could be implemented using HTML::Parser, and would then provide an easier, faster-to-write, and more readable mechanism for matching HTML than writing a new HTML::Parser script each time. And I submit this idea to all of you... is this interesting and/or useful?

      BTW, for FilterProxy I use index/rindex to traverse the document and find the interesting portion, then use pos() and m/\G.../g, which is much faster than applying a single regex to the entire document. But again, implementation is not what I'm interested in at this point.
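      For anyone unfamiliar with that idiom, the same trick looks roughly like this in Python (the marker and the temperature pattern are hypothetical): find a cheap landmark first, then resume regex matching from that offset instead of rescanning the whole document.

```python
# Illustrative sketch of the find-then-resume idiom (like index() + pos()/\G).
import re

TEMP = re.compile(r"(\d+)&deg;F")        # hypothetical temperature pattern

def temp_after_marker(html, marker):
    i = html.find(marker)                # cheap landmark scan, like index()
    if i == -1:
        return None
    m = TEMP.search(html, i)             # resume matching at that offset
    return int(m.group(1)) if m else None
```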

        Ehm, no.
        You can tell the parser to give back everything up to the next occurrence of a specified token (tag, text, ...).
        There's no faster way of doing this, short of matching the whole document against your regex, which can't be faster. :-)
        So the easiest approach seems to be to tell the parser to look for <tr>, then the next <td>, and so on.
        See the docs for HTML::Parser and HTML::TokeParser.

        Have a nice day
        All decision is left to your taste
RE: A grammar for HTML matching
by extremely (Priest) on Nov 01, 2000 at 09:48 UTC
    I honestly don't see much gain in it. HTML that is regular enough to be worth doing this to would be better served by XML. HTML that isn't so regular still wouldn't match well, even with your theoretical grammar.

    HTML is too irregular to treat in a regular manner, and the benefit is too low. Thus the people who most wanted this sort of thing (like /.'s slashboxes and such) tended to encourage more regular standards, in order to get what they wanted without ad-hoc parsing that needs to be tweaked every week or month.

    Still, if you take a stab at it, I'll tinker with it. It's an interesting example of what Larry Wall has been calling "little languages". Yah never know, you might open up some flood gates and set a buncha people off in a direction you never expected.

    --
    $you = new YOU;
    honk() if $you->love(perl)

RE: A grammar for HTML matching
by dchetlin (Friar) on Nov 01, 2000 at 09:56 UTC
    It's not clear to me exactly what problem you're trying to solve, but the idea is interesting. It's true that HTML::Parser and friends can be slow. However, I'm sure you're aware of the difficulty of parsing markup correctly. Basically, for something like this there's a tradeoff between speed and generality; my guess is that you could put together something much faster that served your particular purpose here, but wouldn't scale or solve much else.

    In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

    Finally, if you haven't seen HTML::TreeBuilder, take a look at that. My guess is that you know about it and would consider it too slow as well, but just in case, there it is.

    -dlc

      In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

      Well, both. The idea is that I only care about a small part of the total document. I don't want to have to examine all the irrelevant parts of the document just to get to the part I'm interested in. The benefits of this are speed and invariance to document layout. If you know the summary for the book follows <p>Summary you can ignore the rest of the document. I want to respect document structure within the segment I'm interested in, but disregard the rest.

      HTML::TreeBuilder is a subclass of HTML::Parser, and while this idea could be implemented using it, the idea is that it doesn't have to be.

      The applications I have in mind for this are:

      1. Ad-filtering by stripping selected portions of HTML for my pet project FilterProxy. As I've developed this, my mechanism for specifying how to find the piece of HTML containing the ad has gone through many revisions and has been pretty convoluted (but I'm evolving toward the syntax in my original message). This "HTML matcher" idea would be perfect.
      2. Scripts which extract data from web sites without using the entire web page. For instance, ShowTimes, an app for the Palm Pilot, downloads movie theatres, showtimes, and plot summaries from several websites (yahoo.com for movies, imdb for summaries). But every time the site(s) go through minor revisions, the script breaks. I'm sure there are others who have a custom script to grab a specific piece of data from a web page. Wouldn't it be convenient to just specify a "matcher" like I've outlined?
        I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

        It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas of how this approach will be implemented, I encourage you to do it. Perhaps I'd understand once I see an actual example or some code.

        -dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

Re: A grammar for HTML matching
by brainpan (Monk) on Nov 01, 2000 at 14:34 UTC
    So far it sounds like people aren't interested in this idea. FWIW, I would make good use of something like this; as it stands now I have (hangs his head) a few shell scripts doing something similar for me (it was before I knew that 'Perl Syntax' wasn't an oxymoron; I had been led to believe that random ASCII strings were invariably valid perl code). Three days ago one of the sites changed their site design (they seem to have intentionally broken the strict hierarchy I was counting on), and I have yet to get around to figuring out what their new layout is. Having some standardized syntax for "find FOO, then parse out everything until BAZ" would solve this, and if done right would even survive all but the most severe site redesigns.

    While I see the objections to departing from HTML::Parser, I agree with mcelrath that using a regex to skip past 85K to the 4K of text that you actually want (in many cases a simple grep is all that's needed) would be a Good Thing. If HTML::Parser ends up being part of the solution (if only after the desired portion of the document is reached), then so be it.

    Having voiced his support for mcelrath's idea, brainpan steps down from his soapbox.
Re: A grammar for HTML matching
by mitd (Curate) on Nov 01, 2000 at 11:43 UTC
    I have had good luck with web data scavenging using HTML::TableExtract and HTML::Parser. Generalizing this approach is more problematic. Locating the target data is, more often than not, a huge intra-document navigation problem. The solutions usually demand individual customization.

    But when the target data is tabular and formatted in HTML tables, you can't beat HTML::TableExtract. Its ability to handle table structure both symbolically and by index, nested or standalone, is very powerful.

    mitd-Made in the Dark
    'My favourite colour appears to be grey.'

RE: A grammar for HTML matching
by clemburg (Curate) on Nov 01, 2000 at 23:48 UTC

    I think your approach will be worth following, but I have not seen anything like it until now.

    The big difference between your proposal and the existing modules, if I understand you correctly, is that you want to create (e.g., with this "tagblock" syntax) a special-purpose regex-based parser that ignores nearly all of the document and just dives into the parts that match the input specification. This could of course be faster than a full parse of the whole document tree.

    The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

    From a performance point of view, the result of the compiled specification should almost surely not be one big regex (lots of alternation will make it slow, and the "Little Engine that Couldn't" problem looms, too), but a closure that performs the needed logical tests in concert with the generated regexes.
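    Such a compiled closure could look roughly like this (an illustrative Python sketch; the combinator names are invented):

```python
# Illustrative sketch: compile spec clauses to predicates, then combine
# them with AND/OR into a single callable test.
import re

def containing(pattern):
    rx = re.compile(pattern)
    return lambda text: rx.search(text) is not None

def all_of(*preds):
    return lambda text: all(p(text) for p in preds)

def any_of(*preds):
    return lambda text: any(p(text) for p in preds)

# e.g. the header test from the book-table example in the root node:
header_test = all_of(containing("Subject"),
                     containing("ISBN"),
                     containing("Publisher"))
```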

    The main point here is of course that with your approach, people who would not be able to use regexes at all will be able to do some quite sophisticated regex matching. Maybe that warrants some research, too - are there any "user-friendly" regex specification packages out there?

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

      The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

      I'm working on this. Having never written a BNF-style grammar, I'll need a while to put it all together. I'm playing with Parse::RecDescent to do this. I'm having conceptual problems coming up with a syntax that encompasses all the features I want, and trying to find ideas for features I haven't thought of. For instance, say I want to grab everything between two comments. Or say I want to expand the match to include tags that open immediately preceding the match and close immediately following it; i.e., for <b><center>...match...</center></b> I want to suck in the surrounding <center> and <b> tags.
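      The two features above can be sketched with plain regexes (an illustrative Python sketch; the comment markers and the expansion logic are assumptions, not a worked-out design):

```python
# Illustrative sketches of "grab everything between two comments" and
# "expand a match to swallow tag pairs that immediately wrap it".
import re

def between_comments(html, begin, end):
    """Everything between <!-- begin --> and <!-- end -->."""
    m = re.search(r"<!--\s*%s\s*-->(.*?)<!--\s*%s\s*-->"
                  % (re.escape(begin), re.escape(end)), html, re.S)
    return m.group(1) if m else None

def expand_wrappers(html, start, stop):
    """Grow a match [start:stop) over tag pairs that immediately wrap it,
    e.g. <b><center>...match...</center></b> swallows both pairs."""
    while True:
        m = re.search(r"<(\w+)[^>]*>\s*$", html[:start])      # tag ending at start
        if not m:
            break
        m2 = re.match(r"\s*</%s>" % m.group(1), html[stop:])  # its close after stop
        if not m2:
            break
        start, stop = m.start(), stop + m2.end()
    return start, stop
```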

      Ideas anyone? What do you want to match?

        <cite>What do you want to match?</cite>

        Well, certainly not too complicated things. The point is probably to express the possible relationships between tags (e.g., contained in, preceding, following), and not the tags in themselves. Obviously this is not too trivial because of all these nifty exceptions that are allowed in HTML. Maybe it would be good to divide the parsing phase into a "candidate recognition" phase (purely regex based) and a "HTML parsing" phase, where you would expand the snippet found to canonical HTML syntax.

        Christian Lemburg
        Brainbench MVP for Perl
        http://www.brainbench.com

See also...
by Anonymous Monk on Nov 04, 2000 at 16:11 UTC
    Hello. Bastian (author of webcleaner), bob (filterproxy) and I (aifilter) are currently discussing these things via email. I invite everyone to have a look at bloodgate.com/aifilter (and especially at the documents inside doc/performance of the downloadable source) to see how I solved the problem with HTML::Parser. The parser is fast, but if you need it to re-parse the document 10 times it starts to get really slow. Of course, bob is right: regexps can be faster than the parser. Anyway, send suggestions, opinions, comments etc. to this site, or via email. Thanx in advance, Tels (bloodgate.com)
Re: A grammar for HTML matching
by elwarren (Priest) on Nov 02, 2000 at 00:45 UTC
    I would use it. I believe what he wants to do is match a code block and extract the content. I have written several small scripts that run via cron to collect some information and file it away. I ended up using two different methods to grab what I wanted.

    The first was just to scan the html looking for a comment line and grabbing most everything after it. That was the easy one.

    The second site was more complicated: the data I was trying to extract was in a large table that changed size depending on what they were displaying. I didn't feel like learning HTML::Parser at the time, and I hadn't found HTML::TableExtract either. I cheated and piped the page through lynx and grabbed what I wanted from the parsed text output.

    So neither of those methods would help you :-) but if you put something like this together I would use it. I still have to take a look at HTML::TableExtract, but I'll get around to it.

    I've seen a few packages on freshmeat.net that will snag comic strips off the web and put them somewhere for you. They might have some good techniques for extracting that stuff.

    HTH
RE: A grammar for HTML matching
by AgentM (Curate) on Nov 01, 2000 at 20:56 UTC
    Have you considered filtering out exactly what you DON'T need and are sure of (say, the first x bytes, or everything up to a special tag) and then using HTML::Parser on the remaining data? That way, you don't need to match any formats in your important search. In general, if you know the bounds of the useful information, you should be able to throw away the rest and perform the searching on the relevant data. I certainly agree with you that your method would be most efficient, albeit not very flexible. I really see two steps here. Of course, benchmarks may prove me wrong, since HTML::Parser isn't even Perl.
    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
RE: A grammar for HTML matching
by runrig (Abbot) on Nov 01, 2000 at 09:34 UTC
    Update: You are right. That'll teach me to not pay attention ;-)
      You obviously didn't read the entire message.
