HTML Search Engine/Parser

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Before I duplicate someone else's work, does anyone know of a good HTML search engine/parser that is available in Perl (or even C/C++)? What I'm looking for is a really well-designed and powerful parser that can, among other things, keep a hierarchical list of the tags it's currently searching through. For example, if I have:

And I want to find all IMG tags, it should know whether those tags are nested, as above, inside body, table, tr, td.

Any ideas?

Replies are listed 'Best First'.
Re: HTML Search Engine/Parser by LD2 (Curate) on Jul 14, 2001 at 08:25 UTC
Why not check cpan.org? You may want to look at HTML::Parser and/or HTML::TreeBuilder.
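For the nesting requirement in the question, here is a minimal sketch of how HTML::Parser might be used to keep a stack of open tags while looking for IMG tags. The sample markup, the @stack and @found variables, and the short %void list are assumptions made for illustration (the markup from the original question is not shown in the post), and real-world HTML with missing end tags would need more care than this.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input, invented for this sketch.
my $html = '<html><body><table><tr><td><img src="a.png"></td></tr></table>'
         . '<img src="b.png"></body></html>';

my @stack;   # currently open elements, outermost first
my @found;   # one entry per IMG tag: [ src, "ancestor/path" ]

# Elements that never have an end tag must not stay on the stack.
# (Partial list, chosen for this example.)
my %void = map { $_ => 1 } qw(area base br col hr img input link meta);

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;          # tag names arrive lowercased
        if ($tag eq 'img') {
            my $src = defined $attr->{src} ? $attr->{src} : '(no src)';
            push @found, [ $src, join('/', @stack) ];
        }
        push @stack, $tag unless $void{$tag};
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        # Unwind to the nearest matching open tag; ignore stray end tags.
        for my $i (reverse 0 .. $#stack) {
            if ($stack[$i] eq $tag) { splice @stack, $i; last; }
        }
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

printf "%s is inside %s\n", @$_ for @found;
```

HTML::TreeBuilder would give you the same information from a finished tree (each element knows its parent), at the cost of holding the whole document in memory.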
Re: HTML Search Engine/Parser by agent00013 (Pilgrim) on Jul 14, 2001 at 19:52 UTC
HTML::LinkExtor is also good for parsing. You can use it for links as well as other tags if you do it right. If you look at the grabLinks function in my URL Checking Spider, you can see an example of how I used it to grab all the links and images from a web site. (The script spiders through a series of pages, so with some modification and additional functions, you might be able to set it up as a search engine.) I hope that gives you a start. Good luck.
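As a rough illustration of the idea (this is not the grabLinks function itself, which is not reproduced here), a callback passed to HTML::LinkExtor can sort what it finds into links and images. The sample markup and the @links and @images arrays are assumptions for this sketch.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input, invented for this sketch.
my $html = '<a href="/about.html">About</a> <img src="logo.png"> <p>no links here</p>';

my (@links, @images);

# The callback receives the tag name followed by attribute => URL pairs
# for every link-bearing attribute found on that tag.
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links,  $attr{href} if $tag eq 'a'   && defined $attr{href};
    push @images, $attr{src}  if $tag eq 'img' && defined $attr{src};
});

$extor->parse($html);
$extor->eof;

print "link:  $_\n" for @links;
print "image: $_\n" for @images;
```

Note that HTML::LinkExtor reports each tag on its own and does not tell you which elements enclose it, so it will not cover the nesting part of the question.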
Re: HTML Search Engine/Parser by MZSanford (Curate) on Jul 14, 2001 at 12:25 UTC
I agree: LD2 was definitely correct with HTML::Parser. I have used it for complex HTML parsing and found it to be stable.

OH, a sarcasm detector, that's really useful