wbushey has asked for the wisdom of the Perl Monks concerning the following question:

Hello Esteemed Monks of Perl,

I have a problem that I have been working on for a couple of days. I'm a newcomer to Perl, so I may be missing something obvious, but my searching through the web and documentation has not produced a definitive answer.

My problem is that I have some string of text ($text) that I know is somewhere in a page of HTML. What I would like to do is search through the HTML for $text, find the lowest-level element that contains $text, and be able to find the ancestors of that element.

I feel like I am in the right arena with HTML::TreeBuilder and HTML::TokeParser, since TreeBuilder maintains the relationships I am looking for while TokeParser makes it easy to search for text, but I feel like I need some combination of the two. I also feel like I might be overlooking something obvious in one or the other (I guess I have a lot of feelings.)


Replies are listed 'Best First'.
Re: Finding the DOM element via text search
by jrsimmon (Hermit) on Jul 02, 2009 at 14:47 UTC
Re: Finding the DOM element via text search
by Anonymous Monk on Jul 02, 2009 at 14:45 UTC
Re: Finding the DOM element via text search
by wfsp (Abbot) on Jul 03, 2009 at 10:36 UTC
    Here's my go. It recursively builds a list of "ancestors" for each node using $h->lineage. The $h->objectify_text allows text nodes to have ancestors too. When it finds any matching text it prints out the list.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $h = HTML::TreeBuilder->new_from_content(
      do{local $/;<DATA>},
    );
    $h->objectify_text;

    my $text = q{text};
    walk($h, $text);

    sub walk{
      my $h = shift;
      my $text = shift;
      for my $ele ($h->content_list) {
        my @lineage = $ele->lineage;
        my @ancestors;
        for my $ancestor (reverse @lineage){
          push @ancestors, $ancestor->tag;
        }
        if (
          $ele->tag eq q{~text} and
          $ele->attr(q{text}) and
          $ele->attr(q{text}) eq $text
        ) {
          printf( qq{%s\t}, $_ ) for @ancestors;
          printf(
            qq{found *%s* at depth %s\n},
            $ele->attr(q{text}), scalar @ancestors
          );
        }
        walk($ele, $text);
      }
    }
    __DATA__
    <html><head><title>search</title></head>
    <body>
    <p>text</p>
    <div>
      <p>text</p>
    </div>
    </body></html>
    html body p found *text* at depth 3
    html body div p found *text* at depth 4
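
    A variation on the walker above, offered only as a sketch. It assumes you want $text matched as a substring of a text node rather than as the whole node, and it uses look_down with a code reference instead of objectify_text. The helper name element_paths and the sample HTML are made up for illustration. It will not find text that is split across inline elements (for example "te<b>xt</b>").

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one "a > b > c" path for each element that
# has a direct text child containing $text (substring match).
sub element_paths {
    my ($html, $text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @hits = $tree->look_down(
        sub {
            my $ele = shift;
            # content_list returns child elements (refs) and text (plain strings)
            return scalar grep { !ref($_) && index($_, $text) >= 0 }
                $ele->content_list;
        }
    );

    my @paths;
    for my $ele (@hits) {
        # lineage_tag_names runs parent-first, so reverse it to start at the root
        push @paths, join q{ > }, ( reverse $ele->lineage_tag_names ), $ele->tag;
    }

    $tree->delete;    # free the tree once we are done with it
    return @paths;
}

# Sample document, assumed for this sketch only
my $html = <<'HTML';
<html><head><title>search</title></head>
<body>
<p>some text here</p>
<div><p>more <b>text</b></p></div>
</body></html>
HTML

print "$_\n" for element_paths($html, q{text});
```

    For the sample document this should print "html > body > p" and "html > body > div > p > b". Because only direct text children are tested, each hit is the innermost element holding the match, and its ancestors come straight from the lineage.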