Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to use HTML::Parser to do the following:

1) Parse a string variable $doc which contains the contents of an HTML file
2) Do a global substitution on the entire text portion (i.e., the portion not enclosed in tags), so that I can add additional tags to it.

For example, suppose I want to enclose every instance of foo in a bar tag, regardless of whether or not it is interrupted by a tag. Given $str = 'FOOFO<FOO>Oxxx', I would want the resulting $str to be $str = '<BAR>FOO</BAR><BAR>FO<FOO>O</BAR>xxx'

Could anyone give me a code sample of how to go about this? Thanks

update (broquaint): added formatting + <code> tags

Re: Help using HTML::Parser
by Ovid (Cardinal) on Nov 05, 2002 at 23:26 UTC

    You're going to have to be a bit clearer in your specs. Let's say that we find the letters "fo" at the top of the document and, about 200K later, we find the letter "o". Do you want the substitution then? Kind of tough to tell. What if you run into a word that contains the target letters, such as "fools"? Just wrapping "foo" in tags is easy (untested code follows). The following uses HTML::TokeParser::Simple instead of HTML::Parser.

        use HTML::TokeParser::Simple;

        my $p        = HTML::TokeParser::Simple->new( \$original_html );
        my $new_html = '';

        while ( my $token = $p->get_token ) {
            unless ( $token->is_text ) {
                # Tags, comments, etc. pass through untouched
                $new_html .= $token->return_text;
            }
            else {
                my $text = $token->return_text;
                # s{}{} delimiters: the "/" in </bar> would end an s/// early
                $text =~ s{foo}{<bar>foo</bar>}g;
                $new_html .= $text;
            }
        }

    Cheers,
    Ovid

      Ovid, thanks for your response. To answer your questions:

      1) I would want the substitution if I had "fo<some 200K length Tag>o", but NOT if I had fo<some tag>q<some tag>o.
      2) I would also want the "foo" in "fools" to be wrapped, giving "<bar>foo</bar>ls".

      The only reason the code you supplied would not work for me is that it would not do the substitution on "fo<some 200K length Tag>o". I would prefer to be able to do something like this:

          while ( my $token = $p->get_token ) {
              .....
              if ( $token->is_text ) {
                  $text .= $token->return_text;
              }
          }
          $text =~ s{foo}{<bar>foo</bar>}g;

      That's where I get stuck, because I see no way to "merge" my "new" document, which contains only the <bar> tags and the original text, with the original document, which contains the original text and all the other tags. Hope that clarifies.
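
One way to approach the "merge" problem described above is to never separate the two documents in the first place: concatenate the text tokens for matching, but remember which token and offset each character came from, then push the new tags back into the original tokens. The following is only a sketch under assumptions. The helper name wrap_across_tags and its parameters are hypothetical, and it uses return_text as in the code earlier in this thread (newer releases of HTML::TokeParser::Simple document as_is as its replacement).

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): wrap every match of $pattern
# in <$tag>...</$tag>, matching against the text with all markup ignored.
sub wrap_across_tags {
    my ( $html, $pattern, $tag ) = @_;

    my $p = HTML::TokeParser::Simple->new( \$html );

    my @tokens;       # original source of every token, in document order
    my @owner;        # per character of $text: [ token index, offset in token ]
    my $text = '';    # all text tokens concatenated, markup left out

    while ( my $token = $p->get_token ) {
        my $raw = $token->return_text;
        push @tokens, $raw;
        next unless $token->is_text;
        push @owner, map { [ $#tokens, $_ ] } 0 .. length($raw) - 1;
        $text .= $raw;
    }

    # $insert{token index}{offset} = string to splice in before that offset
    my %insert;
    while ( $text =~ /$pattern/g ) {
        my ( $start, $end ) = ( $-[0], $+[0] );
        next if $end == $start;    # ignore empty matches
        my ( $t1, $o1 ) = @{ $owner[$start] };
        my ( $t2, $o2 ) = @{ $owner[ $end - 1 ] };
        # Matches arrive left to right and do not overlap, so a closing tag
        # is always appended before any opening tag at the same offset.
        $insert{$t1}{$o1} .= "<$tag>";
        $insert{$t2}{ $o2 + 1 } .= "</$tag>";
    }

    # Rebuild the document; only text tokens that hold a match boundary change
    my $out = '';
    for my $i ( 0 .. $#tokens ) {
        my $piece = $tokens[$i];
        if ( my $at = $insert{$i} ) {
            # Splice right to left so the earlier offsets stay valid
            for my $off ( sort { $b <=> $a } keys %$at ) {
                substr( $piece, $off, 0, $at->{$off} );
            }
        }
        $out .= $piece;
    }
    return $out;
}

print wrap_across_tags( 'FOOFO<FOO>Oxxx', qr/foo/i, 'BAR' ), "\n";
```

For the sample input from the original question, this is intended to give the requested '<BAR>FOO</BAR><BAR>FO<FOO>O</BAR>xxx'. Because intervening text such as the "q" in fo<some tag>q<some tag>o becomes part of the concatenated string, that case does not match. Note that a match spanning a tag boundary can leave the new element improperly nested with respect to the existing markup, exactly as in the requested output.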