in reply to regex for search and replace of words in HTML

Okay, for what it's worth, I think I've found a reasonable solution to this problem. Comments would be very welcome, as I am concerned about its robustness and reliability.

HTML snippet:

<HTML>
<body poop=smelly>
<p>
This is my text. I <b>hope</b> you like it!
<table>
<tr>
<td>
Would you like to see my Monkey?
</td>
</tr>
</table>
</body>
oh and this too!

And now the parsing code:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Slurp the whole document from STDIN.
my $html = '';
while (<STDIN>) {
    $html .= $_;
}

# Pass 1: record each text run, i.e. the span between a '>' and the next '<'.
# Note: 0 doubles as "not set", so any text before the first tag is skipped.
my $begin = 0;
my $end = 0;
my @excerpts = ();
for (my $i=0;$i<length($html);$i++) {
    if (substr($html,$i,1) eq '>') { $begin = $i + 1; }
    if ($begin && substr($html,$i,1) eq '<') { $end = $i; }
    if ($begin && $end) {
        push @excerpts, { begin => $begin, end => $end };
        $begin = 0;
        $end = 0;
    }
}
# last snippet
if ($begin && !$end) {
    push @excerpts, { begin => $begin, end => length($html) };
}

# Pass 2: record the absolute offset and length of every word in each run.
my @word_pos_list = ();
foreach my $excerpt (@excerpts) {
    my $begin = $excerpt->{begin};
    my $end = $excerpt->{end};
    my $length = $end - $begin;
    my $word_string = substr($html,$begin,$length);
    while ($word_string =~ m/(\b\w+\b)/g) {
        my $word_begin = $begin + $-[0];
        my $word_end = $begin + $+[0];
        my $word_length = $word_end - $word_begin;
        push @word_pos_list, { begin => $word_begin, end => $word_end, length => $word_length, word => $1 };
    }
}

print "Original HTML:\n";
print "$html\n";
print "**************************************\n";
print "New HTML:\n";

# Test: overwrite the seventh word in place using four-argument substr.
my $test_repl = $word_pos_list[6];
my $repl_beg = $test_repl->{begin};
my $repl_length = $test_repl->{length};
substr($html,$repl_beg,$repl_length,'POOP');
print $html;

Thanks!

-Justin

Re^2: regex for search and replace of words in HTML
by tphyahoo (Vicar) on Jun 16, 2005 at 08:35 UTC
    Aren't you just reinventing the wheel? What's wrong with grandfather's solution, which was shorter -> more maintainable? Hang on a minute, I'm going to see if I can break your parser with some ugly input...

    UPDATE: OK, here's something that will break. Let's say there's a comment in your html and the comment has a < or >. This will confuse your parser, because it doesn't check for comments:
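    The test input that went with this update is not reproduced here. As a stand-in, here is a minimal sketch of the kind of input meant. The text_runs helper is my own condensed version of the '>' to '<' scan from the script above, and the sample HTML is made up for illustration:

```perl
use strict;
use warnings;

# Same scanning idea as the script above: everything between a '>'
# and the next '<' is treated as a run of visible text.
# (text_runs is a hypothetical helper, not part of the original script.)
sub text_runs {
    my ($html) = @_;
    my ($begin, $end) = (0, 0);
    my @runs;
    for my $i (0 .. length($html) - 1) {
        my $ch = substr($html, $i, 1);
        $begin = $i + 1 if $ch eq '>';
        $end   = $i     if $begin && $ch eq '<';
        if ($begin && $end) {
            push @runs, substr($html, $begin, $end - $begin);
            ($begin, $end) = (0, 0);
        }
    }
    push @runs, substr($html, $begin) if $begin;   # last snippet
    return @runs;
}

# Made-up input: the comment contains both a '>' and a '<'.
my $html = '<p>Hello</p><!-- hide if a > b or b < c --><p>World</p>';

# Drop the empty runs between adjacent tags, then show what is left.
my @runs = grep { /\S/ } text_runs($html);
print "[$_]\n" for @runs;
```

    This prints [Hello], [ b or b ] and [World]: part of the comment leaks out as if it were page text, so its words would end up in the replacement list.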

    So, do as grandfather says and use HTML::TreeBuilder or one of the HTML::Parser family of modules.

      Yep, yep, yep... you're right. I'm sure there are tons more ways to break my little parsing methodology. So, I took grandfather's suggestion and modified it just a bit. From my devel script:

      use HTML::TreeBuilder;

      my $tree = HTML::TreeBuilder->new();
      $tree->parse($html);
      $tree->eof();

      my %words = ();
      foreach my $word ($tree->as_text() =~ m/(\b\w+\'?\w+)/g) {
          $words{$word} += 1;
      }

      my @word_pos;
      my $key;
      foreach $key (keys %words) {
          pos($html) = 0;
          while ($html =~ />[^<]*?(\b$key\b).*?[<\$]/gis) {
              push @word_pos, [$key, $-[1]];
          }
      }

      MUCH tidier than my solution. The one thing I'm confused about on grandfather's regex is the \$. Should I be concerned with a literal '$' while regexing on the original HTML?

      Thanks,
      Justin

        I don't get the literal dollar in the square bracket either.
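        For what it's worth, inside a bracketed character class a dollar sign is not an end-of-string anchor, so [<\$] only means "a '<' or a literal '$'". A quick check, using sample strings of my own; the (?:<|\z) alternative is just my guess at what may have been intended, not grandfather's code:

```perl
use strict;
use warnings;

# Inside [...] the dollar sign loses its anchor meaning,
# so this class matches a '<' or a literal '$' character.
my $class = qr/[<\$]/;

my $at_end = 'price'  =~ /e$class/ ? 1 : 0;   # nothing follows the 'e'
my $dollar = 'price$' =~ /e$class/ ? 1 : 0;   # a literal '$' follows
my $tag    = 'price<' =~ /e$class/ ? 1 : 0;   # a '<' follows

# Hypothetical alternative if "next tag or end of input" was the intent.
my $or_end = 'price' =~ /e(?:<|\z)/ ? 1 : 0;

print "at_end=$at_end dollar=$dollar tag=$tag or_end=$or_end\n";
# prints: at_end=0 dollar=1 tag=1 or_end=1
```

        So a literal '$' in the page text would end a match early, and trailing text with no tag after it would not match at all.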

        But now taking a step back. I don't really understand what you're trying to do here, big picture... which when you're mucking around with html parsing is often a bad sign. I suspect that it is still breakable.

        What you're trying to do is something along the lines of matching the detagged text against the original document. There are a lot of edge cases here. What happens when words that match in the stripped text also occur in the tags? What happens when you have repeated words? Etc.
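        To make the repeated-word case concrete, here is a small sketch that runs the regex from the post above on a made-up snippet. The input string is my own, and the word "text" appears once in an attribute and twice in the body:

```perl
use strict;
use warnings;

# Made-up input: "text" sits at offset 10 (attribute), 16 and 30 (body).
my $html = '<p class="text">text and more text</p>';
my $key  = 'text';

# The position-finding loop from the post above, unchanged.
my @found;
while ($html =~ />[^<]*?(\b$key\b).*?[<\$]/gis) {
    push @found, $-[1];
}

print "found at: @found\n";
# prints: found at: 16
```

        The attribute is skipped here because a match has to start at a '>', but the second occurrence at offset 30 is never reported: the first match swallows everything up to the next '<', and the search resumes after it.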

        My gut feeling is that you should really be doing the spell check within each tag, rather than fetching the text inside the tags, matching it back up against the original document, and then fixing the original document. That would make the code a heck of a lot easier to read and understand, and that would be a good sign.
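        A minimal sketch of what I mean, assuming HTML::TreeBuilder is installed. The %fix table is a hypothetical stand-in for a real spell checker, and fix_text is a helper name I made up; content_refs_list is the HTML::Element method for editing text nodes in place:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical correction table standing in for a real spell checker.
my %fix = (Monky => 'Monkey');

# Walk the tree and rewrite text nodes only; tag names and
# attribute values are never looked at.
sub fix_text {
    my ($node) = @_;
    for my $ref ($node->content_refs_list) {
        if (ref $$ref) {
            fix_text($$ref);    # child element: recurse into it
        }
        else {
            # plain text node: replace known misspellings word by word
            $$ref =~ s/\b(\w+)\b/exists $fix{$1} ? $fix{$1} : $1/ge;
        }
    }
}

my $tree = HTML::TreeBuilder->new;
$tree->parse('<p class="Monky">Would you like to see my <b>Monky</b>?</p>');
$tree->eof;

fix_text($tree);
my $out = $tree->as_HTML;
$tree->delete;

print "$out\n";
```

        The word inside <b> gets corrected while the class attribute keeps its value, and there are no offsets to carry back into the original string. The catch is that as_HTML re-serializes the document, so the output will not be byte-for-byte the original markup.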

        If you're going to stay with the original solution, you need to write a bunch of test cases to make sure you didn't overlook an edge case. If you want help from the monks, you should post your test script(s) so we can try to break it, which, like I said, I think is likely. You could do this using <DATA> like I did in my post above.

        But I would see if there's some solution that doesn't involve matching back to the original html.

        Alternatively, you could comment up your script to explain better what you're trying to do. I would also add comments inside the regex using the /x modifier.
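        For example, here is the position-finding regex from the post above spelled out with /x. This is only a sketch: the pattern is unchanged, the comments are my reading of it, and the input is one line taken from the sample HTML:

```perl
use strict;
use warnings;

my $html = '<td> Would you like to see my Monkey? </td>';
my $key  = 'Monkey';

my @pos;
while ($html =~ m{
        >               # end of the preceding tag
        [^<]*?          # text before the word, without crossing a '<'
        ( \b $key \b )  # the word itself, as a whole word (capture group 1)
        .*?             # rest of the text, as little as possible
        [<\$]           # a '<' or a literal dollar sign
    }gisx) {
    push @pos, $-[1];   # offset where capture group 1 starts
}

print "@pos\n";
# prints: 30
```

        One thing to watch with /x: variables are still interpolated inside the comments, so keep '$' and '@' out of them.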

        You're on the right track using HTML::TreeBuilder though. Good luck!