jqcoffey has asked for the wisdom of the Perl Monks concerning the following question:

Bonjour,

I have been piecing together a Perl/JavaScript-based UI to Text::Aspell (my question is only about the Perl side of things). I am presented with a completed HTML document and tasked with spell checking all of the words (anything between > and <, effectively).

My problem is finding only the words (nothing inside <>'s) and their associated byte positions in the document. Once I have this, I can relatively easily perform my JS visual transformations on the HTML and then post back the appropriate info to do the actual replacement in Perl.

What I'm struggling with is the regex to use. I've mucked around with $-[0] and $+[0], but am now leaning towards a single s/.../function()/eg regex where the function does the dirty work of building the HTML I need to replace a spell-checkable word with (just some nonsense).
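A minimal sketch of the single s/.../function()/eg idea, assuming simple markup (no '>' inside attribute values, no comments or script blocks). The helper name mark_word and the span markup it emits are hypothetical; the alternation copies tags through untouched and hands every other run of word characters to the helper, with $-[2] giving the word's offset in the original string. Note that entity names such as the "amp" in &amp; would also be treated as words here.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps a word in a span that carries its offset.
sub mark_word {
    my ($word, $offset) = @_;
    return qq{<span class="w" data-pos="$offset">$word</span>};
}

my $html = '<p>I <b>hope</b> you like it!</p>';

# Group 1 matches a whole tag and is put back unchanged.
# Group 2 matches a word outside any tag; $-[2] is where it starts
# in the original string.
(my $marked = $html) =~ s{(<[^>]*>)|(\w+)}{
    defined $1 ? $1 : mark_word($2, $-[2])
}ge;

print "$marked\n";
```

Because the same substitution can be run again on the stored document, the offsets seen on the display side and on the saving side come from the same pattern.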

The same regex needs to be used on both ends (display and final editing before saving in the database).

I really am totally out of starting places on this, as I have been through many iterations of regexes and logic. I'm not even sure whether I should be using a regex at all rather than substr in a while loop... Any hints, advice, or explicit examples would be much appreciated.

Thanks,
Justin

Re: regex for search and replace of words in HTML
by GrandFather (Saint) on Jun 15, 2005 at 22:14 UTC

    You really, really need to look at HTML::TreeBuilder. The text you want just drops out using as_text():

    use HTML::TreeBuilder;

    my $html = ...;
    my $Tree;
    $Tree = HTML::TreeBuilder->new ();
    $Tree->parse ($html);
    $Tree->eof ();
    my $Text = $Tree->as_text();

    Update: Answer the question!

    The code above gets all the text. Munge that into a word hash. Identify the misspelled words, then use a simple regex to pick out the places where the misspelled words are in the original document.
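A minimal sketch of the word-hash step, under stated assumptions: $text stands in for whatever as_text() returns, and the %known hash is a hypothetical stand-in for a real spell checker such as Text::Aspell.

```perl
use strict;
use warnings;

# Assumed input: the plain text extracted from the document.
my $text = 'This is my text. I hope you liek it! Do you liek tea?';

# Hypothetical dictionary standing in for a real spell checker.
my %known = map { lc($_) => 1 } qw(this is my text i hope you like it do tea);

# Word hash: each distinct word mapped to the number of times it occurs.
my %count;
$count{$_}++ for $text =~ /\b(\w+(?:'\w+)*)\b/g;

# Words the stand-in dictionary does not recognise.
my @misspelled = sort grep { !$known{ lc $_ } } keys %count;

print "$_ (seen $count{$_} times)\n" for @misspelled;
```

The hash keeps the number of dictionary lookups down to one per distinct word, however often the word appears.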


    Perl is Huffman encoded by design.
      Thanks for pointing me in that direction... definitely a useful module, but my problem still persists. I got as far as HTML::TreeBuilder gets me with this code:

      # strip html tags
      $text =~ s/<[^>]*>//g;
      # strip special chars
      $text =~ s/&[^;]*;//g;
      # shove resulting words into an array
      my @words = $text =~ /(\w+\'*\w+)/g;

      My problem of being able to find the *exact* instance of a particular word still persists. For example, there might be three occurrences of the word "testy" in a document. The first might need to be replaced with "test," while the second and third remain "testy." Therefore, I need to treat each occurrence separately.

      Also, on my resultant global search and replace, what if someone has included the words "img src" in plain text and they want that changed to "image source"? That would blow up all of my <img src=> tags. I know it's a contrived situation, but I know our users.... I am currently working on this test script:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $html = '';
      while (<STDIN>) {
          $html .= $_;
      }

      my $begin = 0;
      my $end = 0;
      my @excerpts = ();
      for (my $i=0;$i<length($html);$i++) {
          if (substr($html,$i,1) eq '>') {
              $begin = $i + 1;
          }
          if ($begin && substr($html,$i,1) eq '<') {
              $end = $i;
          }
          if ($begin && $end) {
              push @excerpts, { begin => $begin, end => $end };
              $begin = 0;
              $end = 0;
          }
      }
      # last snippet
      if ($begin && !$end) {
          push @excerpts, { begin => $begin, end => length($html) };
      }

      foreach my $excerpt (@excerpts) {
          my $begin = $excerpt->{begin} || 0;
          my $end = $excerpt->{end} || 0;
          my $length = $end - $begin;
          my $word_string = substr($html,$begin,$length);
          # ...still working on search and replaces for $word_string...
      }
      -Justin

      holli has replaced pre tags with code tags

        You might like this:


        Perl is Huffman encoded by design.
Re: regex for search and replace of words in HTML
by jqcoffey (Novice) on Jun 15, 2005 at 23:49 UTC
    Okay, for what it's worth, I think I've found a reasonable solution to this problem. Comments would be very, very welcome, as I am concerned about the robustness and reliability of it....

    HTML snippet:

    <HTML>
    <body poop=smelly>
    <p>
    This is my text. I <b>hope</b> you like it!
    <table>
    <tr>
    <td>
    Would you like to see my Monkey?
    </td>
    </tr>
    </table>
    </body>
    oh and this too!

    And now the parsing code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $html = '';
    while (<STDIN>) {
        $html .= $_;
    }

    my $begin = 0;
    my $end = 0;
    my @excerpts = ();
    for (my $i=0;$i<length($html);$i++) {
        if (substr($html,$i,1) eq '>') {
            $begin = $i + 1;
        }
        if ($begin && substr($html,$i,1) eq '<') {
            $end = $i;
        }
        if ($begin && $end) {
            push @excerpts, { begin => $begin, end => $end };
            $begin = 0;
            $end = 0;
        }
    }
    # last snippet
    if ($begin && !$end) {
        push @excerpts, { begin => $begin, end => length($html) };
    }

    my @word_pos_list = ();
    foreach my $excerpt (@excerpts) {
        my $begin = $excerpt->{begin};
        my $end = $excerpt->{end};
        my $length = $end - $begin;
        my $word_string = substr($html,$begin,$length);
        while ($word_string =~ m/(\b\w+\b)/g) {
            my $word_begin = $begin + $-[0];
            my $word_end = $begin + $+[0];
            my $word_length = $word_end - $word_begin;
            push @word_pos_list, {
                begin  => $word_begin,
                end    => $word_end,
                length => $word_length,
                word   => $1
            };
        }
    }

    print "Original HTML:\n";
    print "$html\n";
    print "**************************************\n";
    print "New HTML:\n";
    my $test_repl = $word_pos_list[6];
    my $repl_beg = $test_repl->{begin};
    my $repl_length = $test_repl->{length};
    substr($html,$repl_beg,$repl_length,'POOP');
    print $html;
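One thing to watch when more than one word is replaced by position: each substr() replacement can change the length of the string, which shifts every later offset. A minimal sketch of one way around that, assuming the offsets were all recorded against the original string, is to apply the replacements from the highest offset down. The input and the %replacement choices below are made up for illustration.

```perl
use strict;
use warnings;

my $html = '<p>This is testy text, very testy.</p>';

# Record every occurrence with its offset in the ORIGINAL string.
my @hits;
while ($html =~ /\btesty\b/g) {
    push @hits, { begin => $-[0], length => $+[0] - $-[0] };
}

# Hypothetical user choices, keyed by occurrence index.
my %replacement = (0 => 'test', 1 => 'tasty');

# Work from the end of the document towards the start so that a
# replacement never moves a position that is still to be processed.
for my $i (sort { $hits[$b]{begin} <=> $hits[$a]{begin} } keys %replacement) {
    substr($html, $hits[$i]{begin}, $hits[$i]{length}, $replacement{$i});
}

print "$html\n";
```

Each occurrence is addressed by its own offset, so one "testy" can be changed while another is left alone or changed to something else.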

    Thanks!

    -Justin

      Aren't you just reinventing the wheel? What's wrong with GrandFather's solution, which was shorter -> more maintainable? Hang on a minute, I'm going to see if I can break your parser with some ugly input...

      UPDATE: OK, here's something that will break. Let's say there's a comment in your html and the comment has a < or >. This will confuse your parser, because it doesn't check for comments:

      So, do as GrandFather says and use HTML::TreeBuilder or one of the HTML::Parser family of modules.
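A hypothetical input of the kind described, run through a compact re-creation of the character scan (the sub name naive_excerpts is made up here). With a comment that contains a '>' followed later by a '<', the text between them is reported as if it were visible page text.

```perl
use strict;
use warnings;

# Re-creation of the scan: a text run starts after the most recent '>'
# and stops at the next '<'.
sub naive_excerpts {
    my ($html) = @_;
    my ($begin, $end) = (0, 0);
    my @excerpts;
    for my $i (0 .. length($html) - 1) {
        my $char = substr($html, $i, 1);
        $begin = $i + 1 if $char eq '>';
        $end   = $i     if $begin && $char eq '<';
        if ($begin && $end) {
            push @excerpts, substr($html, $begin, $end - $begin);
            ($begin, $end) = (0, 0);
        }
    }
    # last snippet, if the document ends with text rather than a tag
    push @excerpts, substr($html, $begin) if $begin;
    return @excerpts;
}

# Hypothetical document: the comment holds both '>' and '<'.
my $html = '<p>ok</p><!-- if a > b and b < c --><p>done</p>';

my @text = grep { /\S/ } naive_excerpts($html);
print "[$_]\n" for @text;
```

The fragment " b and b " comes from inside the comment, so its words would be offered for spell checking even though a browser never displays them.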

        Yep, yep, yep... you're right. I'm sure there are tons more ways to break my little parsing methodology. So, I took GrandFather's suggestion and modified it just a bit. From my devel script:

        use HTML::TreeBuilder;

        my $tree = HTML::TreeBuilder->new();
        $tree->parse($html);
        $tree->eof();

        my %words = ();
        foreach my $word ($tree->as_text() =~ m/(\b\w+\'?\w+)/g) {
            $words{$word} += 1;
        }

        my @word_pos;
        my $key;
        foreach $key (keys %words) {
            pos($html) = 0;
            while ($html =~ />[^<]*?(\b$key\b).*?[<\$]/gis) {
                push @word_pos, [$key, $-[1]];
            }
        }

        MUCH tidier than my solution. The one thing I'm confused about in GrandFather's regex is the \$. Should I be concerned with a literal '$' while regexing on the original HTML?

        Thanks,
        Justin
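On the \$ question: inside a bracketed character class, \$ matches a literal dollar sign; it does not act as an end-of-string anchor there. If the intent was "up to the next tag or the end of the document" (an assumption about what the regex was meant to do), an alternation such as (?:<|\z) says that directly. A minimal sketch with made-up input:

```perl
use strict;
use warnings;

# Hypothetical fragment whose last text run is not followed by any tag.
my $html = '<b>bold</b> trailing text';

# Character class: requires a literal '<' or '$' somewhere after the word.
my $with_class = $html =~ />[^<]*?(\btext\b).*?[<\$]/s ? 1 : 0;

# Alternation: accepts either the next '<' or the end of the string.
my $with_alt = $html =~ />[^<]*?(\btext\b).*?(?:<|\z)/s ? 1 : 0;

print "character class matched: $with_class\n";
print "alternation matched:     $with_alt\n";
```

With the character class, a word in trailing text after the last tag is only found if a '<' or a literal '$' happens to follow it.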