regex for search and replace of words in HTML

jqcoffey has asked for the wisdom of the Perl Monks concerning the following question:

Bonjour,

I have been piecing together a Perl/JavaScript-based UI to Text::Aspell (my question concerns only the Perl side of things). I am presented with a completed HTML document and tasked with spell checking all of the words (effectively, anything between > and <).

My problem is finding only the words (nothing inside <>'s) and associated byte position in the document. Once I have this, I can relatively easily perform my JS visual transformations on the HTML and then post back the appropriate info to do the actual replacement in Perl.

What I'm struggling with is the regex to use. I've mucked around with `$-[0]` and `$+[0]`, but am now leaning towards a single `s/.../function()/eg` regex where the function does the dirty work of building the HTML I need to replace a spell-checkable word with (just some nonsense).

The same regex needs to be used on both ends (display and final editing before saving in the database).

I really am totally out of starting places on this, as I have been through many iterations of regexes and logic. I'm not even sure whether I should be using a regex at all, rather than substr in a while loop... any hints, advice, explicit examples would be much appreciated.

Thanks,
Justin
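
For readers unfamiliar with the variables named in the question: `$-[0]` and `$+[0]` hold the start and end offsets of the most recent successful match. The following is only a minimal sketch with a made-up input string, not the poster's code. It shows that a bare `\w+` yields usable offsets but also picks up tag names, which is exactly the problem described above.

```perl
use strict;
use warnings;

# Hypothetical sample input; any string works.
my $html = '<p>Helo <b>wrld</b></p>';

# Collect [word, start, end] for every run of word characters.
# Note: this naive pattern also matches tag names such as "p" and "b".
my @hits;
while ($html =~ /(\w+)/g) {
    push @hits, [ $1, $-[0], $+[0] ];
}

# Each offset pair can be fed straight back into substr().
for my $hit (@hits) {
    my ($word, $start, $end) = @$hit;
    printf "%-5s %2d..%2d\n", $word, $start, $end;
}
```

Because the offsets index into the original string, `substr($html, $start, $end - $start)` returns the matched word again.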

Re: regex for search and replace of words in HTML by GrandFather (Saint) on Jun 15, 2005 at 22:14 UTC
You really, really need to look at HTML::TreeBuilder. The text you want just drops out using as_text():

```perl
use HTML::TreeBuilder;

my $html = ...;
my $Tree;
$Tree = HTML::TreeBuilder->new ();
$Tree->parse ($html);
$Tree->eof ();
my $Text = $Tree->as_text();
```

Update: Answer the question! The code above gets all the text. Munge that into a word hash. Identify the misspelled words, then use a simple regex to pick out the places where the misspelled words are in the original document.

Perl is Huffman encoded by design.
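
The "word hash" step mentioned in the update could look roughly like this. It is a sketch that assumes the plain text is already in `$text` (for example from `as_text()`); the sample string and the `%count` name are made up for illustration.

```perl
use strict;
use warnings;

# Assumed: $text holds the plain text, e.g. the result of as_text().
my $text = "This is my text. I hope you like my text!";

# Tally each distinct word; an apostrophe inside a word is allowed.
my %count;
$count{$_}++ for $text =~ /\b(\w+(?:'\w+)*)\b/g;

# Each distinct word then needs to be spell-checked only once.
for my $word (sort keys %count) {
    print "$word: $count{$word}\n";
}
```

Keying the hash on the word keeps the number of spell-checker calls down, at the cost of losing the position of each occurrence, which has to be recovered from the original HTML afterwards.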
Re^2: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 15, 2005 at 23:23 UTC
Thanks for pointing me in that direction... definitely a useful module, but my problem still persists. I got as far as HTML::TreeBuilder gets me with this code:

```perl
# strip html tags
$text =~ s/<[^>]*>//g;
# strip special chars
$text =~ s/&[^;]*;//g;
# shove resulting words into an array
my @words = $text =~ /(\w+\'*\w+)/g;
```

My problem of being able to find the exact instance of a particular word still persists. For example, there might be three occurrences of the word "testy" in a document. The first might need to be replaced with "test," while the second and third remain "testy." Therefore, I need to treat each word separately.

Also, on my resultant global search and replace, what if someone has included the words "img src" in plain text and they want that changed to "image source"? That would blow up all of my <img src=> tags. I know it's a contrived situation, but I know our users....

I am currently working on this test script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# slurp the HTML document from STDIN
my $html = '';
while (<STDIN>) {
    $html .= $_;
}

my $begin = 0;
my $end = 0;
my @excerpts = ();

# record the span between each '>' and the next '<'
for (my $i=0;$i<length($html);$i++) {
    if (substr($html,$i,1) eq '>') {
        $begin = $i + 1;
    }
    if ($begin && substr($html,$i,1) eq '<') {
        $end = $i;
    }
    if ($begin && $end) {
        push @excerpts, { begin => $begin, end => $end };
        $begin = 0;
        $end = 0;
    }
}

# last snippet
if ($begin && !$end) {
    push @excerpts, { begin => $begin, end => length($html) };
}

foreach my $excerpt (@excerpts) {
    my $begin = $excerpt->{begin} || 0;
    my $end = $excerpt->{end} || 0;
    my $length = $end - $begin;
    my $word_string = substr($html,$begin,$length);
    # ...still working on search and replaces for $word_string...
}
```

-Justin

holli has replaced pre tags with code tags
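
Both worries raised here (changing only one of several "testy" occurrences, and not touching "img src" inside a tag) come down to recording a per-occurrence offset for words outside tags. The sketch below is one possible way to do that with core Perl only. The sample input is made up, and the tag pattern is a naive assumption that breaks on comments or on a `>` inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical input: "img src" occurs in a tag and in visible text.
my $html = '<p>testy img src <img src="a.png"> testy testy</p>';

# Record every word that lies outside a tag, with its offset in $html.
# Naive assumption: a tag is "<", anything but ">", then ">".
my @words;
my $pos = 0;
for my $chunk (split /(<[^>]*>)/, $html) {
    if ($chunk !~ /^</) {
        while ($chunk =~ /\b(\w+)\b/g) {
            push @words, { word => $1, begin => $pos + $-[1] };
        }
    }
    $pos += length $chunk;    # keep a running offset into $html
}

# Replace only the first "testy"; the later ones are left alone.
my ($first) = grep { $_->{word} eq 'testy' } @words;
substr($html, $first->{begin}, length $first->{word}, 'test');

print "$html\n";
```

Because the capturing `split` returns the tags as separate chunks, the words inside `<img src="a.png">` never enter `@words` and so cannot be replaced by accident.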
Re^3: regex for search and replace of words in HTML by GrandFather (Saint) on Jun 16, 2005 at 03:05 UTC
You might like this: (the code, 1141 bytes, and its output, 556 bytes, were collapsed behind "Read more" links and are not shown here)

Perl is Huffman encoded by design.
Re^4: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 16, 2005 at 22:13 UTC
Re: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 15, 2005 at 23:49 UTC
Okay, for what it's worth, I think I've found a reasonable solution to this problem. Comments would be very, very welcome, as I am concerned about the robustness and reliability of it....

HTML snippet:

```html
<HTML>
<body poop=smelly>
<p>
This is my text. I <b>hope</b> you like it!
<table>
<tr>
<td>
Would you like to see my Monkey?
</td>
</tr>
</table>
</body>
oh and this too!
```

And now the parsing code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# slurp the HTML document from STDIN
my $html = '';
while (<STDIN>) {
    $html .= $_;
}

my $begin = 0;
my $end = 0;
my @excerpts = ();

# record the span between each '>' and the next '<'
for (my $i=0;$i<length($html);$i++) {
    if (substr($html,$i,1) eq '>') {
        $begin = $i + 1;
    }
    if ($begin && substr($html,$i,1) eq '<') {
        $end = $i;
    }
    if ($begin && $end) {
        push @excerpts, { begin => $begin, end => $end };
        $begin = 0;
        $end = 0;
    }
}

# last snippet
if ($begin && !$end) {
    push @excerpts, { begin => $begin, end => length($html) };
}

# find each word in each excerpt, with offsets relative to $html
my @word_pos_list = ();
foreach my $excerpt (@excerpts) {
    my $begin = $excerpt->{begin};
    my $end = $excerpt->{end};
    my $length = $end - $begin;
    my $word_string = substr($html,$begin,$length);
    while ($word_string =~ m/(\b\w+\b)/g) {
        my $word_begin = $begin + $-[0];
        my $word_end = $begin + $+[0];
        my $word_length = $word_end - $word_begin;
        push @word_pos_list, {
            begin  => $word_begin,
            end    => $word_end,
            length => $word_length,
            word   => $1
        };
    }
}

print "Original HTML:\n";
print "$html\n";
print "**************************************\n";
print "New HTML:\n";

# test replacement: overwrite the seventh word found
my $test_repl = $word_pos_list[6];
my $repl_beg = $test_repl->{begin};
my $repl_length = $test_repl->{length};
substr($html,$repl_beg,$repl_length,'POOP');
print $html;
```

Thanks!
-Justin
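
One thing to watch with stored offsets like those in `@word_pos_list`: once a replacement changes the length of the string, every later offset is stale. A common way around this is to apply the replacements from the end of the document towards the start. The sketch below uses a made-up string and a hypothetical `@fixes` list whose `new` key is not part of the script above.

```perl
use strict;
use warnings;

my $html = '<p>quik <b>borwn</b> fox</p>';

# Hypothetical corrections: offset, length, and replacement text,
# in the same shape as the @word_pos_list entries plus a "new" key.
my @fixes = (
    { begin => 3,  length => 4, new => 'quick' },
    { begin => 11, length => 5, new => 'brown' },
);

# Work from the highest offset down so that earlier offsets stay valid
# even when a replacement is longer or shorter than the original word.
for my $fix (sort { $b->{begin} <=> $a->{begin} } @fixes) {
    substr($html, $fix->{begin}, $fix->{length}, $fix->{new});
}

print "$html\n";
```

Applied in ascending order instead, the first fix would lengthen the string by one character and the second would land one position too early.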
Re^2: regex for search and replace of words in HTML by tphyahoo (Vicar) on Jun 16, 2005 at 08:35 UTC
Aren't you just reinventing the wheel? What's wrong with GrandFather's solution, which was shorter and therefore more maintainable? Hang on a minute, I'm going to see if I can break your parser with some ugly input...

UPDATE: OK, here's something that will break. Let's say there's a comment in your HTML and the comment has a < or >. This will confuse your parser, because it doesn't check for comments. (The example, 2 kB, was collapsed behind a "Read more" link and is not shown here.)

So, do as GrandFather says and use HTML::TreeBuilder or one of the HTML::Parser family of modules.
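
As an illustration of the HTML::Parser route, here is a sketch (not code from this thread) that asks the parser for each text chunk together with its offset in the source, so comments containing < or > never reach the word scanner. It assumes HTML::Parser from CPAN is installed and uses its version 3 handler API with the `text` and `offset` argspec items; the input string is made up.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input with a comment that contains both > and <.
my $html = '<p>one <!-- a > b < c --> two</p>';

my @words;
my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for text only; tags and comments go to other handlers.
    text_h => [
        sub {
            my ($text, $offset) = @_;
            while ($text =~ /\b(\w+)\b/g) {
                push @words, { word => $1, begin => $offset + $-[1] };
            }
        },
        'text, offset',
    ],
);
$parser->parse($html);
$parser->eof;

# Each recorded offset points back into the original document.
for my $w (@words) {
    print "$w->{word} at $w->{begin}\n";
}
```

Note that `text` is passed undecoded, so entities such as `&amp;` still appear literally and would need separate handling before spell checking.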
Re^3: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 16, 2005 at 23:45 UTC
Yep, yep, yep... you're right. I'm sure there are tons more ways to break my little parsing methodology. So, I took GrandFather's suggestion and modified it just a bit. From my devel script:

```perl
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();

# tally the distinct words in the document text
my %words = ();
foreach my $word ($tree->as_text() =~ m/(\b\w+\'?\w+)/g) {
    $words{$word} += 1;
}

# locate each word in the original HTML, outside of tags
my @word_pos;
my $key;
foreach $key (keys %words) {
    pos($html) = 0;
    while ($html =~ />[^<]*?(\b$key\b).*?[<\$]/gis) {
        push @word_pos, [$key, $-[1]];
    }
}
```

MUCH tidier than my solution. The one thing I'm confused about in GrandFather's regex is the `\$`. Should I be concerned with a literal '$' while regexing on the original HTML?

Thanks,
Justin
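
On the `\$` question: inside a bracketed character class, `\$` matches a literal dollar sign, not the end of the string. If the intent was "a `<` or the end of the document", a lookahead such as `(?=<|\z)` is one way to express it. This is a sketch of the difference with a made-up input and variable names, not GrandFather's code.

```perl
use strict;
use warnings;

# Hypothetical input whose last text run has no tag after it.
my $html = '<p>first</p> oh and this too';

# With [<\$] the word must be followed, somewhere, by "<" or a literal "$".
my @with_class;
while ($html =~ />[^<]*?\b(too)\b[^<]*[<\$]/g) {
    push @with_class, $-[1];
}

# With a lookahead, "<" or the end of the string both count.
my @with_lookahead;
while ($html =~ />[^<]*?\b(too)\b[^<]*(?=<|\z)/g) {
    push @with_lookahead, $-[1];
}

print "character class: ", scalar(@with_class), " match(es)\n";
print "lookahead:       ", scalar(@with_lookahead), " match(es)\n";
```

Here the character-class version finds nothing, because "too" is followed by neither `<` nor `$`, while the lookahead version finds the word at the end of the document.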
Re^4: regex for search and replace of words in HTML by tphyahoo (Vicar) on Jun 17, 2005 at 08:10 UTC
Re^5: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 21, 2005 at 21:57 UTC
Some notes below your chosen depth have not been shown here