Re: regex for search and replace of words in HTML

You really, really need to look at HTML::TreeBuilder. The text you want just drops out using as_text():

use HTML::TreeBuilder;

my $html = ...;
my $Tree;

$Tree = HTML::TreeBuilder->new ();
$Tree->parse ($html);
$Tree->eof ();

my $Text = $Tree->as_text();

Update: Answer the question!

The code above gets all the text. Munge that into a word hash, identify the misspelled words, then use a simple regex to pick out where each misspelled word occurs in the original document.
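A minimal sketch of those steps, using only core Perl. The sample $html and $text, the %dictionary word list, and the find_word_offsets helper are all made up for illustration; in practice $text would come from as_text() and the dictionary from a real spell checker.

```perl
use strict;
use warnings;

# Assumed sample input; $text stands in for the output of $Tree->as_text().
my $html = '<p>This is a <b>testy</b> page. It is very testy.</p>';
my $text = 'This is a testy page. It is very testy.';

# Hypothetical word list standing in for a real spell checker.
my %dictionary = map { $_ => 1 } qw(this is a page it very);

# Munge the text into a word hash (lowercased word => count).
my %words;
$words{lc $1}++ while $text =~ /(\w+(?:'\w+)?)/g;

# Anything not in the dictionary is treated as misspelled.
my @misspelled = sort grep { !$dictionary{$_} } keys %words;

# Collect the character offset of every whole-word match in the document.
sub find_word_offsets {
    my ($doc, $word) = @_;
    my @offsets;
    push @offsets, $-[0] while $doc =~ /\b\Q$word\E\b/gi;
    return @offsets;
}

for my $word (@misspelled) {
    my @offsets = find_word_offsets($html, $word);
    print "$word: @offsets\n";
}
```

Each offset identifies one occurrence, so occurrences can be handled individually. Note that a regex run over the raw HTML will also match inside tag and attribute names, so the offsets may need filtering.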

Perl is Huffman encoded by design.


Re^2: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 15, 2005 at 23:23 UTC
Thanks for pointing me in that direction... definitely a useful module, but my problem still persists. I got as far as HTML::TreeBuilder gets me with this code:

# strip html tags
$text =~ s/<[^>]*>//g;
# strip special chars
$text =~ s/&[^;]*;//g;
# shove resulting words into an array
my @words = $text =~ /(\w+(?:'\w+)?)/g;

My problem of being able to find the exact instance of a particular word still persists. For example, there might be three occurrences of the word "testy" in a document. The first might need to be replaced with "test," while the second and third remain "testy." Therefore, I need to treat each word separately.

Also, on my resultant global search and replace, what if someone has included the words "img src" in plain text and they want that changed to "image source"? That would blow up all of my <img src=> tags. I know it's a contrived situation, but I know our users....

I am currently working on this test script:

#!/usr/bin/perl
use strict;
use warnings;

my $html = '';
while (<STDIN>) {
    $html .= $_;
}

my $begin = 0;
my $end = 0;
my @excerpts = ();
for (my $i = 0; $i < length($html); $i++) {
    if (substr($html, $i, 1) eq '>') {
        $begin = $i + 1;
    }
    if ($begin && substr($html, $i, 1) eq '<') {
        $end = $i;
    }
    if ($begin && $end) {
        push @excerpts, { begin => $begin, end => $end };
        $begin = 0;
        $end = 0;
    }
}

# last snippet
if ($begin && !$end) {
    push @excerpts, { begin => $begin, end => length($html) };
}

foreach my $excerpt (@excerpts) {
    my $begin = $excerpt->{begin} || 0;
    my $end = $excerpt->{end} || 0;
    my $length = $end - $begin;
    my $word_string = substr($html, $begin, $length);
    # ...still working on search and replaces for $word_string...
}

-Justin
Re^3: regex for search and replace of words in HTML by GrandFather (Saint) on Jun 16, 2005 at 03:05 UTC
You might like this: [the code (1141 bytes) and the output (556 bytes) sat behind "Read more..." links and are not included here]
Re^4: regex for search and replace of words in HTML by jqcoffey (Novice) on Jun 16, 2005 at 22:13 UTC
Welp, that's it, and much more reliable... I usually try to use HTML parsing modules in my code, but couldn't get around that positional roadblock and tried to roll my own.

As an aside, this: `<img src="image.jpg" alt="<><><><">` does quite a handy job of fouling up my "parser" as well.

Much appreciated.

Thanks,
Justin
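For readers landing here without the collapsed code: the following is not GrandFather's solution, only a rough sketch of one way to replace a single occurrence of a word in text while leaving tags alone. It assumes the HTML::TokeParser module (part of the HTML-Parser distribution) is installed; the replace_nth_word helper and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: replace only the $nth occurrence (1-based) of $word
# found in text, leaving tags and attribute values untouched.
sub replace_nth_word {
    my ($html, $word, $nth, $replacement) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my ($out, $seen) = ('', 0);
    while (my $token = $parser->get_token) {
        if ($token->[0] eq 'T') {
            # Text token: the raw text is the second element.
            (my $chunk = $token->[1]) =~
                s/\b(\Q$word\E)\b/++$seen == $nth ? $replacement : $1/ge;
            $out .= $chunk;
        }
        else {
            # Tags, comments, declarations: the original source text is
            # the last element, so it is copied through unchanged.
            $out .= $token->[-1];
        }
    }
    return $out;
}

# Sample markup, including the awkward alt attribute mentioned above.
my $html = '<p>testy one, <img src="testy.jpg" alt="<><><><"> testy two</p>';
print replace_nth_word($html, 'testy', 1, 'test'), "\n";
```

Because the parser is quote-aware, the angle brackets inside the alt value stay part of the tag, and "testy.jpg" in the src attribute is never counted as an occurrence.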