Re^5: regex for search and replace of words in HTML

Immessurable thanks to the Monks for helping me out on this... All of the comments I have received have been quite useful. To this last, I did in fact find an issue with my last iteration of code. I have taken some of tphyahoo's post to heart and poke around more with existing HTML parsing modules. What follows is, at this point, the best I can come up with for reliably getting byte position for text within an HTML document:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;


my $html_file = './test.html';

my $html = '';
open(F,"<$html_file");
while (<F>) {
        $html .= $_;
}
close(F);

my $word_to_repl = $ARGV[0] || 0;
chomp $word_to_repl;

my $p = HTML::TokeParser->new( \$html );

# setup text position info for TokeParser.  The char is
# the token type and the int is the position in the resulting
# array of the unmanipulated text--which is what we want to
# inspect.
my $text_pos = {'S'     => 4,
                'E'     => 2,
                'T'     => 1,
                'C'     => 1,
                'D'     => 1,
                'PI'    => 2 };

my $base_count = 0;
my @word_list = ();
while (my $token = $p->get_token) {
        my $token_type = $token->[0] || '';
        my $token_pos  = $text_pos->{$token_type} || '';

        # die hard if we have any sort of parsing error, as everything
        # is likely screwed as a result, anyway.
        if (!$token_type || !$token_pos) {
                print "Ouch.. parsing error!\n";
                exit 0;
        }

        if ($token_type eq 'T') {
                # got text, run a regex with positional counts
                my $text = $token->[$token_pos];

                # regex grabs all words out of $text.  It *also* grabs
+ HTML &nnnn; type
                # special chars complete with the & and ; so I can ski
+p them.  The
                # "\w+\'?\w+" bit allows me to grab contracted words (
+eg don't), but causes
                # a failure in finding single letter words ("I" and "a
+").
                while ($text =~ m/(\&?\b\w+\'?\w+?\b\;?)/g) {
                        # skip if this is a &nnnn; style HTML char
                        if ($1 !~ /^\&/) {
                                # start byte is the summation of base_
+count and where
                                # this regex started off.
                                my $start = $base_count + $-[0];
                                push @word_list, { word => $1, start =
+> $start };
                        }
                }
        }

        # increment base_count with the length of this segment
        $base_count += length($token->[$token_pos]);
}


print "Original HTML:\n";
print "----------------------------------\n";
print "$html\n\n";

my $word_href   = $word_list[$word_to_repl];
my $start       = $word_href->{start};
my $word        = $word_href->{word};
my $offset      = length($word);

print "Replacing [$word] at ($start,$offset)\n\n";
substr($html,$start,$offset,'POOP');

print "New HTML:\n";
print "----------------------------------\n";
print "$html\n\n";
[download]

This is just a test script that hasn't been integrated into my devel code just yet. It expects an html file in the pwd called test.html, as written.

Now, to fully explain what the heck I'm trying to do...

I have been tasked with writing a spell checker that functions on a user supplied HTML document. It has a Javascript UI that handles the highlighting of misspelled words, creating popups for offering word suggestions, and keeping track of changes made to the document.

The Perl layer needs to find all words, and their starting byte position. Text::Aspell checks the words and offers suggestions if they are misspelled.

In order for the Javascript I wrote to work, I have to replace each misspelled word with the following:

<SPAN ID=BYTE_POS>original_word</SPAN>

The span id allows me to manipulate the word.

In order to accomplish the search and replace aspect, I build a list of hashes containing all words and their start bytes. I sort and iterate over this list in descending byte position order, which insures that all my starting byte positions remain valid during the replacement operation.

Once the user has made all of the changes to the document, they post back to the Perl script which gets a list of byte positions and replacement words. Again, it will iterate in reverse byte order as it does the replacements and commits the changes to the database.

This is the best strategy I could come up with, and it is darn near functional. I would love it if someone could poke a hole in script I've posted above for finding accurate byte positions, but it does seem to work fairly well.

Thanks,
Justin

Comment on Re^5: regex for search and replace of words in HTML Select or Download Code