Immeasurable thanks to the Monks for helping me out on this... All of the comments I have received have been quite useful. On that note, I did in fact find an issue with my last iteration of the code. I have taken some of tphyahoo's post to heart and poked around more with existing HTML parsing modules. What follows is, at this point, the best I can come up with for reliably getting the byte position of text within an HTML document:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;

my $html_file = './test.html';
my $html = '';
open(my $fh, '<', $html_file) or die "Cannot open $html_file: $!\n";
while (<$fh>) {
    $html .= $_;
}
close($fh);

my $word_to_repl = $ARGV[0] || 0;
chomp $word_to_repl;

my $p = HTML::TokeParser->new( \$html );

# setup text position info for TokeParser.  The char is
# the token type and the int is the position in the resulting
# array of the unmanipulated text--which is what we want to
# inspect.
my $text_pos = {'S' => 4, 'E' => 2, 'T' => 1, 'C' => 1, 'D' => 1, 'PI' => 2 };

my $base_count = 0;
my @word_list = ();
while (my $token = $p->get_token) {
    my $token_type = $token->[0] || '';
    my $token_pos = $text_pos->{$token_type} || '';

    # die hard if we have any sort of parsing error, as everything
    # is likely screwed as a result, anyway.
    if (!$token_type || !$token_pos) {
        die "Ouch.. parsing error!\n";
    }

    if ($token_type eq 'T') {
        # got text, run a regex with positional counts
        my $text = $token->[$token_pos];

        # regex grabs all words out of $text.  It *also* grabs HTML &nnnn; type
        # special chars complete with the & and ; so I can skip them.  The
        # "\w+\'?\w+" bit allows me to grab contracted words (eg don't), but causes
        # a failure in finding single letter words ("I" and "a").
        while ($text =~ m/(\&?\b\w+\'?\w+?\b\;?)/g) {
            # skip if this is a &nnnn; style HTML char
            if ($1 !~ /^\&/) {
                # start byte is the summation of base_count and where
                # this regex started off.
                my $start = $base_count + $-[0];
                push @word_list, { word => $1, start => $start };
            }
        }
    }

    # increment base_count with the length of this segment
    $base_count += length($token->[$token_pos]);
}

print "Original HTML:\n";
print "----------------------------------\n";
print "$html\n\n";

my $word_href = $word_list[$word_to_repl];
my $start = $word_href->{start};
my $word = $word_href->{word};
my $offset = length($word);

print "Replacing [$word] at ($start,$offset)\n\n";
substr($html,$start,$offset,'POOP');

print "New HTML:\n";
print "----------------------------------\n";
print "$html\n\n";
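One possible way around the single-letter-word limitation noted in the regex comment above is sketched below. This is only a sketch, not something I have run against real documents: find_words is a hypothetical helper name, and the pattern assumes entities always end with a semicolon (an unterminated "&amp" would be picked up as the word "amp").

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): return every word in
# a chunk of raw HTML text together with its offset inside that chunk.
sub find_words {
    my ($text) = @_;
    my @found;
    # First alternative: an entity such as &amp; or &#169;, matched only so
    # that it can be skipped.  Second alternative: a word with optional
    # apostrophe parts, which covers "I", "a" and "don't" alike.
    while ($text =~ /(&\#?\w+;)|(\w+(?:'\w+)*)/g) {
        next if defined $1;    # entity, not a word
        push @found, { word => $2, start => $-[2] };
    }
    return @found;
}

my @words = find_words("I don't like a &amp; b");
printf "%-6s %d\n", $_->{word}, $_->{start} for @words;
```

In the main loop, the start value would still need $base_count added to it, exactly as the existing code does with $-[0].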

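This is just a test script that hasn't been integrated into my devel code yet. As written, it expects an HTML file called test.html in the current working directory, and it takes the index of the word to replace as an optional first argument (defaulting to 0).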

Now, to fully explain what the heck I'm trying to do...

I have been tasked with writing a spell checker that functions on a user-supplied HTML document. It has a JavaScript UI that handles the highlighting of misspelled words, creating popups for offering word suggestions, and keeping track of changes made to the document.

The Perl layer needs to find all words and their starting byte positions. Text::Aspell checks the words and offers suggestions if they are misspelled.

In order for the JavaScript I wrote to work, I have to replace each misspelled word with the following:

<SPAN ID=BYTE_POS>original_word</SPAN>

The span id allows me to manipulate the word.

In order to accomplish the search and replace, I build a list of hashes containing all words and their start bytes. I sort this list and iterate over it in descending byte position order, which ensures that all of the remaining starting byte positions stay valid during the replacement operation.
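The descending-order replacement can be sketched roughly as follows. This is a minimal illustration rather than my production code: wrap_misspelled is a hypothetical name, the sample document and offsets are made up, and the list passed in is assumed to have the same { word, start } shape as @word_list in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each flagged word in a SPAN whose ID is the
# word's starting offset in the original document.
sub wrap_misspelled {
    my ($html, @misspelled) = @_;
    # Work from the end of the document toward the start, so each insertion
    # only shifts text that has already been processed.
    for my $w (sort { $b->{start} <=> $a->{start} } @misspelled) {
        my $len = length $w->{word};
        # Sanity check: skip the entry if the offset no longer points at
        # the expected word.
        next unless substr($html, $w->{start}, $len) eq $w->{word};
        substr($html, $w->{start}, $len,
               qq{<SPAN ID=$w->{start}>$w->{word}</SPAN>});
    }
    return $html;
}

my $html = '<p>Teh quick brwn fox</p>';
my @misspelled = (
    { word => 'Teh',  start => 3 },
    { word => 'brwn', start => 13 },
);
print wrap_misspelled($html, @misspelled), "\n";
```

Because the list is sorted inside the helper, the caller can pass the words in any order; going front to back instead would invalidate every offset after the first replacement.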

Once the user has made all of the changes to the document, they post back to the Perl script, which gets a list of byte positions and replacement words. Again, it iterates in reverse byte order as it does the replacements, then commits the changes to the database.

This is the best strategy I could come up with, and it is darn near functional. I would love it if someone could poke a hole in the script I've posted above for finding accurate byte positions, but it does seem to work fairly well.

Thanks,
Justin


In reply to Re^5: regex for search and replace of words in HTML by jqcoffey
in thread regex for search and replace of words in HTML by jqcoffey
