But now, taking a step back: I don't really understand what you're trying to do here, big picture, which is often a bad sign when you're mucking around with HTML parsing. I suspect the code is still breakable.
What you're attempting is something along the lines of matching the tag-stripped text back up against the original document. There are a lot of edge cases here. What happens when a word in the stripped text also occurs inside a tag? What happens when you have repeated words? Etc.
My gut feeling is that you should really be doing the spell check within each tag's text, rather than extracting the text from the tags, matching it back up against the original document, and then fixing the original document. That would make the code a heck of a lot easier to read and understand, and that would be a good sign.
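To make that concrete, here's a rough sketch of what I mean by fixing the text in place, node by node. The `%fix` table and the `fix_text_nodes` name are made up for illustration (a real script would ask a proper spell checker), and I'm assuming HTML::TreeBuilder from CPAN is installed. The idea is just that attributes never get touched because they aren't text children.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical correction table, for illustration only.
my %fix        = ( teh => 'the', recieve => 'receive' );
my $misspelled = join '|', map { quotemeta } sort keys %fix;

# Correct misspellings in text nodes only; tags and attributes are left alone.
sub fix_text_nodes {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $el ( $tree, $tree->descendants ) {
        next if $el->tag =~ /^(?:script|style)$/;    # not prose, skip it
        for my $ref ( $el->content_refs_list ) {
            next if ref $$ref;    # a child element; it gets its own turn
            $$ref =~ s/\b($misspelled)\b/$fix{$1}/g;
        }
    }

    # '<>&' limits entity escaping; {} asks for every end tag to be written.
    my $fixed = $tree->as_HTML( '<>&', undef, {} );
    $tree->delete;
    return $fixed;
}

print fix_text_nodes('<p class="teh">I recieve teh mail, teh end.</p>'), "\n";
```

Note that TreeBuilder wraps a fragment in html/head/body, so the output won't be byte-for-byte your input. The point is that there's no matching-back step at all: repeated words and words that also show up in attributes stop being special cases.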
If you're going to stay with the original solution, you need to write a bunch of test cases to make sure you didn't overlook an edge case. If you want help from the monks, you should post your test script(s) so we can try to break it, which, like I said, I think is likely. You could do this using <DATA> like I did in my post above.
But I would first see if there's some solution that doesn't involve matching back to the original HTML.
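Something like the following is the kind of test script I have in mind. It's only a sketch: `naive_fix` is a deliberately simple stand-in for whatever routine you're testing, and the `input => expected` line format is my own invention. I read the cases from an in-memory filehandle here so the snippet stands alone; in a posted script the same lines would sit under `__DATA__` and be read with `<DATA>`.

```perl
use strict;
use warnings;

# Hypothetical routine under test: a global substitution that ignores markup.
sub naive_fix {
    my ($html) = @_;
    $html =~ s/\bteh\b/the/g;
    return $html;
}

# One case per line, written as: input => expected output
my $cases = <<'END_CASES';
<p>teh cat</p> => <p>the cat</p>
<p>teh cat saw teh dog</p> => <p>the cat saw the dog</p>
<a href="teh.html">teh link</a> => <a href="teh.html">the link</a>
END_CASES

# Run every case through $code and return the inputs that came out wrong.
sub run_cases {
    my ( $code, $text ) = @_;
    my @failed;
    open my $fh, '<', \$text or die "cannot read cases: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        my ( $input, $expected ) = split /\s+=>\s+/, $line, 2;
        push @failed, $input if $code->($input) ne $expected;
    }
    close $fh;
    return @failed;
}

my @failed = run_cases( \&naive_fix, $cases );
print "FAILED: $_\n" for @failed;
```

The third case is the "word also occurs in the tags" problem: the naive routine rewrites the href as well, so that line gets reported. Anyone who thinks of a new way to break the code just adds a line.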
Alternatively, you could comment up your script to explain better what you're trying to do. I would also comment the regexes themselves using the /x modifier.
You're on the right track using HTML::TreeBuilder though. Good luck!
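In case the /x style is new to you, here's a small sketch. The `%fix` table is invented for the example; the only point is that /x lets you spread the pattern over several lines and annotate each piece.

```perl
use strict;
use warnings;

# Hypothetical correction table, for illustration only.
my %fix         = ( teh => 'the', recieve => 'receive' );
my $alternation = join '|', map { quotemeta } sort keys %fix;

my $word_re = qr{
    \b                  # start at a word boundary
    ( $alternation )    # capture one of the known misspellings
    \b                  # end at a word boundary
}x;

( my $fixed = 'I recieve teh mail' ) =~ s/$word_re/$fix{$1}/g;
print "$fixed\n";    # prints "I receive the mail"
```

Under /x, whitespace in the pattern is ignored and `#` starts a comment, so if you ever need to match a literal space or `#` you have to escape it.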
In reply to Re^4: regex for search and replace of words in HTML
by tphyahoo
in thread regex for search and replace of words in HTML
by jqcoffey