I don't get the literal dollar in the square bracket either.

But taking a step back: I don't really understand what you're trying to do here, big picture... which, when you're mucking around with HTML parsing, is often a bad sign. I suspect your code is still breakable.

What you're trying to do is something along the lines of matching the detagged text back up with the original document. There are a lot of edge cases here. What happens when words in the stripped text also occur inside the tags? What happens when you have repeated words? Etc.

My gut feeling is that you should really be doing the spell check within each tag, rather than fetching the text inside the tags, matching that back up against the original document, and then fixing the original document. That would make the code a heck of a lot easier to read and understand... and that would be a good sign.
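To make "spell check within each tag" a bit more concrete, here is a rough sketch, not a drop-in fix. It assumes HTML::TreeBuilder is installed, and the %fix table and the fix_text_nodes name are made up for illustration; a real spell checker would replace the hash lookup. The point is that only text nodes are ever edited, so tag names and attribute values are never touched and nothing has to be matched back to the original document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical corrections table; stands in for a real spell checker.
my %fix = ( teh => 'the', recieve => 'receive' );

sub fix_text_nodes {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Visit the root element and every element below it.
    for my $el ( $tree, $tree->descendants ) {
        # content_refs_list returns a reference to each child;
        # a child that is a plain scalar is a text node.
        for my $ref ( $el->content_refs_list ) {
            next if ref $$ref;    # child element, handled on its own turn
            $$ref =~ s/\b(\w+)\b/exists $fix{lc $1} ? $fix{lc $1} : $1/ge;
        }
    }

    my $out = $tree->as_HTML;
    $tree->delete;                # free the tree
    return $out;
}

print fix_text_nodes('<p class="teh">I recieve teh mail</p>'), "\n";
```

Note that as_HTML writes out a full document (html, head and body wrappers), and that this sketch would also touch text inside script or style elements, so you'd probably want to skip those.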

If you're going to stay with the original solution, you need to write a bunch of test cases to make sure you didn't overlook an edge case. If you want help from the monks, you should post your test script(s) so we can try to break it, which, like I said, I think is likely. You could do this using <DATA> like I did in my post above.
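Something like the following is the kind of test script I have in mind. It is only a sketch: fix_html here is a deliberately naive stand-in (it splits on anything that looks like a tag) so that the script runs on its own, and you would swap in your real routine. The "input => expected" line format in the data section is just a convention I made up for this example.

```perl
use strict;
use warnings;
use Test::More;

# Stand-in for the routine under test (hypothetical): fixes 'teh' -> 'the'
# in text only, by splitting the input on anything that looks like a tag.
sub fix_html {
    my ($html) = @_;
    my @parts = split /(<[^>]*>)/, $html;
    for my $part (@parts) {
        next if $part =~ /^</;    # leave tags alone
        $part =~ s/\bteh\b/the/g;
    }
    return join '', @parts;
}

# Each data line is: input => expected
while ( my $line = <DATA> ) {
    chomp $line;
    next unless $line =~ /=>/;    # skip blank or malformed lines
    my ( $in, $want ) = split /\s*=>\s*/, $line, 2;
    is( fix_html($in), $want, $in );
}
done_testing();

__DATA__
<p>teh cat</p> => <p>the cat</p>
<p class="teh">teh cat</p> => <p class="teh">the cat</p>
<p>teh teh</p> => <p>the the</p>
<p>teh <b>teh</b> cat</p> => <p>the <b>the</b> cat</p>
```

The second and third lines are the two edge cases I mentioned: a word that also occurs inside a tag, and a repeated word. A case worth adding is a literal > inside a quoted attribute value; I'd expect the naive stand-in above to get that one wrong, which is exactly the sort of thing these scripts are for.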

But I would see if there's some solution that doesn't involve matching back to the original HTML.

Alternatively, you could comment up your script to explain better what you're trying to do. I would also use regex comments via the /x modifier.
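For what it's worth, this is roughly what a commented regex looks like with /x. The pattern is an illustrative, naive tag matcher of my own, not the one from your script, and it would still trip over things like a > inside an attribute value:

```perl
use strict;
use warnings;

# With /x, whitespace in the pattern is ignored and '#' starts a comment.
my $tag_re = qr{
    <                 # opening angle bracket
    /?                # optional slash, for a closing tag
    [a-zA-Z][\w:-]*   # tag name
    [^>]*             # attributes, naively: anything up to the next '>'
    >                 # closing angle bracket
}x;

my $html = '<p class="x">teh <b>cat</b></p>';
( my $text = $html ) =~ s/$tag_re//g;    # strip everything that matches
print "$text\n";
```

One thing to watch for: under /x a literal space or # in the pattern has to be escaped or put in a character class.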

You're on the right track using HTML::TreeBuilder though. Good luck!


In reply to Re^4: regex for search and replace of words in HTML by tphyahoo
in thread regex for search and replace of words in HTML by jqcoffey
