Nodonomy has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble constructing regular expressions and would like to ask your help on this one: Given text containing HTML numbers of the form &#8220; and &amp;, what would be a regexp (or regexps) to strip these from the text?
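Taken literally, the question could be answered with a single substitution. The following is only a sketch: the sample text is made up, and the pattern assumes every reference is well formed and ends with a semicolon.

```perl
use strict;
use warnings;

# Hypothetical sample text; the entity forms match those in the question.
my $text = '&#8220;Apples &amp; Oranges&#8221;';

# Remove decimal (&#8220;), hexadecimal (&#x201C;) and named (&amp;) references.
# Assumes each reference is terminated by a semicolon.
(my $stripped = $text) =~ s/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);//g;

print "$stripped\n";
```

Note that stripping discards the characters the references stood for, so the ampersand between the two words is lost as well.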

Re: regexp question
by Anonymous Monk on Jan 29, 2011 at 02:40 UTC

      I have used HTML::Entities encode_entities($text) to convert all the quotes, ampersands, and the like in my text so that I can do an 'insert' of the text into MySQL.

      Now my problem is how to search on this text (or a subset of it) with all the HTML numbers in it. In other words, how can I search on "Apples & Oranges" when the actual text now in the table is "Apples &amp; Oranges"?

      I thought maybe I should just remove the HTML numbers (such as &#8220; and &amp;) and settle for that.

      I feel a little sheepish asking this, since it seems like something I should know. In fact, though, this seems like a problem many people have, and there are probably some consistent ways to address it. Unfortunately, I'm not familiar with such an approach.

        how can I search on "Apples & Oranges" when

        You really can't. "A" could be stored as any of

        • A
        • &#65;
        • &#065;
        • &#0065;
        • ...
        • &#x41;
        • &#x041;
        • &#x0041;
        • ...

        You'd need to decode the text to search it.
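A minimal sketch of decoding before searching, assuming the HTML::Entities module from CPAN is installed. The stored values below are hypothetical and stand in for rows fetched from the table.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical stored values, each encoding the same text differently.
my @stored = (
    'Apples &amp; Oranges',
    'Apples &#38; Oranges',
    'Apples &#x26; Oranges',
);

my $needle = 'Apples & Oranges';
my @matches;

for my $row (@stored) {
    # Decode into a separate variable, then search the plain text.
    my $plain = decode_entities($row);
    push @matches, $row if index($plain, $needle) >= 0;
}

print "matched: $_\n" for @matches;
```

All three variants decode to the same plain string, so a single search term finds each of them.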

        the actual text now in the table is "Apples &amp; Oranges"?

        Why?

        (How did you even reach this point from asking how to decode HTML text into plain text?)

      Hey, that module uses regex :p
Re: regexp question
by elef (Friar) on Jan 29, 2011 at 17:17 UTC
    So why don't you
    1) do your searches before encoding
    or
    2) decode to some temporary location for the purpose of searching
    or
    3) encode your search expression with the same encoder and procedure as well

    ...?
      I believe #3 is the correct way.
        No, it's a lot of needless work.
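For comparison, option 3 from the list above might look like the sketch below. It assumes HTML::Entities is installed and that every stored value went through encode_entities with its default settings; the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical stored value, produced by encode_entities at insert time.
my $stored = encode_entities('Apples & Oranges');

# Run the search term through the same encoder before comparing.
my $needle = encode_entities('Apples & Oranges');

my $found = index($stored, $needle) >= 0;
print "found\n" if $found;
```

This only works while every row was encoded the same way. If some rows hold &#38; and others &amp;, the encoded search term will miss some of them, which is the objection raised earlier in the thread.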