in reply to Re: Re: test if a string contains a list member
in thread test if a string contains a list member

While this is not particularly relevant in your situation, it should be kept in mind that this approach has limitations. It doesn't scale that well because of the way the regex engine works, and the simple conversion of the banned list to a regex breaks on regex metacharacters. 'SH|T' would blow it, for instance, because the | is read as alternation.
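To make the 'SH|T' failure concrete, here is a minimal sketch. The banned list and the sub names build_naive and build_safe are made up for illustration; the only real machinery is Perl's built-in quotemeta, which backslashes the metacharacters so each word is matched literally.

```perl
use strict;
use warnings;

# Naive conversion: join the words as-is, so any | or ( in a word
# is interpreted as regex syntax.
sub build_naive {
    my $alt = join '|', @_;
    return qr/$alt/;
}

# Escaped conversion: quotemeta() backslashes every metacharacter
# before the words are joined.
sub build_safe {
    my $alt = join '|', map { quotemeta } @_;
    return qr/$alt/;
}

# Hypothetical banned list; 'SH|T' is the example from the post.
my @banned = ('SH|T', 'darn');
my $naive  = build_naive(@banned);
my $safe   = build_safe(@banned);

for my $text ('TOTALLY FINE', 'what a SH|T show') {
    printf "%-18s naive=%s safe=%s\n", $text,
        ($text =~ $naive ? 'hit' : 'miss'),
        ($text =~ $safe  ? 'hit' : 'miss');
}
```

The naive pattern is effectively /SH|T|darn/, so any text containing a capital T is flagged; the escaped pattern only fires on the literal string.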

A more sophisticated approach might be to keep a hash of banned words with associated hand-written regexes to match them. On the fly you could either match against each in turn, maximizing the optimizations available to the regex engine, or more simply concatenate them all together as you are doing here. Either way you would at least have the certainty of knowing that each regex fragment used is correct (or as correct as you can make it).
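A rough sketch of that idea, showing both options. Everything here is hypothetical: the %banned table, its two patterns, and the sub names banned_words_in and contains_banned are invented for the example, not taken from the code under discussion.

```perl
use strict;
use warnings;

# Hypothetical table: banned word => hand-written pattern that matches it.
my %banned = (
    'SH|T' => qr/SH\|T/i,          # the | is escaped by hand
    'darn' => qr/\bd+a+r+n+\b/i,   # also catches stretched spellings
);

# Option 1: match against each pattern in turn and report which words hit.
sub banned_words_in {
    my ($text) = @_;
    return grep { $text =~ $banned{$_} } sort keys %banned;
}

# Option 2: join the vetted fragments into a single alternation.
# Each fragment is wrapped in (?:...) so its own flags and any
# alternation inside it stay contained.
my $combined = do {
    my $alt = join '|', map { "(?:$banned{$_})" } sort keys %banned;
    qr/$alt/;
};

sub contains_banned {
    my ($text) = @_;
    return $text =~ $combined ? 1 : 0;
}

print join(', ', banned_words_in('well daaarn, SH|T happens')), "\n";
```

Option 1 tells you which word matched, at the cost of one pass per pattern; option 2 gives a single yes/no test. In both cases the fragments are written and checked by a person rather than generated from raw strings.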

Again, I realise this might be too much for this particular situation, but it's worth considering. You'd be surprised where bugs from this type of approach show up. The other day I was playing with HTML::TableExtract, which uses a very similar mechanism to scan for table column headers. It failed very oddly when a parenthesis or | was in the header name. Oddly enough that it took me a while to track down... ;-)

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re (tilly) 4: test if a string contains a list member
by tilly (Archbishop) on Oct 21, 2001 at 19:32 UTC
    There is an implementation of that method at RE (tilly) 4: SAS log scanner, along with discussion and benchmarks of various methods in that general thread.
      Heh. So that makes three implementations of generating regexen from tries that I know of. I wonder how many more are out there? :-) There's one on CPAN, I wrote one, and now I see yours. Surprisingly, the author of Tree::Trie didn't notice the application, even though he has the framework.

      Yves

Re(4): test if a string contains a list member
by mojotoad (Monsignor) on Nov 07, 2002 at 16:52 UTC
    The other day I was playing with HTML::TableExtract that uses a very similar mechanism to scan for table column headers. It failed very oddly when a parenthesis or | was in the header name.

    For what it's worth, I do mention in the TE docs that header strings get turned into case-insensitive regular expression strings...so regexp special characters need to be escaped first.

    Perhaps more insidious, however, is when people are dealing with headers where one is a substring of another. Order is important in that case. Think m/Hubba|Hubbadandy/ and you'll see what I mean: the first alternative always wins, so the longer one never gets a chance. It's not hard to fix, but I need to patch the module to issue a warning, since ordering of columns is a feature of the module.
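The ordering problem can be seen in a few lines. This is only an illustration and not how HTML::TableExtract builds its patterns; the subs alternation and first_match are hypothetical, and sorting longest-first is one possible workaround, assuming the caller does not depend on the original order.

```perl
use strict;
use warnings;

# Build a case-insensitive, capturing alternation from literal strings,
# keeping the order in which they were passed.
sub alternation {
    my $alt = join '|', map { quotemeta } @_;
    return qr/($alt)/i;
}

# Return the text captured by the alternation, or undef on no match.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

# Hypothetical header names, one a prefix of the other.
my @headers = ('Hubba', 'Hubbadandy');

my $as_given      = alternation(@headers);
my $longest_first = alternation(sort { length($b) <=> length($a) } @headers);

print first_match($as_given,      'Hubbadandy'), "\n";   # prints "Hubba"
print first_match($longest_first, 'Hubbadandy'), "\n";   # prints "Hubbadandy"
```

Perl tries alternatives left to right at each position, so putting the longer string first lets it win whenever it is actually present.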

    Matt