Re: test if a string contains a list member

You can build your regex on the fly:

my @bad = qw(fork smurf);
my $regex = join('|', @bad);
$regex = qr/$regex/;

foreach (<FH>){
    if(/^[0000-9999](.+)/){
        next if /$regex/;
        push( @witty_quotes,  substr( $_, 5) );
    } else { next };
}

Some other notes:

Your quote is already stored in $1 after the first pattern match, so you don't need the substr().

The else { next }; after the if() is redundant.

/^[0000-9999](.+)/ doesn't do what you think it does. :) [0000-9999] is a character class and is the same as [0-9], so it matches a single digit, not four. Use \d{4} to match four digits.

You can rewrite your code like so:

use strict;

my @witty_quotes;
my @bad = qw(fork smurf);
my $regex = join('|', @bad);
$regex = qr/$regex/;

# open() FH somewhere

foreach (<FH>){
    # Note: it's important to put this regex test
    # outside of the if() block to ensure that $1
    # below comes from the correct pattern match
    next if /$regex/;

    if(/^\d{4}(.+)/){ 
        push( @witty_quotes,  $1);
    }
}

-Matt
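
To make the character class point concrete, here is a small sketch comparing the two patterns. The sample line and the variable names are made up for illustration; the separator after the digits is assumed to be a hyphen.

```perl
use strict;
use warnings;

# Hypothetical sample line: four digits, a separator, then the quote.
my $line = "0042-Never trust a smurf with a fork.\n";

# [0000-9999] is a character class holding '0', the range 0-9, and '9',
# i.e. one single digit, so the capture starts right after the FIRST digit.
my ($class_capture) = $line =~ /^[0000-9999](.+)/;

# \d{4} consumes all four digits; putting the separator in the pattern
# keeps it out of the capture as well.
my ($digits_capture) = $line =~ /^\d{4}-(.+)/;

print "class:  $class_capture\n";
print "digits: $digits_capture\n";
```

Since . does not match a newline by default, neither capture includes the trailing "\n", which substr() would have kept.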

Re: Re: test if a string contains a list member by mull (Monk) on Oct 21, 2001 at 02:05 UTC
Thanks again for all the help! I went over my script in light of all the excellent comments, and now I have something I think is much nicer.

my @offensive_words = qw( smurf fork rake );
my $bad = join('|', @offensive_words);
$bad = qr($bad);
...
open FH, $filename;
while(<FH>){
    /($bad)/oi and push( @rude_quotes, $1) and next;
    /^\d{4}-(.+)/os and push( @witty_quotes, $1 );
}

So now I get what I wanted, and also a nice array of rude quotes, which is sure to come in handy... Thanks again!
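
Two details in a snippet like the one above are easy to miss: a compiled qr// object carries its own flags, so an /i on the outer match does not reach the interpolated pattern, and $1 from /($bad)/ holds the matched word, not the quote. The following is only a sketch of one way to handle both, assuming the goal is to collect whole quotes. The input lines are invented and stand in for the file.

```perl
use strict;
use warnings;

my @offensive_words = qw( smurf fork rake );

# Put /i on the qr// itself: a compiled qr// object keeps its own flags,
# so an /i on the outer match would not apply to the interpolated pattern.
my $alternation = join '|', @offensive_words;
my $bad         = qr/$alternation/i;

# Hypothetical input lines standing in for the file contents.
my @lines = (
    "0001-A Smurf walks into a bar\n",
    "0002-Time flies like an arrow\n",
);

my ( @rude_quotes, @witty_quotes );
for my $line (@lines) {
    # Only lines in the NNNN-quote format are considered at all.
    next unless $line =~ /^\d{4}-(.+)/;
    my $quote = $1;    # copy $1 before running another match

    if ( $quote =~ $bad ) { push @rude_quotes,  $quote }
    else                  { push @witty_quotes, $quote }
}

print "rude:  @rude_quotes\n";
print "witty: @witty_quotes\n";
```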
Re: Re: Re: test if a string contains a list member by demerphq (Chancellor) on Oct 21, 2001 at 17:41 UTC
While not a point that is particularly relevant in your situation, it should be kept in mind that this approach has limitations. It doesn't scale that well because of the way the regex engine works, and the simple conversion of the banned list to a regex would have problems with various regex reserved characters; 'SH|T' would blow it, for instance.

A more sophisticated approach might be to keep a hash of banned words with associated hand-written regexes to match them. On the fly you could either match against each in turn, maximizing the optimizations available to the regex engine, or more simply cat them all together as you are doing here, but at least you would have the certainty of knowing the regex fragment used would be correct (as correct as you can make it).

Again, I realise this might be too much for this particular situation, but it's worth considering; you'd be surprised where bugs from this type of approach show up. The other day I was playing with HTML::TableExtract, which uses a very similar mechanism to scan for table column headers. It failed very oddly when a parenthesis or | was in the header name. Oddly enough that it took me a while to track down... ;-)

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
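
For the reserved-character problem specifically, Perl's built-in quotemeta() is probably the simplest guard when the banned entries are meant as literal strings. A minimal sketch follows; the banned list and test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical banned list; two entries contain regex metacharacters.
my @banned = ( 'smurf', 'SH|T', 'f(ork' );

# quotemeta() backslash-escapes every non-word character, so each entry
# is matched literally instead of being parsed as regex syntax.
my $alternation = join '|', map { quotemeta($_) } @banned;
my $banned_re   = qr/$alternation/;

for my $text ( 'no SH|T here', 'plain SH', 'a f(ork in the road' ) {
    printf "%-22s %s\n", $text, $text =~ $banned_re ? 'banned' : 'ok';
}
```

Without the escaping, 'SH|T' would be read as the two alternatives SH and T, and 'f(ork' would not compile at all because of the unmatched parenthesis.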
Re (tilly) 4: test if a string contains a list member by tilly (Archbishop) on Oct 21, 2001 at 19:32 UTC
There is an implementation of that method at RE (tilly) 4: SAS log scanner, along with discussion and benchmarks of various methods in that general thread.
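
The linked implementation is not reproduced here. Purely as an illustration of the "hash of banned words, match each in turn" idea from the parent post, a sketch might look like this. The %banned table, its patterns, and the banned_words_in() helper are all hypothetical names, not code from that thread.

```perl
use strict;
use warnings;

# Hypothetical table: banned word => hand-written pattern for it.
my %banned = (
    smurf => qr/\bsmurf(?:s|ed|ing)?\b/i,
    fork  => qr/\bfork(?:s|ed|ing)?\b/i,
);

# Return the banned words whose pattern matches, testing each pattern
# in turn rather than joining them into one big alternation.
sub banned_words_in {
    my ($text) = @_;
    return grep { $text =~ $banned{$_} } sort keys %banned;
}

my @hits = banned_words_in('Forked again by the Smurfs');
print "matched: @hits\n";
```

Keeping one pattern per word also means each hit can be traced back to the word that triggered it.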
Re: Re (tilly) 4: test if a string contains a list member by demerphq (Chancellor) on Oct 21, 2001 at 20:55 UTC
Re(4): test if a string contains a list member by mojotoad (Monsignor) on Nov 07, 2002 at 16:52 UTC
"The other day I was playing with HTML::TableExtract that uses a very similar mechanism to scan for table column headers. It failed very oddly when a parenthesis or | was in the header name."

For what it's worth, I do mention in the TE docs that header strings get turned into case-insensitive regular expression strings, so regexp special characters need to be escaped first.

Perhaps more insidious, however, is when people are dealing with headers where one is a substring of another. Order is important in that case. Think `m/Hubba|Hubbadandy/` and you'll see what I mean. It's not hard to fix, but I need to patch to issue a warning, since ordering of columns is a feature of the module.

Matt
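
The ordering issue is easy to reproduce outside the module. The sketch below is not how HTML::TableExtract handles it; it only shows the general effect, plus one common workaround (sorting alternatives longest-first) for cases where the order of the list carries no meaning. Variable names are made up for illustration.

```perl
use strict;
use warnings;

my @headers = ( 'Hubba', 'Hubbadandy' );
my $text    = 'Hubbadandy';

# Alternatives are tried left to right, so the shorter 'Hubba' wins
# even though the longer header is what is actually present.
my $naive = join '|', map { quotemeta($_) } @headers;
my ($naive_hit) = $text =~ /($naive)/;

# Sorting longest-first lets the more specific header match first.
my $sorted = join '|',
    map { quotemeta($_) } sort { length($b) <=> length($a) } @headers;
my ($sorted_hit) = $text =~ /($sorted)/;

print "naive:  $naive_hit\n";
print "sorted: $sorted_hit\n";
```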