Perl regex question

anlamarama has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl regex question by graff (Chancellor) on Mar 08, 2009 at 17:25 UTC
The previous suggestion to use a module is fine, provided you read and understand the module documentation, so that you know what it's doing for you. But for a case like this, a simple, direct use of the information in perlre, and the description of "qr" in perlop, would do just as well. You want to make sure that you don't get "false-alarm" matches, like having a match on a target word like "part" when the text file contains "compartment". So you want to enclose your list of words to match in parens, surrounded by word boundaries, like this (the word list could just as well be read from a file, or whatever): `my @target_words = qw/list of words/; my $joined_targets = join "|", @target_words; my $match_regex = qr/\b($joined_targets)\b/; for my $file ( @file_list ) { open( my $fh, "<", $file ) or die "Cannot open $file: $!"; while ( <$fh> ) { if ( /$match_regex/ ) { my $matched_word = $1; store_to_db( $file, $matched_word ); last; } } close $fh; }` Of course, if a target word like "work" is supposed to match on tokens like "works" and/or "worked", you should just make sure your list includes all the appropriate forms for each word.
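To see what the word boundaries buy you, here is a minimal sketch that runs the same alternation with and without `\b`. The word list, the sample lines, and the helper `first_match` are made up for illustration; they are not part of the original code.

```perl
use strict;
use warnings;

my @target_words = qw/part work/;             # hypothetical word list
my $joined       = join '|', @target_words;
my $bounded      = qr/\b($joined)\b/;         # whole words only
my $unbounded    = qr/($joined)/;             # any substring, for contrast

# Hypothetical helper: the first target word found in $line, or undef.
sub first_match {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

for my $line ('the compartment is locked', 'one part is missing', 'it works') {
    my $whole = first_match($bounded,   $line);
    my $sub   = first_match($unbounded, $line);
    printf "%-28s bounded=%-6s unbounded=%s\n",
        $line,
        defined $whole ? $whole : '(none)',
        defined $sub   ? $sub   : '(none)';
}
```

With `\b` in place, only "one part is missing" reports a match. Without it, "compartment" and "works" match as well, which is the false alarm described above and also the reason inflected forms have to be listed explicitly.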
Re^2: Perl regex question by repellent (Priest) on Mar 08, 2009 at 18:58 UTC
It may be better to treat the words literally rather than as patterns: `my $joined_targets = join "|", map { quotemeta } @target_words;`
Re^3: Perl regex question by graff (Chancellor) on Mar 09, 2009 at 01:33 UTC
As often as not, treating the strings as patterns is more sensible than treating them as literals. It depends on what the programmer wants to accomplish with a particular app, so the programmer should make this a deliberate choice for each app.
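The difference between the two choices shows up as soon as a list entry contains a regex metacharacter. The following is only a sketch: `build_regex` and the sample entries are hypothetical, chosen to make the contrast visible.

```perl
use strict;
use warnings;

# Hypothetical helper: build a whole-word alternation from a list of entries.
# When $literal is true, each entry is escaped with quotemeta first.
sub build_regex {
    my ($literal, @words) = @_;
    my @alts   = $literal ? (map { quotemeta($_) } @words) : @words;
    my $joined = join '|', @alts;
    return qr/\b($joined)\b/;
}

my @entries    = ('work(s|ed)?', 'a.c');      # hypothetical list entries
my $as_pattern = build_regex(0, @entries);
my $as_literal = build_regex(1, @entries);

for my $text ('she worked late', 'abc', 'a.c') {
    printf "%-16s pattern=%-3s literal=%s\n",
        $text,
        ($text =~ $as_pattern ? 'yes' : 'no'),
        ($text =~ $as_literal ? 'yes' : 'no');
}
```

Treated as patterns, the entries match "worked" and "abc"; treated as literals, they match only the exact text "a.c". Neither result is wrong, which is why it should be a deliberate choice.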
Re: Perl regex question by codeacrobat (Chaplain) on Mar 08, 2009 at 15:49 UTC
You might want to look at Regexp::Assemble. `perl -MRegexp::Assemble -e '$ra=Regexp::Assemble->new; $ra->add($_) for qw(list of words); local $/=undef; $re=$ra->re; while(<>){ print "$1 in $ARGV\n" if $_ =~ m{($re)} }' list of files` Here `local $/=undef` turns on slurp mode, and `$ra->re` creates a regex from your list of words.
Re^2: Perl regex question by ikegami (Patriarch) on Mar 08, 2009 at 16:04 UTC
Not quite. That should be `$ra=Regexp::Assemble->new; $ra->add(quotemeta($_)) for qw(list of words); $re=$ra->re;` or `$re=Regexp::List->new->list2re(qw(list of words));` since you start with a list of words, not a list of patterns.
Re: Perl regex question by locked_user sundialsvc4 (Abbot) on Mar 08, 2009 at 21:01 UTC
I generally think that “this is a very appropriate application for a Perl hash.” The process can work by loading the wordlist, thereby initializing the hash, with a zero counter in each bucket. Then run through the file, using regular expressions (or perhaps simply `split`) to isolate each successive word for lookup. Once you've finished, use the final contents of the hash to update your MySQL database. In other words, “there's no reason to do this until after the file has been entirely processed.” The program simply issues `UPDATE` statements for each word whose count (in the hash) is non-zero.
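A minimal sketch of that approach might look like the following. The helper `count_words`, the word list, and the sample text are assumptions for illustration, and the database step is replaced by a `print` so the example stays self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each listed word occurs in $text.
sub count_words {
    my ($wordlist, $text) = @_;
    my %count = map { $_ => 0 } @$wordlist;   # zero counter for every word
    for my $token (split /\W+/, $text) {      # isolate each successive word
        $count{$token}++ if exists $count{$token};
    }
    return \%count;
}

my $counts = count_words(
    [qw/list of words/],
    'A list of words, and a list of files.',
);

# Only words with a non-zero count would lead to an UPDATE statement.
for my $word (sort keys %$counts) {
    next unless $counts->{$word};
    print "would UPDATE '$word' (count $counts->{$word})\n";
}
```

Because each token is an exact hash key lookup, "part" can never match inside "compartment". The lookup is case-sensitive as written; lowercasing both the list and the tokens would be one way to relax that.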
Re: Perl regex question by JavaFan (Canon) on Mar 08, 2009 at 23:06 UTC
"For example, is it possible to search with one regex statement and get the matched word?" Yes. But I've no idea how that fact is useful to you. Considering you need to know for every word whether it matches or not, it doesn't help you to know that just one of them is matching. Theoretically, you could do something like `/(word1|word2|word3|...)(?{ push @matches, $^N })(*FAIL)/` but that will usually not be faster than `/word1/ && ...; /word2/ && ...; /word3/ && ...;`
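For the curious, here is a sketch of the backtracking idiom next to a plain global match in list context, which is a different and simpler technique that also returns every hit. The word list `foo|bar|baz`, the helper names, and the sample text are hypothetical, and `\b` anchors are added so only whole words count. `(*FAIL)` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Variant 1: the backtracking idiom. The code block records each hit, then
# (*FAIL) rejects it so the engine moves on to the next position.
sub matches_by_backtracking {
    my ($text) = @_;
    our @found = ();    # package variable, shared with the regex code block
    $text =~ /\b(foo|bar|baz)\b(?{ push @found, $^N })(*FAIL)/;
    return @found;
}

# Variant 2: a global match in list context returns every capture.
sub matches_by_global {
    my ($text) = @_;
    return $text =~ /\b(foo|bar|baz)\b/g;
}

my $text = 'foo met bar, then foo left the foobar';
print join(',', matches_by_backtracking($text)), "\n";
print join(',', matches_by_global($text)), "\n";
```

Both lines print `foo,bar,foo` for this input; "foobar" is skipped because of the word boundaries. The match in variant 1 always fails by design, so its return value is useless and only the side effect on `@found` matters.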
Re: Perl regex question by targetsmart (Curate) on Mar 09, 2009 at 05:13 UTC
Have you tried using the Unix grep utility (if you are working on a Unix-based OS)? You can remove some complexity from your code if you use grep. Vivek