The previous suggestion to use a module is fine, provided you read and understand the module documentation, so that you know what it's doing for you.
But for a case like this, a simple, direct use of the information in perlre, and the description of "qr" in perlop, would do just as well. You want to make sure that you don't get "false-alarm" matches, like having a match on a target word like "part" when the text file contains "compartment". So you want to enclose your list of words to match in parens, surrounded by word-boundaries, like this:
my @target_words = qw/list of words/;  # or read from a file, or whatever
my $joined_targets = join "|", @target_words;
my $match_regex = qr/\b($joined_targets)\b/;
for my $file ( @file_list ) {
    open( my $fh, "<", $file ) or die "Cannot open $file: $!";
    while (<$fh>) {
        if ( /$match_regex/ ) {
            my $matched_word = $1;
            store_to_db( $file, $matched_word );
            last;  # stop at the first match in this file
        }
    }
    close $fh;
}
Of course, if a target word like "work" is supposed to match on tokens like "works" and/or "worked", you should just make sure your list includes all the appropriate forms for each word.
It may be better to treat the words literally rather than as patterns:
my $joined_targets = join "|", map { quotemeta } @target_words;
As often as not, treating the strings as patterns is more sensible than treating them as literals. It depends on what the programmer wants to accomplish with a particular app, so the programmer should make this a deliberate choice for each app.
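To make the difference concrete, here is a minimal sketch that builds both kinds of regex from the same list. The word list and sample strings are made up for illustration; "a.b" is included only because it contains a metacharacter.

```perl
use strict;
use warnings;

# Hypothetical word list; "a.b" contains a regex metacharacter.
my @target_words = ( 'a.b', 'part' );

# Treated as patterns: "." matches any character.
my $as_pattern = join "|", @target_words;
my $pattern_re = qr/\b($as_pattern)\b/;

# Treated as literals: quotemeta escapes the ".".
my $as_literal = join "|", map { quotemeta } @target_words;
my $literal_re = qr/\b($as_literal)\b/;

# Sample strings, also made up for illustration.
for my $text ( 'see axb here', 'see a.b here' ) {
    my $p = $text =~ $pattern_re ? "match" : "no match";
    my $l = $text =~ $literal_re ? "match" : "no match";
    print "$text: pattern=$p, literal=$l\n";
}
```

The pattern version accepts "axb" as well as "a.b", while the quotemeta version accepts only the literal "a.b". Which behavior is right depends on where the list comes from.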
perl -MRegexp::Assemble -e '
$ra=Regexp::Assemble->new;
$ra->add($_) for qw(list of words);
local $/=undef; # slurp mode on
$re=$ra->re; # create a regex from your list of words
while(<>){
print "$1 in $ARGV\n" if $_ =~ m{($re)}
}' list of files
$ra=Regexp::Assemble->new;
$ra->add(quotemeta($_)) for qw(list of words);
$re=$ra->re;
or
$re=Regexp::List->new->list2re(qw(list of words));
since you start with a list of words, not a list of patterns.
I generally think that this is a very appropriate application for a Perl hash. Load the word list first, initializing the hash with a zero counter for each word. Then read the file, using regular expressions (or perhaps simply split) to isolate each successive word for lookup.
Once you've finished, use the final contents of the hash to update your MySQL database. In other words, there's no reason to touch the database until the file has been entirely processed. The program simply issues an UPDATE statement for each word whose count (in the hash) is non-zero.
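A rough sketch of the hash approach might look like the following. The word list and the sample text are stand-ins, the lowercasing of each token is an assumption about how the list is stored, and the final loop only prints where a real program would issue its UPDATE statements.

```perl
use strict;
use warnings;

# Hypothetical word list; in practice it might be read from a file.
my @target_words = qw/list of words/;

# Load the word list into a hash with a zero counter for each word.
my %count = map { $_ => 0 } @target_words;

# Stand-in for the contents of one text file.
my $text = "A list of words, and a second list of other things.\n";

# Isolate each word and bump its counter if it is on the list.
# Lowercasing is an assumption: the list above is all lowercase.
for my $word ( $text =~ /(\w+)/g ) {
    $count{ lc $word }++ if exists $count{ lc $word };
}

# Only words with a non-zero count would need a database update.
for my $word ( sort keys %count ) {
    print "$word: $count{$word}\n" if $count{$word};
}
```

Each lookup is a single hash access, so the cost per token does not grow with the size of the word list.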
For example, is it possible to search with one regex statement and get the matched word?
Yes. But I've no idea how that fact is useful to you. Considering you need to know for every word whether it matches or not, it doesn't help you to know that just one of them matches.
Theoretically, you could do something like
/(word1|word2|word3|...)(?{ push @matches, $^N })(*FAIL)/
but that will usually not be faster than
/word1/ && ...;
/word2/ && ...;
/word3/ && ...;
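For reference, a runnable version of the single-regex form could look like this. It is only a sketch: the sample text is made up, the word boundaries are an addition, and it needs a perl new enough to support (*FAIL) (5.10 or later).

```perl
use strict;
use warnings;

# Package variable, so the code block inside the regex can see it
# on older perls as well.
our @matches;

# Stand-in text for illustration.
my $text = "word1 and then word3";

# (*FAIL) forces backtracking after every successful alternative,
# so the code block runs once for each word found at each position.
$text =~ /\b(word1|word2|word3)\b(?{ push @matches, $^N })(*FAIL)/;

print "found: @matches\n";
```

The match as a whole always fails; the useful result is the side effect of filling @matches.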
Have you tried the Unix grep utility (if you are working on a Unix-based OS)? Using grep could remove some complexity from your code.
Vivek