anlamarama has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to solve a problem with perl.

I have lots of text files and a word list with more than 100 words.

I will search for every word in every text file. If a word matches (there is only one matching word in each text file), I will record it in a MySQL db. So, I will search for word1 in text1, then word2 in text1, then word3 in text1 (it matched and I recorded it). After that, word1 in text2, word2 in text2 (it matched ...) and so on.

What is the best way to do so? My way is to put all the words in an array, then search for each word in a for loop.

Is there any shorter/different way? For example, is it possible to search with one regex statement and get the matched word?

For instance, can I use the or (|) character (like word1|word2|word3|word4) and get the matched word? If it is possible, how can I get it?

Thanks in advance,

Replies are listed 'Best First'.
Re: Perl regex question
by graff (Chancellor) on Mar 08, 2009 at 17:25 UTC
    The previous suggestion to use a module is fine, provided you read and understand the module documentation, so that you know what it's doing for you.

    But for a case like this, a simple, direct use of the information in perlre, and the description of "qr" in perlop, would do just as well. You want to make sure that you don't get "false-alarm" matches, like having a match on a target word like "part" when the text file contains "compartment". So you want to enclose your list of words to match in parens, surrounded by word-boundaries, like this:

    my @target_words = qw/list of words/;  # or read from a file, or whatever
    my $joined_targets = join "|", @target_words;
    my $match_regex = qr/\b($joined_targets)\b/;
    for my $file ( @file_list ) {
        open( my $fh, "<", $file ) or die "Cannot open $file: $!";
        while (<$fh>) {
            if ( /$match_regex/ ) {
                my $matched_word = $1;
                store_to_db( $file, $matched_word );
                last;  # only one matching word is expected per file
            }
        }
        close $fh;
    }
    Of course, if a target word like "work" is supposed to match on tokens like "works" and/or "worked", you should just make sure your list includes all the appropriate forms for each word.
      It may be better to treat the words literally rather than as patterns:
      my $joined_targets = join "|", map { quotemeta } @target_words;
        As often as not, treating the strings as patterns is more sensible than treating them as literals. It depends on what the programmer wants to accomplish with a particular app, so the programmer should make this a deliberate choice for each app.
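
The difference between the two choices can be seen in a small sketch. The word list and the text below are made up for illustration; only `quotemeta` and the `\b(...)\b` construction come from the replies above.

```perl
use strict;
use warnings;

# Hypothetical word list containing a regex metacharacter (the dot).
my @target_words = qw/3.14 part/;
my $text = "The compartment holds version 3x14 of the part.";

# Treated as patterns: "." matches any character, so "3.14" can match "3x14".
my $as_pattern = join "|", @target_words;
# Treated as literals: quotemeta escapes the dot.
my $as_literal = join "|", map { quotemeta } @target_words;

my ($pattern_hit) = $text =~ /\b($as_pattern)\b/;
my ($literal_hit) = $text =~ /\b($as_literal)\b/;

print "pattern: $pattern_hit\n";
print "literal: $literal_hit\n";
```

In both cases the word boundaries keep "part" from matching inside "compartment"; only the handling of the dot differs.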
Re: Perl regex question
by codeacrobat (Chaplain) on Mar 08, 2009 at 15:49 UTC
    You might want to look at Regexp::Assemble.
    perl -MRegexp::Assemble -e '
        $ra=Regexp::Assemble->new;
        $ra->add($_) for qw(list of words);
        local $/=undef;  # slurp mode on
        $re=$ra->re;     # create a regex from your list of words
        while(<>){ print "$1 in $ARGV\n" if $_ =~ m{($re)} }' list of files

    print+qq(\L@{[ref\&@]}@{['@'x7^'!#2/"!4']});

      Not quite. That should be

      $ra=Regexp::Assemble->new;
      $ra->add(quotemeta($_)) for qw(list of words);
      $re=$ra->re;

      or

      $re=Regexp::List->new->list2re(qw(list of words));

      since you start with a list of words, not a list of patterns.

Re: Perl regex question
by locked_user sundialsvc4 (Abbot) on Mar 08, 2009 at 21:01 UTC

    I generally think that “this is a very appropriate application for a Perl hash.” The process can work by loading the wordlist, thereby initializing the hash, with a zero counter in each bucket. Then run the file, using regular-expressions (or perhaps simply split) to isolate each successive word for lookup.

    Once you've finished, use the final contents of the hash to update your MySQL database. In other words, “there's no reason to do this until after the file has been entirely processed.” The program simply issues UPDATE statements for each word whose count (in the hash) is non-zero.
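
A minimal sketch of the hash approach described above. The word list, the file names and the in-memory texts are hypothetical stand-ins for the real files, and the `print` marks the spot where the MySQL update would go.

```perl
use strict;
use warnings;

# Hypothetical word list and file contents, stand-ins for the real data.
my @wordlist = qw/apple banana cherry/;
my %texts = (
    'text1.txt' => "I ate a banana today.\nBananas are not in the list.",
    'text2.txt' => "Nothing of interest here.",
);

my %found;    # file name => [ words from the list seen in that file ]
for my $file ( sort keys %texts ) {
    # Initialise the hash with a zero counter for every word.
    my %count = map { $_ => 0 } @wordlist;

    # Isolate each token with split and look it up in the hash.
    for my $token ( split /\W+/, $texts{$file} ) {
        $count{$token}++ if exists $count{$token};
    }

    # After the whole text is processed, keep the words with non-zero counts.
    $found{$file} = [ grep { $count{$_} > 0 } @wordlist ];
    print "$file: @{ $found{$file} }\n";
}
```

The lookup is an exact, case-sensitive comparison, so "Bananas" does not count as "banana" here; lowercase the tokens or list the extra forms if that matters.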

Re: Perl regex question
by JavaFan (Canon) on Mar 08, 2009 at 23:06 UTC
    For example, is it possible to search with one regex statement and get the matched word?
    Yes. But I've no idea why that fact is useful to you. Considering you need to know for every word whether it matches or not, it doesn't help you to know that just one of them matches.

    Theoretically, you could do something like

    /(word1|word2|word3|...)(?{ push @matches, $^N })(*FAIL)/
    but that will usually not be faster than
    /word1/ && ...; /word2/ && ...; /word3/ && ...;
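
The one-match-per-word form can also be written as a loop over the list. This is only a sketch with a hypothetical word list and text; `\Q...\E` is used on the assumption that the words are literal strings rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical word list and text.
my @words = qw/word1 word2 word3/;
my $text  = "this text mentions word3 and also word1";

# One simple match per word; \Q...\E treats each word as a literal string.
my @matches = grep { $text =~ /\Q$_\E/ } @words;

print "@matches\n";
```

The result keeps the order of the word list, not the order in which the words appear in the text.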
Re: Perl regex question
by targetsmart (Curate) on Mar 09, 2009 at 05:13 UTC
    Have you tried using the Unix grep utility (if you are working on a Unix-based OS)? You can remove some complexity from your code if you use grep.

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.