MonkPaul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
It has been a while, so I hope you all have a little patience.

I have written a short script to do a little text mining. I have one file that contains a list of words/terms (newline-separated), and a second file that contains a fairly substantial amount of text. I then want to take each term from the first file in turn and count the number of occurrences of that term in the second file.

The output would be something like:

Term        Number
----------------------
term1       10
term2       1
my term3    16

I am not bothered about the output as yet. The problem I do have, however, is that I am using a regular expression that does not seem to work for a term that consists of multiple words, i.e.

term1
term2
my term 3

The code I have at the moment is:

#!/usr/bin/perl -w
use warnings;
use strict;
use Getopt::Long;

my $terms = "";
my $pathway_abstracts = "";
my %word_counts;
my $count = 0;
my %term_score;
my $term_frequency_score = 0;
my $word_number = 0;

##################### GET USER INPUT ###########################################
GetOptions(
    "pathway_abstracts=s" => \$pathway_abstracts,
    "terms_file=s"        => \$terms
);

################### STORE IN ARRAYS ############################################
# store pathway abstract in arrays
open(IN, "$pathway_abstracts" ) || die "$!";
my @array_1 = <IN>;
close(IN);

# store terms in array
open(IN2, "$terms" ) || die "$!";
my @array_2 = <IN2>;
close(IN2);

#################### CREATE HASHES OF TERMS ####################################
foreach my $key (@array_2)    # assign a score of 0 to each term
{
    chomp($key);
    $term_score{$key} = 0;
}

################################################################################
print("Term\t| ");
print("Number\t| ");
print("Frequency\n");
print("----------------------------------------------------\n");

for (my $j = 0; $j < @array_2; $j++)    # loop through each search term
{
    chomp($array_2[$j]);
    my $phenotype_term = $array_2[$j];    # set the search term

    for(my $i = 0; $i < @array_1; $i++)    # loop through each line in the document
    {
        my @word_array = split(/\s/, $array_1[$i]);    # split abstracts on each word
        $word_number = $word_number + scalar(@word_array);    # find out how many words are in abstracts

        foreach my $word (@word_array)    # look through each word in current line
        {
            if($word =~ /\b\Q$phenotype_term\E\b/)    # does line contain filter term
            {
                $term_score{$array_2[$j]} = $term_score{$array_2[$j]} + 1;    # increment term count
            }
        }
    }
    $term_frequency_score = $term_score{$array_2[$j]} / $word_number;    # calculate term frequency

    print($array_2[$j]."\t ");    # print the term
    print($term_score{$array_2[$j]}."\t ");    # print the number of term occurances
    print($term_frequency_score."\n");    # print the frequency
}

Could anyone please let me know why \b or \B does not appear to work here in the regular expression:
$word =~ /\b\Q$phenotype_term\E\b/

Any help is very much appreciated.
many thanks,
MonkPaul.

Re: Regex word boundries
by ikegami (Patriarch) on Oct 18, 2007 at 21:11 UTC

    You're trying to match /\bmy term 3\b/ against each "word" (group of non-space characters) of the line.
    Even if the line contains my term 3,
    'my'   =~ /\bmy term 3\b/ will be false,
    'term' =~ /\bmy term 3\b/ will be false, and
    '3'    =~ /\bmy term 3\b/ will be false.

    Just eliminate the foreach my $word (@word_array) loop, leaving its body in place, and match against the whole line ($array_1[$i]) instead of $word.
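A minimal sketch of the point above, not taken from the original script: it compares matching the term against each whitespace-separated word with matching it against the whole line. The helper names count_per_word and count_in_line and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many "words" of a line match the term,
# which is what the inner foreach loop in the question does.
sub count_per_word {
    my ($term, $line) = @_;
    my $count = 0;
    for my $word (split /\s/, $line) {
        $count++ if $word =~ /\b\Q$term\E\b/;
    }
    return $count;
}

# Hypothetical helper: count matches of the term in the whole line.
sub count_in_line {
    my ($term, $line) = @_;
    my $count = () = $line =~ /\b\Q$term\E\b/g;
    return $count;
}

my $line = "here is my term 3 and again my term 3";

# No single word can contain the spaces in the term, so this prints 0.
print count_per_word("my term 3", $line), "\n";

# Matching against the unsplit line finds both occurrences and prints 2.
print count_in_line("my term 3", $line), "\n";
```

Note that the /g match in list context also counts several occurrences on one line, which a plain if around the match would count only once.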


    Other problems:

    $word_number isn't reset to 0 for each term as it should be.

    You don't check if the two arguments were provided.
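One way the missing argument check could look, as a sketch only. The helper name check_args and the error strings are hypothetical; GetOptionsFromArray is the Getopt::Long variant that parses a given array instead of @ARGV, used here so the check can be called with any argument list.

```perl
use strict;
use warnings;
use Getopt::Long qw( GetOptionsFromArray );

# Hypothetical helper: parse the two options and return an error message
# for a bad or missing option, or undef if both file names were given.
sub check_args {
    my @args = @_;
    my ($abstracts_file, $terms_file);
    GetOptionsFromArray(
        \@args,
        "pathway_abstracts=s" => \$abstracts_file,
        "terms_file=s"        => \$terms_file,
    ) or return "bad options";
    return "missing --pathway_abstracts" unless defined $abstracts_file;
    return "missing --terms_file"        unless defined $terms_file;
    return;
}

my $error = check_args(@ARGV);
print STDERR "usage: $0 --pathway_abstracts FILE --terms_file FILE ($error)\n"
    if defined $error;
```

In a real script you would probably die with the usage message rather than just print it.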


    Some style tips for better readability:

    @array_1 and @array_2 are meaningless names.

    $pathway_abstracts and $terms are not much better. There's not even a hint that these are merely the names of the files that contain the information instead of the information itself.

    for (my $j = 0; $j < @array_2; $j++)
    is less readable than
    for my $j (0 .. $#array_2)
    In this case, you don't even need $j, so I'd recommend
    for my $term (@array_2)

    Same for the $i loop.
    for my $line (@array_1)

    The second chomp does nothing. The first chomp already did the deed. Remember, $key in for my $key (@array) is an alias to the array element. Any change to $key will affect the array element to which it is linked.
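The aliasing can be checked with a small sketch. The helper name chomp_twice and the sample terms are made up; chomp returns the number of characters it removed, which makes the second pass easy to observe.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp every element through the loop alias, then
# chomp again and add up how many characters the second pass removed.
sub chomp_twice {
    my @lines = @_;
    for my $key (@lines) {
        chomp($key);               # $key aliases the element, so @lines changes
    }
    my $removed = 0;
    for my $key (@lines) {
        $removed += chomp($key);   # second pass: nothing left to remove
    }
    return ($removed, @lines);
}

my ($removed, @clean) = chomp_twice("term1\n", "my term 3\n");
print "removed on second pass: $removed\n";
print "[$_]\n" for @clean;
```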

    The placement of chomp is odd. I'd move it to where the array is read in. Change
    my @array_2 = <IN2>;
    to
    chomp( my @array_2 = <IN2> );

    Why have two loops iterating over @array_2? Move
    $term_score{$key} = 0;
    into the second loop.

    $x = $x + $y;
    can be written much more simply as
    $x += $y;

    scalar isn't needed when already in scalar context.


    Simplified solution:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Getopt::Long qw( GetOptions );

    # Get user input.
    my $abstracts_file;
    my $terms_file;
    GetOptions(
        "pathway_abstracts=s" => \$abstracts_file,
        "terms_file=s"        => \$terms_file,
    );

    # ...Needs error checking here...

    # Load pathway abstract.
    my $file;
    {
        open(my $fh, '<', $abstracts_file)
            or die("Unable to open abstracts file \"$abstracts_file\": $!\n");
        local $/;
        $file = <$fh>;
    }

    # Load terms into array.
    my @terms;
    {
        open(my $fh, '<', $terms_file)
            or die("Unable to open terms file \"$terms_file\": $!\n");
        chomp( @terms = <$fh> );
    }

    print("Term\t| ");
    print("Number\t| ");
    print("Frequency\n");
    print("----------------------------------------------------\n");

    # Find out how many words are in abstracts.
    my $word_count = () = split(' ', $file);

    for my $term (@terms) {
        # Count the number of times the search term matches.
        my $score = () = $file =~ /\b\Q$term\E\b/g;
        my $freq = $score / $word_count;
        print("$term\t$score\t$freq\n");
    }

    Update: Added more tips.
    Update: Added solution.

      Thanks. I changed my code after you pointed out what was wrong.

      One thing I did notice in your code (and after running it) was that the word count seems to be just 1. I changed it back to what I originally had so that it properly reports ~78,000.

      my @word_count = ();
      @word_count = split(/\s/, $file);    # Find out how many words are in abstracts.
      my $word_number = scalar(@word_count);

      I now have another problem in that some terms are still not picked up. I think this is because they contain special characters and a combination of upper and lowercase letters. I may be wrong.
      These terms include:

      adenosine-5'-triphosphate levels 0
      h(2)O(2) 0
      MPP+ 0
      -dichlorophenyl)-1,1-dimethylurea 0
      adenosine-5'-triphosphate synthesis 0
      photosynthesis, the antioxidant enzyme activities of superoxide dismutase (superoxide dismuase) (EC 0
      bcl-X(L) 0
      ca2+ 0
      adenosine-5'-triphosphate production 0
      ca(2+) 0
      mitochondrial phospholipid hydroperoxide glutathione photosynthesis, the antioxidant enzyme activities of SOD (superoxide dismuase) (EC 0
      bcl-x(L) 0
      deltapsi(m) 0
      pirin(Sm) 0
      rho(0) 0

      ...where the 0 represents the number of times the word was matched. These should all be 1+, as I initially got this data from the text file (via web service).

      Any ideas as to how to resolve this? I thought maybe using some escape character, but I have no idea how to integrate that into my original regex.

      MonkPaul

        One thing I did notice in your code (and after running it) was that the word count seems to be just 1.

        Sorry,
        my $word_count = () = split(' ', $file);
        should be
        my $word_count = split(' ', $file);
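A small sketch of the corrected word count, assuming a reasonably recent Perl where split in scalar context simply returns the number of fields. The helper name count_words and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words in a string.
sub count_words {
    my ($text) = @_;
    # split in scalar context returns the number of fields; the special
    # ' ' pattern skips leading whitespace and treats runs of whitespace
    # as a single separator.
    my $count = split(' ', $text);
    return $count;
}

print count_words("  one two\tthree\nfour  "), "\n";
```

The earlier form, my $word_count = () = split(' ', $file), goes wrong because assigning to an empty list makes split apply an implicit limit of one field. Also note that split(/\s/, ...) as used in the follow-up treats every single whitespace character as a separator, so runs of whitespace produce empty fields that inflate the count.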

        I now have another problem in that some terms are still not picked up

        \b matches between \w\W, \W\w, ^\w and \w\z. As such, the second \b won't match in 'h(2)O(2) water' =~ /\b\Qh(2)O(2)\E\b/. (The closing ")" is a \W, and so is the following space.) Perhaps this will do the trick:

        /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/

        I think the following would be faster, but because it consumes the character after the term, it would count a term repeated back to back (e.g. 'MPP+ MPP+') as one:

        /(?:\W|^)\Q$term\E(?:\W|\z)/
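To compare the three patterns side by side, here is a sketch rather than a drop-in replacement for the script. The helper names (count_with_b, count_with_lookahead, count_consuming) and the sample text are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers: count matches of $term in $text with each pattern.

# Original pattern: \b on both sides.
sub count_with_b {
    my ($term, $text) = @_;
    my $n = () = $text =~ /\b\Q$term\E\b/g;
    return $n;
}

# Suggested pattern: the character after the term is only looked at.
sub count_with_lookahead {
    my ($term, $text) = @_;
    my $n = () = $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/g;
    return $n;
}

# Variant that consumes the character after the term.
sub count_consuming {
    my ($term, $text) = @_;
    my $n = () = $text =~ /(?:\W|^)\Q$term\E(?:\W|\z)/g;
    return $n;
}

my $term = "h(2)O(2)";
my $text = "h(2)O(2) h(2)O(2) water";

print "with \\b:        ", count_with_b($term, $text), "\n";          # 0
print "with lookahead: ", count_with_lookahead($term, $text), "\n";   # 2
print "consuming:      ", count_consuming($term, $text), "\n";        # 1
```

The consuming variant finds only one occurrence here because the space between the two terms is eaten by the first match, leaving nothing for the leading (?:\W|^) of the second.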

        If you want the match to be case-insensitive, one solution is to use the i modifier on your match.
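A sketch of the i modifier combined with the pattern above. The helper name count_term_ci and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive count of $term in $text.
sub count_term_ci {
    my ($term, $text) = @_;
    my $n = () = $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/gi;
    return $n;
}

# Both spellings are counted, so this prints 2.
print count_term_ci("bcl-x(L)", "Both bcl-X(L) and BCL-x(l) were measured."), "\n";
```

Whether folding case is appropriate depends on the terms; it would also merge names that differ only by case, such as the bcl-X(L) and bcl-x(L) entries in the list above.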

Re: Regex word boundries
by duff (Parson) on Oct 18, 2007 at 21:08 UTC

    You've split your string on single whitespace characters and are then looking for multi-word items in your array of single words. You probably want to use the regular expression on the un-split string.

Re: Regex word boundries
by narainhere (Monk) on Oct 19, 2007 at 08:39 UTC
    Here you go, dude... You can refer to it before you get frustrated. BTW, I didn't go through your code, but I would suggest a "start small and sustain effort" approach... Just write the regex in a separate file and test it against all the cases. Once you have something that matches exactly, use it wherever you want...
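Testing the regex in a separate file could look something like this sketch, which uses the core Test::More module. The helper name term_matches, the pattern chosen (the lookahead version from earlier in the thread) and the test cases are all assumptions for illustration.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper wrapping the pattern under test.
sub term_matches {
    my ($term, $text) = @_;
    return $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/ ? 1 : 0;
}

# Each case is [ term, text, expected result ].
my @cases = (
    [ 'term1',     'term1 and term2',             1 ],
    [ 'my term 3', 'this is my term 3',           1 ],
    [ 'term1',     'term10',                      0 ],
    [ 'MPP+',      'treated with MPP+ overnight', 1 ],
    [ 'rho(0)',    'rho(0) cells',                1 ],
);

is( term_matches($_->[0], $_->[1]), $_->[2], "'$_->[0]' in '$_->[1]'" )
    for @cases;

done_testing();
```

Adding a line to @cases for each term that comes back with a count of 0 is a quick way to narrow down which part of the pattern is at fault.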

    The world is so big for any individual to conquer