Regex word boundries

MonkPaul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
It has been a while, so I hope you all have a little patience.

I have written a short script to do a little text mining. I have one file that contains a list of words/terms (newline-separated), and a second file that contains a fairly substantial amount of text. I want to take each term from the first file in turn and count the number of occurrences of that term in the second file.

The output would be something like:

Term         Number
----------------------
term1          10
term2           1
my term3       16

I am not bothered about the output as yet. The problem I do have, however, is that the regular expression I am using does not seem to work for a term that consists of multiple words, e.g. the last term here:

term1
term2
my term 3

The code I have at the moment is:

#!/usr/bin/perl -w

use warnings;
use strict;
use Getopt::Long;

my $terms = "";
my $pathway_abstracts = "";
my %word_counts;
my $count = 0;
my %term_score;
my $term_frequency_score = 0;
my $word_number = 0;


##################### GET USER INPUT ###########################################

GetOptions( "pathway_abstracts=s" => \$pathway_abstracts,
            "terms_file=s" => \$terms
          );


################### STORE IN ARRAYS ############################################
# store pathway abstract in arrays
open(IN, "$pathway_abstracts" ) || die "$!";
my @array_1 = <IN>;
close(IN);

# store terms in array
open(IN2, "$terms" ) || die "$!";
my @array_2 = <IN2>;
close(IN2);


#################### CREATE HASHES OF TERMS ####################################
foreach my $key (@array_2)    # assign a score of 0 to each term
{
  chomp($key);
  $term_score{$key} = 0;
}


################################################################################

print("Term\t| ");
print("Number\t| ");
print("Frequency\n");
print("----------------------------------------------------\n");


for (my $j = 0; $j < @array_2; $j++)        # loop through each search term
{
  chomp($array_2[$j]);
  my $phenotype_term = $array_2[$j];         # set the search term

  for(my $i = 0; $i < @array_1; $i++)        # loop through each line in the document
  {
    my @word_array = split(/\s/, $array_1[$i]);         # split abstracts on each word
    $word_number = $word_number + scalar(@word_array);      # find out how many words are in abstracts

    foreach my $word (@word_array)          # look through each word in current line
    {
      if($word =~ /\b\Q$phenotype_term\E\b/)   #  does line contain filter term
      {
        $term_score{$array_2[$j]} = $term_score{$array_2[$j]} + 1;    # increment term count
      }
    }
  }
  $term_frequency_score = $term_score{$array_2[$j]} / $word_number;   # calculate term frequency

  print($array_2[$j]."\t  ");                     # print the term
  print($term_score{$array_2[$j]}."\t  ");        # print the number of term occurrences
  print($term_frequency_score."\n");              # print the frequency
}

Could anyone please let me know why the \b or \B do not appear to work here in the regular expression:
$word =~ /\b\Q$phenotype_term\E\b/

Any help is very much appreciated.
many thanks,
MonkPaul.


Replies are listed 'Best First'.
Re: Regex word boundries by ikegami (Patriarch) on Oct 18, 2007 at 21:11 UTC
You're trying to match `/\bmy term 3\b/` against each "word" (group of non-space characters) of the line. Even if the line contains `my term 3`, `'my' =~ /\bmy term 3\b/` will be false, `'term' =~ /\bmy term 3\b/` will be false, and `'3' =~ /\bmy term 3\b/` will be false. Just eliminate the `foreach my $word (@word_array)` loop, leaving its body in place, so that the pattern is matched against the whole line rather than against `$word`.

Other problems:

- `$word_number` isn't reset to 0 for each term as it should be.
- You don't check whether the two arguments were provided.

Some style tips for better readability:

- `@array_1` and `@array_2` are meaningless names. `$pathway_abstracts` and `$terms` are not much better. There's not even a hint that these are merely the names of the files that contain the information, rather than the information itself.
- `for (my $j = 0; $j < @array_2; $j++)` is less readable than `for my $j (0 .. $#array_2)`. In this case you don't even need `$j`, so I'd recommend `for my $term (@array_2)`. The same goes for the `$i` loop: `for my $line (@array_1)`.
- The second `chomp` does nothing. The first `chomp` already did the deed. Remember, `$key` in `for my $key (@array)` is an alias to the array element. Any change to `$key` will affect the array element to which it is linked.
- The placement of `chomp` is odd. I'd move it to where the array is read in. Change `my @array_2 = <IN2>;` to `chomp( my @array_2 = <IN2> );`
- Why have two loops iterating over `@array_2`? Move `$term_score{$key} = 0;` into the second loop.
- `$x = $x + $y;` can be written much more simply as `$x += $y;`
- `scalar` isn't needed when already in scalar context.

Simplified solution:

#!/usr/bin/perl -w

use strict;
use warnings;

use Getopt::Long qw( GetOptions );

# Get user input.
my $abstracts_file;
my $terms_file;
GetOptions(
    "pathway_abstracts=s" => \$abstracts_file,
    "terms_file=s"        => \$terms_file,
);

# ...Needs error checking here...

# Load pathway abstract.
my $file;
{
    open(my $fh, '<', $abstracts_file)
        or die("Unable to open abstracts file \"$abstracts_file\": $!\n");
    local $/;
    $file = <$fh>;
}

# Load terms into array.
my @terms;
{
    open(my $fh, '<', $terms_file)
        or die("Unable to open terms file \"$terms_file\": $!\n");
    chomp( @terms = <$fh> );
}

print("Term\t| ");
print("Number\t| ");
print("Frequency\n");
print("----------------------------------------------------\n");

# Find out how many words are in abstracts.
my $word_count = () = split(' ', $file);

for my $term (@terms) {
    # Count the number of times the search term matches.
    my $score = () = $file =~ /\b\Q$term\E\b/g;
    my $freq = $score / $word_count;
    print("$term\t$score\t$freq\n");
}

Update: Added more tips.
Update: Added solution.
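As a minimal sketch of the difference between the per-word loop in the question and the whole-text match suggested in this reply, the snippet below counts one multi-word term both ways. The sample line and term are made up for illustration and are not taken from the poster's files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical one-line "abstract" and search term, made up for illustration.
my $line = 'here my term 3 appears, and my term 3 again';
my $term = 'my term 3';

# Per-word approach: test the pattern against each whitespace-separated word.
# grep in scalar context returns the number of words that matched.
my $per_word = grep { /\b\Q$term\E\b/ } split ' ', $line;

# Whole-text approach: match globally against the unsplit text. Assigning
# the list of matches to () and then to a scalar yields the match count.
my $whole = () = $line =~ /\b\Q$term\E\b/g;

print "per word: $per_word, whole text: $whole\n";   # per word: 0, whole text: 2
```

No single word can ever contain the spaces in the term, so the per-word count stays at zero however often the term appears in the line.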
Re^2: Regex word boundries by MonkPaul (Friar) on Oct 19, 2007 at 12:40 UTC
Thanks. I changed my code after you pointed out what was wrong.

One thing I did notice in your code (after running it) was that the word count seems to be just 1. I changed it back to what I originally had, so that it properly comes to ~78,000.

my @word_count = ();
@word_count = split(/\s/, $file);    # Find out how many words are in abstracts.
my $word_number = scalar(@word_count);

I now have another problem in that some terms are still not picked up. I think this is because they contain special characters and a combination of upper and lowercase letters. I may be wrong. These terms include:

adenosine-5'-triphosphate levels  0
h(2)O(2)  0
MPP+  0
-dichlorophenyl)-1,1-dimethylurea  0
adenosine-5'-triphosphate synthesis  0
photosynthesis, the antioxidant enzyme activities of superoxide dismutase (superoxide dismuase) (EC  0
bcl-X(L)  0
ca2+  0
adenosine-5'-triphosphate production  0
ca(2+)  0
mitochondrial phospholipid hydroperoxide glutathione photosynthesis, the antioxidant enzyme activities of SOD (superoxide dismuase) (EC  0
bcl-x(L)  0
deltapsi(m)  0
pirin(Sm)  0
rho(0)  0

...where the 0 represents the number of times the term was matched. These should all be 1 or more, as I initially got this data from the text file (via web service).

Any ideas as to how to resolve this? I thought maybe using some escape character, but I have no idea how to integrate that into my original regex.

MonkPaul
Re^3: Regex word boundries by ikegami (Patriarch) on Oct 19, 2007 at 13:25 UTC
"One thing I did notice in your code (and after runnning) was that the word count seems to be just 1."

Sorry, `my $word_count = () = split(' ', $file);` should be `my $word_count = split(' ', $file);`

"I now have another problem in that some terms are still not picked up"

`\b` matches between `\w\W`, `\W\w`, `^\w` and `\w\z`. As such, the second `\b` won't match in `'h(2)O(2) water' =~ /\b\Qh(2)O(2)\E\b/`. (`)` is a `\W`, and so is the following space.) Perhaps this will do the trick:

/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/

I think the following would be faster, but it would count a term repeated back to back as one, because the separator between the two occurrences is consumed by the first match:

/(?:\W|^)\Q$term\E(?:\W|\z)/

If you want the match to be case-insensitive, one solution is to use the `i` modifier on your match.
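The two points in this reply can be checked with a short sketch. It first shows the word-count difference between `() = split` and scalar-context `split`, then compares `\b` with the two alternative patterns on terms that end in a non-word character. The sample text and the helper name `count_matches` are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, made up for illustration.
my $text = 'h(2)O(2) raises ca2+ ca2+ levels';

# Assigning split's result to an empty list makes split use a limit of 1,
# so only one field comes back; scalar-context split returns the field count.
my $limited = () = split(' ', $text);
my $words   = split(' ', $text);
print "limited: $limited, words: $words\n";   # limited: 1, words: 5

# Hypothetical helper: count global matches of a compiled pattern.
sub count_matches {
    my ($re, $str) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

for my $term ('h(2)O(2)', 'ca2+') {
    # \b needs a word character on one side, so it fails after ")" or "+".
    my $plain     = count_matches(qr/\b\Q$term\E\b/, $text);
    # Trailing lookahead: the separator after the term is not consumed.
    my $lookahead = count_matches(qr/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/, $text);
    # Trailing \W is consumed, so back-to-back repeats are counted once.
    my $consuming = count_matches(qr/(?:\W|^)\Q$term\E(?:\W|\z)/, $text);
    print "$term: \\b=$plain lookahead=$lookahead consuming=$consuming\n";
}
# h(2)O(2): \b=0 lookahead=1 consuming=1
# ca2+: \b=0 lookahead=2 consuming=1
```

On this sample, only the lookahead form counts both adjacent occurrences of `ca2+`, which is the trade-off the reply describes.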
Re^4: Regex word boundries by MonkPaul (Friar) on Oct 29, 2007 at 15:24 UTC
Re^5: Regex word boundries by ikegami (Patriarch) on Oct 29, 2007 at 15:55 UTC
Re: Regex word boundries by duff (Parson) on Oct 18, 2007 at 21:08 UTC
You've split your string on single whitespace characters and are then looking for multi-word items in your array of single words. You probably want to use the regular expression on the unsplit string.

duff
Re: Regex word boundries by narainhere (Monk) on Oct 19, 2007 at 08:39 UTC
Here you go, dude... You can refer to it before you get frustrated. BTW, I didn't go through your code, but I would suggest a "start small and sustain effort" approach. Just test the regex in a separate file against all the cases. Once something matches exactly, use it wherever you want.

The world is so big for any individual to conquer
