Re: Regex word boundries

You're trying to match /\bmy term 3\b/ against each "word" (group of non-space characters) of the line.
Even if the line contains my term 3,
'my' =~ /\bmy term 3\b/ will be false,
'term' =~ /\bmy term 3\b/ will be false, and
'3' =~ /\bmy term 3\b/ will be false.

Just eliminate the foreach my $word (@word_array) loop, leaving its body in place.
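
A minimal sketch of the difference, in case it helps. The sample line and the helper name matching_words are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Count the words of $line that match $re when tested one at a time.
sub matching_words {
    my ($line, $re) = @_;
    return scalar grep { $_ =~ $re } split(' ', $line);
}

# Hypothetical sample data.
my $line = 'we looked for my term 3 in the abstract';
my $re   = qr/\bmy term 3\b/;

# No single word can contain a three-word phrase, so this prints 0.
printf "words matching: %d\n", matching_words($line, $re);

# Matching against the whole line finds the phrase.
printf "line matches: %s\n", ($line =~ $re ? 'yes' : 'no');
```

Testing the whole line (or the whole file) is what removing the inner loop amounts to.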

Other problems:

$word_count isn't reset to 0 for each term as it should be.

You don't check if the two arguments were provided.
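
One way such a check might look, assuming both options are mandatory. The helper name parse_args and the usage text are hypothetical; GetOptionsFromArray is used instead of GetOptions only so the sketch can be run with a fixed argument list.

```perl
use strict;
use warnings;
use Getopt::Long qw( GetOptionsFromArray );

# Hypothetical helper: parse the two options from a list of arguments and
# die with a usage message unless both were supplied.
sub parse_args {
    my @args = @_;
    my ($abstracts_file, $terms_file);
    GetOptionsFromArray(
        \@args,
        'pathway_abstracts=s' => \$abstracts_file,
        'terms_file=s'        => \$terms_file,
    ) or die("Bad options\n");
    defined($abstracts_file) && defined($terms_file)
        or die("usage: $0 --pathway_abstracts FILE --terms_file FILE\n");
    return ($abstracts_file, $terms_file);
}

# Example call with made-up file names.
my ($abstracts, $terms) = parse_args(
    '--pathway_abstracts', 'abstracts.txt',
    '--terms_file',        'terms.txt',
);
print "abstracts=$abstracts terms=$terms\n";
```

In a real script you would pass @ARGV instead of the literal list.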

Some style tips for better readability:

@array_1 and @array_2 are meaningless names.

$pathway_abstracts and $terms are not much better. There's not even a hint that these are merely the names of the files that contain the information instead of the information itself.

for (my $j = 0; $j < @array_2; $j++)
is less readable than
for my $j (0 .. $#array_2)
In this case, you don't even need $j, so I'd recommend
for my $term (@array_2)

Same for the $i loop.
for my $line (@array_1)

The second chomp does nothing. The first chomp already did the deed. Remember, $key in for my $key (@array) is an alias to the array element. Any change to $key will affect the array element to which it is linked.
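
A small sketch of that aliasing, using made-up data and a hypothetical helper name:

```perl
use strict;
use warnings;

# Chomp each element through the loop variable and return the result.
sub chomp_via_alias {
    my @lines = @_;
    for my $line (@lines) {
        chomp($line);   # $line is an alias, so this changes @lines itself
    }
    return @lines;
}

my @terms = chomp_via_alias("ca2+\n", "MPP+\n");
print join('|', @terms), "\n";   # prints "ca2+|MPP+"
```

Since the elements were already changed by the first loop, a second chomp over the same array has nothing left to remove.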

The placement of chomp is odd. I'd move it to where the array is read in. Change
my @array_2 = <IN2>;
to
chomp( my @array_2 = <IN2> );
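
The same idiom with a lexical filehandle, as a runnable sketch. The in-memory file and the helper name read_chomped are stand-ins for illustration only.

```perl
use strict;
use warnings;

# Read every line from a filehandle and strip the line endings in one step.
sub read_chomped {
    my ($fh) = @_;
    chomp( my @lines = <$fh> );
    return @lines;
}

# In-memory filehandle standing in for the terms file (hypothetical data).
my $data = "ca2+\nMPP+\nrho(0)\n";
open(my $fh, '<', \$data)
    or die("Unable to open in-memory file: $!\n");
my @terms = read_chomped($fh);
print scalar(@terms), " terms: @terms\n";   # prints "3 terms: ca2+ MPP+ rho(0)"
```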

Why have two loops iterating over @array_2? Move
$term_score{$key} = 0;
into the second loop.

$x = $x + $y;
can be written much more simply as
$x += $y;

scalar isn't needed when already in scalar context.

Simplified solution:

#!/usr/bin/perl -w

use strict;
use warnings;

use Getopt::Long qw( GetOptions );

# Get user input.
my $abstracts_file;
my $terms_file;
GetOptions(
   "pathway_abstracts=s" => \$abstracts_file,
   "terms_file=s"        => \$terms_file,
);
# ...Needs error checking here...

# Load pathway abstract.
my $file;
{
  open(my $fh, '<', $abstracts_file)
    or die("Unable to open abstracts file \"$abstracts_file\": $!\n");
  local $/;
  $file = <$fh>;
}

# Load terms into array.
my @terms;
{
  open(my $fh, '<', $terms_file)
    or die("Unable to open terms file \"$terms_file\": $!\n");
  chomp( @terms = <$fh> );
}

print("Term\t| ");
print("Number\t| ");
print("Frequency\n");
print("----------------------------------------------------\n");

# Find out how many words are in abstracts.
my $word_count = () = split(' ', $file);

for my $term (@terms) {
  # Count the number of times the search term matches.
  my $score = () = $file =~ /\b\Q$term\E\b/g;
  my $freq = $score / $word_count;
  print("$term\t$score\t$freq\n");
}

Update: Added more tips.
Update: Added solution.

Re^2: Regex word boundries by MonkPaul (Friar) on Oct 19, 2007 at 12:40 UTC
Thanks. I changed my code after you pointed out what was wrong. One thing I did notice in your code (after running it) was that the word count seems to be just 1. I changed it back to what I originally had so that it properly reports ~78,000.

my @word_count = ();
@word_count = split(/\s/, $file); # Find out how many words are in abstracts.
my $word_number = scalar(@word_count);

I now have another problem in that some terms are still not picked up. I think this is because they contain special characters and a combination of upper and lowercase letters. I may be wrong. These terms include:

adenosine-5'-triphosphate levels 0
h(2)O(2) 0
MPP+ 0
-dichlorophenyl)-1,1-dimethylurea 0
adenosine-5'-triphosphate synthesis 0
photosynthesis, the antioxidant enzyme activities of superoxide dismutase (superoxide dismuase) (EC 0
bcl-X(L) 0
ca2+ 0
adenosine-5'-triphosphate production 0
ca(2+) 0
mitochondrial phospholipid hydroperoxide glutathione photosynthesis, the antioxidant enzyme activities of SOD (superoxide dismuase) (EC 0
bcl-x(L) 0
deltapsi(m) 0
pirin(Sm) 0
rho(0) 0

...where the 0 represents the number of times the term was matched. These should all be 1+, as I initially got this data from the text file (via web service).

Any ideas as to how to resolve this? I thought maybe of using some escape character, but I have no idea how to integrate that into my original regex.

MonkPaul
Re^3: Regex word boundries by ikegami (Patriarch) on Oct 19, 2007 at 13:25 UTC
"One thing I did notice in your code (and after running) was that the word count seems to be just 1."

Sorry,
my $word_count = () = split(' ', $file);
should be
my $word_count = split(' ', $file);

"I now have another problem in that some terms are still not picked up"

`\b` matches between `\w\W`, `\W\w`, `^\w` and `\w\z`. As such, the second `\b` won't match in `'h(2)O(2) water' =~ /\b\Qh(2)O(2)\E\b/`. (`)` is a `\W`, and so is the following space.)

Perhaps this will do the trick:
/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/

I think the following would be faster, but it would count a repeated term as one, because the `\W` that ends one match is consumed and can no longer start the next:
/(?:\W|^)\Q$term\E(?:\W|\z)/

If you want the match to be case-insensitive, one solution is to use the `i` modifier on your match.
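
A runnable sketch of both points in this reply, using made-up sample text and hypothetical helper names. One assumption to flag: on Perl older than 5.12, split in scalar context also splits into @_ and warns, so the counts below are described for a more recent Perl.

```perl
use strict;
use warnings;

# Word count via the empty-list idiom: with no variables on the left,
# split uses an implicit limit of 1 and returns a single field.
sub words_via_empty_list {
    my ($text) = @_;
    my $count = () = split(' ', $text);
    return $count;
}

# Word count via split in scalar context, as in the correction above.
sub words_via_scalar_split {
    my ($text) = @_;
    my $count = split(' ', $text);
    return $count;
}

# Count all matches of a compiled pattern in $text.
sub count_matches {
    my ($text, $re) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

# The three patterns discussed above, built for a literal term.
sub re_word_boundary { my ($term) = @_; return qr/\b\Q$term\E\b/ }
sub re_lookahead     { my ($term) = @_; return qr/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/ }
sub re_consuming     { my ($term) = @_; return qr/(?:\W|^)\Q$term\E(?:\W|\z)/ }

printf "empty list: %d, scalar split: %d\n",
    words_via_empty_list('the cat sat'), words_via_scalar_split('the cat sat');

my $text = 'h(2)O(2) h(2)O(2) water';   # made-up sample text
my $term = 'h(2)O(2)';
printf "\\b: %d, lookahead: %d, consuming: %d\n",
    count_matches($text, re_word_boundary($term)),
    count_matches($text, re_lookahead($term)),
    count_matches($text, re_consuming($term));
```

On the sample text the `\b` version finds nothing, the lookahead version finds both occurrences, and the consuming version finds only the first, which is the "repeated term counted as one" effect.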
Re^4: Regex word boundries by MonkPaul (Friar) on Oct 29, 2007 at 15:24 UTC
Thank you. That seemed to do the trick.

I was wondering if you could possibly explain the regex you have used. I am now trying to identify one occurrence of the term in a line of text so that I can work out the inverse document frequency (IDF).

So far I have worked out that you are looking for the term using a non-capturing group, (?:pattern), i.e. (?:\W). I haven't a clue what this actually does, nor about the part after \E, (?:(?=\W). I know that (?=\W) is a look-ahead for a non-word character, but I'm not sure what the outer ?: is doing.

Cheers,
MonkPaul.
Re^5: Regex word boundries by ikegami (Patriarch) on Oct 29, 2007 at 15:55 UTC