in reply to Re: Regex word boundries
in thread Regex word boundries

Thanks. I changed my code after you pointed out what was wrong.

One thing I did notice in your code (and after running it) was that the word count seems to be just 1. I changed it back to what I originally had so that it correctly reports ~78,000.

my @word_count = ();
@word_count = split(/\s/, $file);   # Find out how many words are in abstracts.
my $word_number = scalar(@word_count);

I now have another problem in that some terms are still not picked up. I think this is because they contain special characters and a combination of upper and lowercase letters. I may be wrong.
These terms include:

adenosine-5'-triphosphate levels 0
h(2)O(2) 0
MPP+ 0
-dichlorophenyl)-1,1-dimethylurea 0
adenosine-5'-triphosphate synthesis 0
photosynthesis, the antioxidant enzyme activities of superoxide dismutase (superoxide dismuase) (EC 0
bcl-X(L) 0
ca2+ 0
adenosine-5'-triphosphate production 0
ca(2+) 0
mitochondrial phospholipid hydroperoxide glutathione photosynthesis, the antioxidant enzyme activities of SOD (superoxide dismuase) (EC 0
bcl-x(L) 0
deltapsi(m) 0
pirin(Sm) 0
rho(0) 0

...where the 0 represents the number of times the word was matched. These should all be 1+, as I initially got this data from the text file (via web service).

Any ideas as to how to resolve this? I thought maybe using some escape character, but I have no idea how to integrate that into my original regex.

MonkPaul

Re^3: Regex word boundries
by ikegami (Patriarch) on Oct 19, 2007 at 13:25 UTC

    One thing I did notice in your code (and after running it) was that the word count seems to be just 1.

    Sorry,
    my $word_count = () = split(' ', $file);
    should be
    my $word_count = split(' ', $file);
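
As far as I understand the split documentation, assigning split to an empty list implies a limit of 1, which would explain why the `() =` version reported 1. The sketch below is only an illustration with a made-up `$file`; it compares the original `split(/\s/, ...)` count with `split(' ', ...)` and with counting runs of non-whitespace. The variable names other than `$file` are my own.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the real $file (note the double space).
my $file = "alpha  beta gamma";

# split(/\s/) splits on every single whitespace character, so the
# double space yields an empty field that is counted as a "word".
my @raw   = split(/\s/, $file);
my $naive = @raw;

# split(' ') is the special whitespace mode: it skips leading whitespace
# and treats a run of whitespace as one separator.
my @words = split(' ', $file);
my $count = @words;

# Alternative: count runs of non-whitespace with a global match.
my $by_match = () = $file =~ /\S+/g;

print "naive=$naive count=$count by_match=$by_match\n";
```

If the abstracts contain blank lines or doubled spaces, the `/\s/` version will overcount, so the ~78,000 figure may be slightly high.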

    I now have another problem in that some terms are still not picked up

    \b matches between \w\W, \W\w, ^\w and \w\z. As such, the second \b won't match in 'h(2)O(2) water' =~ /\b\Qh(2)O(2)\E\b/. (")" is a \W, and so is the following space.) Perhaps this will do the trick:

    /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/
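
To make the difference concrete, here is a small sketch that runs both patterns against a made-up sentence built from terms in the question. The sample text and the `%hits` hash are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text using terms from the question.
my $text = 'MPP+ raised h(2)O(2) levels';

my %hits;
for my $term ('MPP+', 'h(2)O(2)', 'levels') {
    # Plain \b: fails when the term ends in a \W character
    # that is followed by another \W character.
    my $with_b = $text =~ /\b\Q$term\E\b/ ? 1 : 0;

    # Explicit boundaries: a \W character or the start/end of the string.
    my $explicit = $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/ ? 1 : 0;

    $hits{$term} = [$with_b, $explicit];
    print "$term: \\b=$with_b explicit=$explicit\n";
}
```

Only 'levels' matches with \b here, since it starts and ends with word characters; all three match with the explicit boundaries.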

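    I think the following would be faster, but it would count a term repeated back to back as one, because the trailing \W is consumed and can't also serve as the leading boundary of the next match: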

    /(?:\W|^)\Q$term\E(?:\W|\z)/

    If you want the match to be case-insensitive, one solution is to use the i modifier on your match.
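
A minimal sketch of case-insensitive counting, assuming you want every occurrence in a string. `count_term` is a hypothetical helper name and the sample text is made up.

```perl
use strict;
use warnings;

# count_term() is a hypothetical helper, not something from your code.
# It counts occurrences of $term in $text, ignoring case.
sub count_term {
    my ($text, $term) = @_;
    # /g finds every match, /i ignores case; "= () =" turns the list of
    # matches into a count.
    my $count = () = $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/gi;
    return $count;
}

my $text = 'Bcl-X(L) and bcl-x(L) both bind; Ca2+ rises.';
my $bcl  = count_term($text, 'bcl-x(L)');
my $ca   = count_term($text, 'ca2+');
print "bcl-x(L): $bcl, ca2+: $ca\n";
```

Note that with /i the terms bcl-X(L) and bcl-x(L) from your list become the same term, which may or may not be what you want.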

      Thank you. That seemed to do the trick.

      I was wondering if you could possibly explain the regex you have used. I am now trying to identify one occurrence of the term in a line of text so that I can work out the inverse document frequency (IDF).

      So far I have worked out that you are looking for the term, using a non-capturing means (?:pattern), i.e. (?:\W). I haven't a clue what this actually does, nor about the part after \E ..... (?:(?=\W).

      I know that the (?=\W) is a regex to look-ahead of a non-word, but not sure what the outer ?: is doing.

      cheers,
      MonkPaul.

        • Non-capturing parens ((?:...)) are just like parens ((...)) in Perl code. They alter precedence.

          # Matches strings that include "ab" or "cd"
          /ab|cd/

          # Matches strings that include "abd" or "acd".
          /a(?:b|c)d/
        • (?=...) performs a match, but leaves pos unchanged after the match.

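          local $_ = 'foo bar bar baz';
          my $term = 'bar';

          # Trailing \W is consumed, so the second "bar" has no leading boundary left.
          my $num_matches = () = /(?:\W|^)\Q$term\E(?:\W|\z)/g;
          print("$num_matches\n");  # 1: foo[ bar ]bar baz

          # The look-ahead checks the trailing \W without consuming it.
          $num_matches = () = /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/g;
          print("$num_matches\n");  # 2: foo[ bar][ bar] baz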
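
For the IDF part, you only need to know whether a line contains the term at all, so a single match without /g is enough. Below is a rough sketch, assuming each line counts as one document and assuming the common log(N/df) definition of IDF; your definition may differ. `doc_frequency` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample "documents", one per line.
my @lines = (
    'ca2+ influx raised ca2+ levels',
    'no calcium here',
    'MPP+ lowered ca2+ uptake',
    'MPP+ alone',
);

# doc_frequency() is an illustrative helper: it returns the number of
# lines that contain the term at least once (no /g, one match per line).
sub doc_frequency {
    my ($term, @docs) = @_;
    my $re = qr/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/i;
    return scalar grep { $_ =~ $re } @docs;
}

my $df = doc_frequency('ca2+', @lines);

# Assumed definition: idf = log(N / df), natural log.
# Left undefined when the term never occurs, to avoid dividing by zero.
my $idf = $df ? log(@lines / $df) : undef;

printf "df=%d idf=%.3f\n", $df, $idf if defined $idf;
```

The first line contains ca2+ twice but still contributes only 1 to the document frequency.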