Re^2: Regex word boundries

Thanks. I changed my code after you pointed out what was wrong.

One thing I did notice in your code (after running it) was that the word count seems to be just 1. I changed it back to what I originally had so that it correctly reports ~78,000 words.

my @word_count = ();
@word_count = split(/\s/, $file);      # Find out how many words are in abstracts.
my $word_number = scalar(@word_count);
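For reference, here is a minimal sketch of how the different `split` idioms count fields. The sample string is hypothetical and merely stands in for the real `$file`; the point is that `/\s/` counts an empty field for every extra whitespace character, `' '` treats a run of whitespace as one separator, and assigning `split` to an empty list imposes an implicit LIMIT of 1.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the real abstracts.
my $file = "ca(2+)  levels\nrose in  MPP+ treated cells\n";

# /\s/ splits on each single whitespace character, so a run of two
# spaces yields an empty field that is counted as a word.
my @single_ws = split(/\s/, $file);

# ' ' is split's special mode: skip leading whitespace and treat any
# run of whitespace as one separator.
my @ws_runs = split(' ', $file);

# Assigning split to an empty list makes Perl use an implicit LIMIT of 1
# (number of list variables plus one), so only one field is produced.
my $limited = () = split(' ', $file);

print "split(/\\s/): ", scalar(@single_ws), "\n";
print "split(' '):  ", scalar(@ws_runs), "\n";
print "() = split:  $limited\n";
```

On real data the first two counts may differ noticeably if the text contains double spaces or blank lines, so it may be worth checking which definition of "word" the ~78,000 figure relies on.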

I now have another problem in that some terms are still not picked up. I think this is because they contain special characters and a combination of upper and lowercase letters. I may be wrong.
These terms include:

adenosine-5'-triphosphate levels    0
h(2)O(2)    0
MPP+    0
-dichlorophenyl)-1,1-dimethylurea    0
adenosine-5'-triphosphate synthesis    0
photosynthesis, the antioxidant enzyme activities of superoxide dismutase (superoxide dismuase) (EC    0
bcl-X(L)    0
ca2+    0
adenosine-5'-triphosphate production    0
ca(2+)    0
mitochondrial phospholipid hydroperoxide glutathione photosynthesis, the antioxidant enzyme activities of SOD (superoxide dismuase) (EC    0
bcl-x(L)    0
deltapsi(m)    0
pirin(Sm)    0
rho(0)    0

...where the 0 represents the number of times the word was matched. These should all be 1+, as I initially got this data from the text file (via web service).

Any ideas as to how to resolve this? I thought maybe of using some escape character, but I have no idea how to integrate that into my original regex.

MonkPaul


Re^3: Regex word boundries by ikegami (Patriarch) on Oct 19, 2007 at 13:25 UTC
"One thing I did notice in your code (and after running) was that the word count seems to be just 1."

Sorry, `my $word_count = () = split(' ', $file);` should be `my $word_count = split(' ', $file);`

"I now have another problem in that some terms are still not picked up"

`\b` matches between `\w\W`, `\W\w`, `^\w` and `\w\z`. As such, the second `\b` won't match in `'h(2)O(2) water' =~ /\b\Qh(2)O(2)\E\b/`. (`)` is a `\W`, and so is the following space.)

Perhaps this will do the trick:

`/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/`

I think the following would be faster, but it would count two adjacent occurrences of a term as one, because the separator between them is consumed by the first match:

`/(?:\W|^)\Q$term\E(?:\W|\z)/`

If you want the match to be case-insensitive, one solution is to use the `i` modifier on your match.
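To make the difference concrete, here is a small sketch comparing `\b` with the look-ahead pattern from this reply. The helper names `matches_with_b` and `matches_term` and the sample line are hypothetical, introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the original approach, with \b on both sides.
sub matches_with_b {
    my ($text, $term) = @_;
    return $text =~ /\b\Q$term\E\b/ ? 1 : 0;
}

# Hypothetical helper: a non-word character (or string start) before the
# term, and a non-word character (or string end) after it.
sub matches_term {
    my ($text, $term) = @_;
    return $text =~ /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/ ? 1 : 0;
}

my $text = 'h(2)O(2) water and ca(2+) signalling';   # made-up sample line

for my $term ('h(2)O(2)', 'ca(2+)', 'water') {
    printf "%-10s \\b: %d  alternative: %d\n",
        $term, matches_with_b($text, $term), matches_term($text, $term);
}
```

Terms that end in a non-word character such as `)` or `+` fail the `\b` version whenever they are followed by whitespace, while an ordinary word like `water` matches either way. `\Q...\E` is what keeps the parentheses and `+` in the term from being read as regex syntax.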
Re^4: Regex word boundries by MonkPaul (Friar) on Oct 29, 2007 at 15:24 UTC
Thank you. That seemed to do the trick.

I was wondering if you could possibly explain the regex you have used. I am now trying to identify one occurrence of the term in a line of text so that I can work out the inverse document frequency (IDF).

So far I have worked out that you are looking for the term using non-capturing groups, (?:pattern), e.g. (?:\W). I haven't a clue what this actually does, nor what the part after \E does, i.e. (?:(?=\W). I know that (?=\W) is a look-ahead for a non-word character, but I'm not sure what the outer ?: is doing.

Cheers,
MonkPaul.
Re^5: Regex word boundries by ikegami (Patriarch) on Oct 29, 2007 at 15:55 UTC
Non-capturing parens (`(?:...)`) are just like parens (`(...)`) in Perl code. They alter precedence.

# Matches strings that include "ab" or "cd"
/ab|cd/

# Matches strings that include "abd" or "acd".
/a(?:b|c)d/

`(?=...)` performs a match, but leaves `pos` unchanged after the match, so the character it inspects is not consumed and remains available to the next match.

local $_ = 'foo bar bar baz';
my $term = 'bar';

my $num_matches = () = /(?:\W|^)\Q$term\E(?:\W|\z)/g;
print("$num_matches\n");  # 1: foo[ bar ]bar baz

$num_matches = () = /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/g;
print("$num_matches\n");  # 2: foo[ bar][ bar] baz
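For the "one occurrence per line" count mentioned earlier in the thread, a rough sketch might look like the following. It assumes that each line is treated as one document, which the thread does not confirm; the helper name `lines_containing` and the sample lines are hypothetical. Because only a yes/no answer per line is needed, the match is done without `/g`.

```perl
use strict;
use warnings;

# Hypothetical helper: number of lines that contain $term at least once,
# using the look-ahead pattern discussed in this thread.
sub lines_containing {
    my ($term, @lines) = @_;
    my $re = qr/(?:\W|^)\Q$term\E(?:(?=\W)|\z)/;
    return scalar grep { $_ =~ $re } @lines;
}

# Made-up sample lines, one "document" per line.
my @lines = (
    'ca(2+) influx raised ca(2+) levels',
    'no calcium here',
    'MPP+ lowered ca(2+)',
);

my $df = lines_containing('ca(2+)', @lines);
print "lines containing term: $df of ", scalar(@lines), "\n";
```

The first sample line contains the term twice but contributes only once to the count, which is the behaviour a document-frequency figure needs. Adding the `i` modifier to the `qr//` would make the count case-insensitive.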