Re^4: Regex word boundries

Thank you. That seemed to do the trick.

I was wondering if you could explain the regex you used. I am now trying to identify a single occurrence of the term in a line of text so that I can work out the inverse document frequency (IDF).

So far I have worked out that you are looking for the term, using a non-capturing group (?:pattern), i.e. (?:\W). I haven't a clue what this actually does, nor what the part after \E does ..... (?:(?=\W).

I know that (?=\W) is a look-ahead for a non-word character, but I'm not sure what the outer ?: is doing.

cheers,
MonkPaul.

Replies are listed 'Best First'.

Re^5: Regex word boundries
by ikegami (Patriarch) on Oct 29, 2007 at 15:55 UTC

Non-capturing parens ((?:...)) are just like parens ((...)) in Perl code. They alter precedence.

# Matches strings that include "ab" or "cd"
/ab|cd/

# Matches strings that include "abd" or "acd".
/a(?:b|c)d/
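
To make the difference from ordinary parens concrete, here is a minimal sketch (the sample strings and variable names are made up for illustration). Both kinds of parens group the alternation; only the capturing kind saves what it matched and uses up a group number.

```perl
use strict;
use warnings;

# Capturing parens group the alternation and also save the matched text in $1.
my $captured = 'acd' =~ /a(b|c)d/ ? $1 : undef;

# Non-capturing parens group the same way but create no capture group,
# so the later pair of capturing parens is group 1.
my ($first_group) = 'acd-x' =~ /a(?:b|c)d-(\w)/;

print "captured: $captured\n";        # captured: c
print "first group: $first_group\n";  # first group: x
```

So the outer ?: in the pattern is only there to limit how far the | reaches, without creating a capture that nobody needs.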

(?=...) performs a match, but leaves pos unchanged after the match, so the text it looked at is not consumed and can start the next match.

local $_ = 'foo bar bar baz';
my $term = 'bar';

my $num_matches = () = /(?:\W|^)\Q$term\E(?:\W|\z)/g;
print("$num_matches\n");
# 1: foo[ bar ]bar baz

$num_matches = () = /(?:\W|^)\Q$term\E(?:(?=\W)|\z)/g;
print("$num_matches\n");
# 2: foo[ bar][ bar] baz
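
For the IDF goal mentioned in the question, only "does this line contain the term at least once" matters, so a count per line is not needed. The following is a rough sketch under assumptions of my own: @docs is invented sample data with one string per document, IDF is taken as log(N / df), and the boundaries are written as the look-arounds (?<!\w) and (?!\w), which are zero-width on both sides.

```perl
use strict;
use warnings;

# Hypothetical sample data: each string stands in for one document (line).
my @docs = ('foo bar bar baz', 'bar', 'foo baz', 'rebar bars');
my $term = 'bar';

# Without /g the match stops at the first occurrence, which is all that
# document frequency needs. Neither look-around consumes a character.
# grep in scalar context returns the number of matching documents.
my $df = grep { /(?<!\w)\Q$term\E(?!\w)/ } @docs;

# One common IDF definition (an assumption here): log(N / df).
# Guard against a term that appears in no document.
my $idf = $df ? log(scalar(@docs) / $df) : 0;

print "df=$df idf=$idf\n";  # df is 2 for this sample data
```

"rebar" and "bars" are not counted because the term has a word character on one side in each case.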