Regexp explanation

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regexp explanation by Aragorn (Curate) on Apr 02, 2004 at 20:33 UTC
The regex you're using matches a "word character" (any alphanumeric character or underscore) or a single quote, as many of them in a row as possible. If you also want it to match a dash, you have to say so: `$string =~ /((\w|'|-)+)/g`

But that's a bit un-regex-like. A character class is the better tool in this situation: `$string =~ /([-\w']+)/g`

This also matches dashes and apostrophes at the beginnings or ends of words, which may or may not be what you want. If not, you could require that a dash or apostrophe sits between alphanumeric characters: `$string =~ /(\w+[-']?\w+)/g`

But this has the unwanted effect that a word must be at least 2 characters long. So we can add an alternation, saying that we also allow a single-character word (or number) if we can't match alphanumerics with a dash or apostrophe between them: `$string =~ /(\w+[-']?\w+|\w)/g`

A small complete test case:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

$/ = undef;                 # slurp mode: read all of DATA in one go
my $string = <DATA>;
while ($string =~ /(\w+[-']?\w+|\w)/g) {
    print "Word: <$1>\n";
}
__DATA__
This is a sentence with words that're different from other words. They have apostrophes in them (') and dashes, or dash-like characters (-).
```

Try running this code and compare the output with the sentence in the `__DATA__` section. The ultimate guide (in my opinion) on regular expressions is Jeffrey Friedl's Mastering Regular Expressions, 2nd Edition.

Arjen
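To make the differences between these patterns visible, here is a rough sketch that runs three of them over the same input. The sample string and the `words_for` helper are made up for illustration. The third pattern, `\w+(?:[-']\w+)*`, does not appear in the post above; it is one possible variation, offered as an assumption, that allows any number of interior dashes or apostrophes.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the thread.
my $text = "It's a tear-down by-the-house -- 'quoted' x";

# Hypothetical helper: collect every match of a pattern, left to right.
sub words_for {
    my ($re, $str) = @_;
    my @found;
    push @found, $1 while $str =~ /($re)/g;
    return @found;
}

my @patterns = (
    [ 'class'    => qr/[-\w']+/ ],          # dash/apostrophe allowed anywhere
    [ 'single'   => qr/\w+[-']?\w+|\w/ ],   # at most one interior dash/apostrophe
    [ 'repeated' => qr/\w+(?:[-']\w+)*/ ],  # assumption: any number of interior ones
);

for my $p (@patterns) {
    my ($name, $re) = @$p;
    printf "%-9s %s\n", $name, join(' | ', words_for($re, $text));
}
```

With this input, `class` also returns `--` and keeps the quotes around `'quoted'`, while `single` splits `by-the-house` into `by-the` and `house` because it permits only one interior dash. The `repeated` variant keeps `by-the-house` whole and drops the stray punctuation.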
Re: Re: Regexp explanation by Anonymous Monk on Apr 02, 2004 at 20:57 UTC
Thanks, I'll run the code now. There are so many possible combinations of words into a single compound word that I need to test every case.
Re: Regexp explanation by QM (Parson) on Apr 02, 2004 at 20:20 UTC
`while ($string =~ /((\w|')+)/g) {`

`(\w|')` matches either a word char or a single quote. Because the left paren here is the 2nd left paren in the regex, this captures into `$2`.

`((\w|')+)` matches one or more occurrences of `(\w|')`. Because of the first left paren, this captures into `$1`. `$2` will be the same as the last char of `$1`.

`//g` in scalar context returns the next match, so the `while` will continue until all matches are found.

If you just want to add hyphenated words, try this: `while ($string =~ /(([\w'-])+)/g) {...do something...}`

Update: To get everything surrounded by whitespace, use `while ($string =~ /((\S)+)/g) {...do something...}`, but this captures more than `\w` was doing. Or use split. Or see perlre.

-QM
--
Quantum Mechanics: The dreams stuff is made of
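The relationship between `$1` and `$2` described above can be checked with a short sketch. The input string and the `@pairs` array are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

my $string = "don't stop";   # hypothetical input

my @pairs;
while ($string =~ /((\w|')+)/g) {
    # $1 is the whole run; $2 holds only the last repetition of the inner group.
    push @pairs, [ $1, $2 ];
}

print "whole: <$_->[0]>  last: <$_->[1]>\n" for @pairs;
```

If the last character is not needed, the inner group can be made non-capturing, as in `((?:\w|')+)`, so that only `$1` is set.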
Re: Regexp explanation by duff (Parson) on Apr 02, 2004 at 20:25 UTC
You might as well use a character class, since you aren't being too discriminating about the words you match (i.e., your RE will also match a "word" like `foo'bar'baz`): `while ($string =~ /([\w'-]+)/g) { ... }`

Note that you have to put the dash either first or last in the character class so that it isn't interpreted as a range.

duff
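A small sketch of what can go wrong when the dash is not first or last. It assumes someone extends `[\w'-]` by appending a dot, which turns `'-.` into a range covering `'`, `(`, `)`, `*`, `+`, `,`, `-` and `.` in ASCII. The sample string is made up.

```perl
use strict;
use warnings;

my $sample = '(a-b), c.d';   # hypothetical input

# Dash in the middle: '-. is read as a range from ' to .
my @range   = $sample =~ /([\w'-.]+)/g;

# Dash last: it is a literal dash, alongside ' and .
my @literal = $sample =~ /([\w'.-]+)/g;

print "range:   ", join(' | ', @range),   "\n";
print "literal: ", join(' | ', @literal), "\n";
```

The first pattern swallows the parentheses and the comma along with `a-b`, while the second returns just `a-b` and `c.d`.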
Re: Regexp explanation by matija (Priest) on Apr 02, 2004 at 20:11 UTC
Your regexp would accept any combination of letters (A-Za-z in ASCII, though it depends on your locale), digits, underscores, and single quotes. If you want it to allow dashes inside words, with the words optionally surrounded by single quotes, you could change it to `/('?[\w-]+'?)/g`
Re: Re: Regexp explanation by Anonymous Monk on Apr 02, 2004 at 20:17 UTC
I want to accept compound words that have a "-" character between their parts, such as "tear-down" or "by-the-house". They don't need to be surrounded by quotes... maybe I should just get everything separated by blanks in a sentence?
Re: Re: Re: Regexp explanation by matija (Priest) on Apr 02, 2004 at 21:02 UTC
So then why did you put the quotes in your regex? Never mind. It is important to remember that Perl regexp quantifiers are greedy by default: they match as much as possible. Therefore, to match a word you only need to specify what a word is; there is no need to define what its delimiters might be. That simplifies the regexp to: `/([\w-]+)/g`
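As a rough illustration of the greediness point, the sketch below applies that pattern in list context and contrasts it with a non-greedy quantifier. The sentence is a made-up example, assumed to have had its punctuation stripped already.

```perl
use strict;
use warnings;

# Hypothetical input with punctuation already removed.
my $sentence = "We will tear-down the shed by-the-house";

# Greedy: each match extends as far as word characters and dashes continue.
my @words = $sentence =~ /([\w-]+)/g;
print "<$_>\n" for @words;

# Non-greedy, for contrast: each match stops after a single character.
my @chars = "tear-down" =~ /([\w-]+?)/g;
print scalar(@chars), " one-character matches\n";
```

The greedy version returns six words, including `tear-down` and `by-the-house` intact, without any delimiter appearing in the pattern.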
Re: Regexp explanation by Anonymous Monk on Apr 02, 2004 at 20:32 UTC
Thanks, it is working. About my RE: I did not mention that I had already removed all the punctuation marks, such as '"?/!, etc. Thanks, monks.
Re: Regexp explanation by eXile (Priest) on Apr 03, 2004 at 17:54 UTC
In addition to all the answers you already have: when learning to understand regexes, YAPE::Regex::Explain can be very useful:

```perl
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/((\w|')+)/)->explain();
__DATA__
The regular expression:

(?-imsx:((\w|')+))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (                        group and capture to \2 (1 or more times
                             (matching the most amount possible)):
----------------------------------------------------------------------
      \w                       word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )+                       end of \2 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \2)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
```