in reply to Parsing Issue while using Variable name in Pattern
The definition of \b is:
(?(?<=\w)(?!\w)|(?=\w)) # \b equivalent
The corresponding definition for \B is:
(?(?<=\w)(?=\w)|(?!\w)) # \B equivalent
If, like most compassionate human beings, you prefer your regexes written for legibility and maintainability, you would write those in /x mode:
    # \b equivalent:
    (?(?<= \w)      # if there is a word character left
          (?! \w)   # then there must be no word character right
      |   (?= \w)   # else there must be a word character right
    )

    # \B equivalent:
    (?(?<= \w)      # if there is a word character left
          (?= \w)   # then there must be a word character right
      |   (?! \w)   # else there must be no word character right
    )

Which should now presumably be more scrutable.
Please note how both \b and \B alike are defined solely in terms of \w characters. There is absolutely no mention of \W in either of those definitions, let alone of ^ or $. This catches many people by surprise.
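One way to check that claim is to compare match offsets directly. The following is only a sketch: the offsets() helper and the sample strings are invented for illustration and are not from the book. It lists every offset at which the built-in assertions and their conditional equivalents match.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The two conditional patterns from above, written in /x mode.
my $b_equiv = qr{ (?(?<= \w) (?! \w) | (?= \w) ) }x;
my $B_equiv = qr{ (?(?<= \w) (?= \w) | (?! \w) ) }x;

# Hypothetical helper: collect the offsets at which a zero-width
# pattern matches, joined into one string for easy comparison.
sub offsets {
    my ($re, $str) = @_;
    my @found;
    push @found, pos($str) while $str =~ /$re/g;
    return "@found";
}

for my $str ("foo() bar!", "(foo)") {
    say "string: $str";
    say "  \\b built-in:   ", offsets(qr/\b/,   $str);
    say "  \\b equivalent: ", offsets($b_equiv, $str);
    say "  \\B built-in:   ", offsets(qr/\B/,   $str);
    say "  \\B equivalent: ", offsets($B_equiv, $str);
}
```

For "(foo)", neither form of \b matches at offset 0 or at the end of the string, because no word character touches those positions. That is the surprise mentioned above: the edge of the string counts for nothing on its own.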
Now that you know exactly how word boundaries and nonboundaries work, you can craft your own boundaries by swapping in your own condition for wherever you see \w in the patterns above. You just need to be careful to specify a fixed-width condition so that it can be used in a lookbehind. That means you can’t use things like \X or \R, which are variable-width. The easiest way to do that is to use a property or other character class. For example, you could use \p{Greek} for characters in the Greek script—but best add Inherited so you don’t miss the combining characters, so use [\p{Greek}\p{Inherited}] instead.
For example, this might provide regex subroutines suitable for that kind of work:
    (?(DEFINE)
        (?<greeklish>            [\p{Greek}\p{Inherited}]   )
        (?<ungreeklish>          [^\p{Greek}\p{Inherited}]  )
        (?<greek_boundary>
            (?(?<=      (?&greeklish))
                  (?!   (?&greeklish))
              |   (?=   (?&greeklish))
            )
        )
        (?<greek_nonboundary>
            (?(?<=      (?&greeklish))
                  (?=   (?&greeklish))
              |   (?!   (?&greeklish))
            )
        )
    )
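To experiment with a boundary like this outside a DEFINE block, one option is to interpolate a qr// character class instead of calling a regex subroutine. The sketch below takes that route, which is a substitute technique and not the one shown above. The variable names, the boundary_offsets() helper, and the sample string are all assumptions made for illustration.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# Same character class as the greeklish subroutine above.
my $greeklish = qr/[\p{Greek}\p{Inherited}]/;

my $greek_boundary = qr{
    (?(?<= $greeklish)      # if there is a Greek-ish character left
          (?! $greeklish)   # then there must be none right
      |   (?= $greeklish)   # else there must be one right
    )
}x;

# Hypothetical helper: list the offsets where the boundary matches.
sub boundary_offsets {
    my ($str) = @_;
    my @found;
    push @found, pos($str) while $str =~ /$greek_boundary/g;
    return "@found";
}

# Three Greek letters, a space, "abc", a space, three Greek letters,
# written as escapes so that the source file stays plain ASCII.
my $text = "\x{3b1}\x{3b2}\x{3b3} abc \x{3b4}\x{3b5}\x{3b6}";

say "Greek boundaries at offsets: ", boundary_offsets($text);
```

This should report offsets 0, 3, 8, and 11: the two ends of each Greek run. The Latin word in the middle produces no boundary at all, since the condition never looks at \w.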
For character classes that are the result of adding, subtracting, negating, and intersecting existing Unicode properties, like the greeklish regex subroutine is above, you might prefer to implement these as custom properties. Custom properties look just like normal properties. For example:
    sub IsGreeklish {
        return <<'END';
    +utf8::IsGreek
    +utf8::IsInherited
    END
    }
Now you may use \p{IsGreeklish} and \P{IsGreeklish} in patterns compiled in the same package as that subroutine.
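As a rough check of how such a property behaves, the sketch below classifies a few code points. The sample characters and the %sample hash are chosen here for illustration only; the IsGreeklish subroutine is the one given above.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The custom property from the text: Greek script plus Inherited.
sub IsGreeklish {
    return <<'END';
+utf8::IsGreek
+utf8::IsInherited
END
}

# Sample code points (assumed for illustration), written as escapes.
my %sample = (
    "GREEK SMALL LETTER ALPHA" => "\x{3b1}",
    "COMBINING ACUTE ACCENT"   => "\x{301}",
    "LATIN SMALL LETTER A"     => "a",
);

for my $name (sort keys %sample) {
    my $verdict = $sample{$name} =~ /\p{IsGreeklish}/
                ? "greeklish"
                : "ungreeklish";
    say "$name: $verdict";
}
```

Because the pattern is compiled in the same package that defines IsGreeklish (main, in this sketch), the lookup succeeds without any further declaration.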
Perhaps the most common custom boundary that people want to craft is the one that they thought \b was providing all along but which, as has just been demonstrated, it does not.
That is, they want a custom boundary that asserts that they are touching either whitespace or the edge of the string, in whichever direction makes sense there.
    # space boundary
    (?(?<= \S)      # if there is a nonspace character left
          (?! \S)   # then there must be no nonspace character right
      |   (?= \S)   # else there must be a nonspace character right
    )
To show how that version operates, consider this:
    #!/usr/bin/env perl

    use v5.10;
    use strict;
    use warnings;

    my $space_edge = qr{
        (?(?<= \S)      # if there is a nonspace character left
              (?! \S)   # then there must be no nonspace character right
          |   (?= \S)   # else there must be a nonspace character right
        )
    }x;

    while (<DATA>) {
        my $edges = 0;
        for my $str ( "foo", "()" ) {
            my $qstr = quotemeta($str);
            unless (/$qstr/) {
                #print "MISSING $str in: $_";
                next;
            }
            if (/${space_edge}${qstr}/) {
                $edges++;
                print "$str EDGE LEFT: $_"
            }
            if (/${qstr}${space_edge}/) {
                $edges++;
                print "$str EDGE RIGHT: $_"
            }
            unless ($edges) {
                print "$str UNEDGED: $_";
            }
        }
    }

    exit;

    __END__
    Put your foo in your pocket.
    Put your foo() in your pocket.
    Good food was had by all.
    What fools these mortals be!
    food is good to have.
    That's a major snafoo there.
    That's a major snafoo.
    That's a major snafoo
    That's a major snafoo()zle
    That's a major snafoo() there.
    That's a major snafoo()
When run, that produces this output:
    foo EDGE LEFT: Put your foo in your pocket.
    foo EDGE RIGHT: Put your foo in your pocket.
    foo EDGE LEFT: Put your foo() in your pocket.
    () EDGE RIGHT: Put your foo() in your pocket.
    foo EDGE LEFT: Good food was had by all.
    foo EDGE LEFT: What fools these mortals be!
    foo EDGE LEFT: food is good to have.
    foo EDGE RIGHT: That's a major snafoo there.
    foo UNEDGED: That's a major snafoo.
    foo EDGE RIGHT: That's a major snafoo
    foo UNEDGED: That's a major snafoo()zle
    () UNEDGED: That's a major snafoo()zle
    foo UNEDGED: That's a major snafoo() there.
    () EDGE RIGHT: That's a major snafoo() there.
    foo UNEDGED: That's a major snafoo()
    () EDGE RIGHT: That's a major snafoo()
Whether that’s quite what you’re looking for, I cannot say. But you should have enough in your armament now to craft whatever sort of boundary you might desire.
Most of the preceding text is excerpted from the section on “Building Custom Boundaries” beginning on page 308 of the just-released 4th Edition to Programming Perl.
Enjoy.